Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/22889#discussion_r230571437
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
     join(right, Seq(usingColumn))
   }
+  /**
+   * Equi-join with another `DataFrame` using the given column.
+   *
+   * Different from other join functions, the join column will only appear once in the output,
+   * i.e. similar to SQL's `JOIN USING` syntax.
+   *
+   * {{{
+   *   // Left join of df1 and df2 using the column "user_id"
+   *   df1.join(df2, "user_id", "left")
+   * }}}
+   *
+   * @param right Right side of the join operation.
+   * @param usingColumn Name of the column to join on. This column must exist on both sides.
+   * @param joinType Type of join to perform. Default `inner`. Must be one of:
+   *                 `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
+   *                 `right`, `right_outer`, `left_semi`, `left_anti`.
+   * @note If you perform a self-join using this function without aliasing the input
+   * `DataFrame`s, you will NOT be able to reference any columns after the join, since
+   * there is no way to disambiguate which side of the join you would like to reference.
+   * @group untypedrel
+   */
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
--- End diff ---
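For reference, the aliasing workaround for the self-join caveat in the `@note` above looks roughly like this; the data and alias names here are illustrative, not from the PR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// A minimal sketch of the self-join caveat; setup and names are illustrative.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Aliasing both sides keeps the non-key columns addressable after the join;
// without the aliases, col("name") below would be ambiguous.
val joined = people.as("l").join(people.as("r"), Seq("id"), "inner")
joined.select(col("id"), col("l.name"), col("r.name")).show()
```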
@arman1371 . We understand that this PR is trying to add syntactic sugar, but the existing general API needs only five extra characters, `Seq(` and `)`, as shown in the sketch below. Personally, I agree with @wangyum; I prefer not to add this.
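For example, a minimal sketch assuming two hypothetical `DataFrame`s `df1` and `df2` that both carry a `user_id` column:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup; the SparkSession and the df1/df2 data are not from the PR.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")
val df2 = Seq((1, 100)).toDF("user_id", "score")

// The proposed df1.join(df2, "user_id", "left") is already expressible
// with the existing Seq[String] overload:
val joined = df1.join(df2, Seq("user_id"), "left")
joined.show()
```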
Historically,
1. Spark 1.4 added the `Seq[String]` version to support PySpark (SPARK-7990).
2. Spark 1.6 added the join type parameter to the `Seq[String]` version (SPARK-10446).
That was a long time ago. Given that, I guess the Apache Spark community intentionally didn't add the `String` version, in order to keep the number of `Dataset` APIs small. Anyway, since you need an answer, let's ask for general opinions again and make a decision.
Hi, @rxin, @cloud-fan, @gatorsmile. Did we explicitly decide not to add this API? It seems that @arman1371 wants to add it for feature parity with PySpark in Spark 3.0.0.
---