[GitHub] [spark] brandondahler commented on a change in pull request #33323: [SPARK-35739][SQL] Add Java-compatible Dataset.join overloads

GitBox Mon, 16 Aug 2021 07:05:38 -0700


brandondahler commented on a change in pull request #33323:
URL: https://github.com/apache/spark/pull/33323#discussion_r689567804




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -970,7 +995,59 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Equi-join with another `DataFrame` using the given columns. A cross join with a predicate
+   * Equi-join with another `DataFrame` using the given column. A cross join with a predicate
+   * is specified as an inner join. If you would explicitly like to perform a cross join use the
+   * `crossJoin` method.
+   *
+   * Different from other join functions, the join column will only appear once in the output,
+   * i.e. similar to SQL's `JOIN USING` syntax.
+   *
+   * @param right Right side of the join operation.
+   * @param usingColumn Name of the column to join on. This column must exist on both sides.
+   * @param joinType Type of join to perform. Default `inner`. Must be one of:
+   *                 `inner`, `cross`, `outer`, `full`, `fullouter`, `full_outer`, `left`,
+   *                 `leftouter`, `left_outer`, `right`, `rightouter`, `right_outer`,
+   *                 `semi`, `leftsemi`, `left_semi`, `anti`, `leftanti`, left_anti`.
+   *
+   * @note If you perform a self-join using this function without aliasing the input
+   * `DataFrame`s, you will NOT be able to reference any columns after the join, since
+   * there is no way to disambiguate which side of the join you would like to reference.
+   *
+   * @group untypedrel
+   * @since 3.3.0
+   */
+  def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame = {
+    join(right, Seq(usingColumn), joinType)
+  }
+
+  /**
+   * (Java-specific) Equi-join with another `DataFrame` using the given columns. A cross join with

Review comment:
   I'm not familiar with Scala in general. I tried experimenting with Scaladoc's `@see` link formatting, but I couldn't get it to actually link to the other overloaded method.
   
   I used these links for reference and tried the various formats they suggest, but IntelliJ doesn't seem to accept any of them (presuming that IntelliJ's implementation processes them correctly):
   * https://docs.scala-lang.org/overviews/scaladoc/for-library-authors.html
   * https://stackoverflow.com/questions/53850885/how-to-use-see-scaladoc
   * https://github.com/scala/scala/blob/2.12.x/test/scaladoc/resources/links.scala
   
   I'm happy to make the change if anyone knows the correct syntax for referring from `Dataset#join(Dataset[_], Array[String], String): DataFrame` to `Dataset#join(Dataset[_], Seq[String], String): DataFrame`.
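
For reference, here is a minimal sketch of the overload-link form that the Scaladoc guide for library authors describes: the member name followed by its parameter list written as `name:Type` pairs without spaces, ending in `*`. The `Table` class is a hypothetical stand-in for `Dataset`, not Spark code, and the link has not been verified against Spark's doc build or IntelliJ's renderer. In a real signature, any dots inside type names would also need to be escaped as `\.`.

```scala
/** Hypothetical stand-in for Dataset, used only to illustrate the doc link. */
final class Table(val columns: Seq[String]) {

  /**
   * Toy equi-join: the join columns appear once, followed by the remaining
   * columns of the left side and then of the right side.
   * `joinType` is accepted but ignored in this sketch.
   */
  def join(right: Table, usingColumns: Seq[String], joinType: String): Table = {
    val leftRest = columns.filterNot(usingColumns.contains)
    val rightRest = right.columns.filterNot(usingColumns.contains)
    new Table(usingColumns ++ leftRest ++ rightRest)
  }

  /**
   * (Java-specific) Same join, taking the column names as an array.
   *
   * The link below is the assumed Scaladoc syntax for pointing at the
   * `Seq[String]` overload.
   *
   * @see [[join(right:Table,usingColumns:Seq[String],joinType:String)*]]
   */
  def join(right: Table, usingColumns: Array[String], joinType: String): Table =
    join(right, usingColumns.toSeq, joinType)
}

object OverloadLinkDemo {
  def main(args: Array[String]): Unit = {
    val left = new Table(Seq("id", "a"))
    val right = new Table(Seq("id", "b"))
    // An Array argument resolves to the Java-friendly overload.
    val joined = left.join(right, Array("id"), "inner")
    println(joined.columns.mkString(", ")) // prints "id, a, b"
  }
}
```

Whether the link resolves is only visible in the generated Scaladoc output, so the doc build is the place to confirm it rather than the IDE preview.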
   
   ---
   
   All that being said, as a user of the library I'd prefer that the docs not require an extra level of indirection in this case, because these docs don't convey that much information. The biggest section is the `@param joinType`, and that's only because it lists all the valid values.



