[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-35652.
---------------------------------
Fix Version/s: 3.1.3, 3.2.0
Resolution: Fixed
> Different Behaviour join vs joinWith in self joining
> ----------------------------------------------------
>
> Key: SPARK-35652
> URL: https://issues.apache.org/jira/browse/SPARK-35652
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Environment: Spark 3.1.2, Scala 2.12
> Reporter: Wassim Almaaoui
> Assignee: dgd_contributor
> Priority: Critical
> Fix For: 3.2.0, 3.1.3
>
>
> It seems that, for a self-join, Spark performs a cartesian join when using
> `joinWith`, but a proper inner join when using `join`.
> Snippet:
>
> {code:java}
> scala> val df = spark.range(0,5)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> df.show
> +---+
> | id|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> +---+
> scala> df.join(df, df("id") === df("id")).count
> 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
> res21: Long = 5
> scala> df.joinWith(df, df("id") === df("id")).count
> 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
> res22: Long = 25
> {code}
> According to the comment in the source code, joinWith is expected to handle
> this case, right?
> {code:java}
> def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
>   // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
>   // etc.
> {code}
> I find it odd that join and joinWith do not have the same behaviour.
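
The WARN line in the snippet suggests using aliases. As a minimal sketch of that suggestion (not the fix that was committed for this issue), the self-join below gives each side its own alias, so the condition is no longer the trivially true `id = id`. The object name `AliasedSelfJoin`, the helper `selfJoinCounts`, and the aliases `l` and `r` are hypothetical names chosen for illustration; a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical illustration: alias both sides of a self-join so that the
// join condition refers to two distinct relations.
object AliasedSelfJoin {

  // Returns (count from join, count from joinWith) for an aliased self-join.
  def selfJoinCounts(spark: SparkSession): (Long, Long) = {
    val df = spark.range(0, 5)

    // Distinct aliases for the left and right side of the self-join.
    val left = df.as("l")
    val right = df.as("r")

    // The condition compares l.id with r.id instead of id with itself.
    val condition = col("l.id") === col("r.id")

    val joinCount = left.join(right, condition).count()
    val joinWithCount = left.joinWith(right, condition).count()
    (joinCount, joinWithCount)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aliased-self-join")
      .getOrCreate()
    try {
      val (joinCount, joinWithCount) = selfJoinCounts(spark)
      println(s"join: $joinCount, joinWith: $joinWithCount")
    } finally {
      spark.stop()
    }
  }
}
```

With explicit aliases, both `join` and `joinWith` should behave as an inner join on `id`, so the two counts are expected to agree (5 rows each for `spark.range(0, 5)`).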