Wassim Almaaoui created SPARK-35652:
---------------------------------------

             Summary: Different Behaviour join vs joinWith in self joining
                 Key: SPARK-35652
                 URL: https://issues.apache.org/jira/browse/SPARK-35652
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.2
         Environment: {color:#172b4d}Spark 3.1.2{color}

Scala 2.12
            Reporter: Wassim Almaaoui


It seems like spark inner join is performing a cartesian join in self joining 
using `joinWith` and an inner join using `join` 

Snippet:

 
{code:java}
scala> val df = spark.range(0,5) 
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.show 
+---+ 
| id| 
+---+ 
| 0|
| 1| 
| 2| 
| 3| 
| 4| 
+---+ 

scala> df.join(df, df("id") === df("id")).count 
21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 
'id#1649L = id#1649L'. Perhaps you need to use aliases. 
res21: Long = 5

scala> df.joinWith(df, df("id") === df("id")).count
21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 
'id#1649L = id#1649L'. Perhaps you need to use aliases. 
res22: Long = 25

{code}
According to the comment in code source, joinWith is expected to manage this 
case, right?
{code:java}
def joinWith[U](other: Dataset[U], condition: Column, joinType: String): 
Dataset[(T, U)] = {
    // Creates a Join node and resolve it first, to get join condition 
resolved, self-join resolved,
    // etc.
{code}
I find it weird that join and joinWith haven't the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to