Wassim Almaaoui created SPARK-35652:
---------------------------------------
Summary: Different Behaviour join vs joinWith in self joining
Key: SPARK-35652
URL: https://issues.apache.org/jira/browse/SPARK-35652
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.2
Environment: Spark 3.1.2
Scala 2.12
Reporter: Wassim Almaaoui
In a self-join, `joinWith` appears to perform a cartesian join, whereas `join`
performs an inner join, even though both are given the same condition.
Snippet:
{code:java}
scala> val df = spark.range(0,5)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+
scala> df.join(df, df("id") === df("id")).count
21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate,
'id#1649L = id#1649L'. Perhaps you need to use aliases.
res21: Long = 5
scala> df.joinWith(df, df("id") === df("id")).count
21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate,
'id#1649L = id#1649L'. Perhaps you need to use aliases.
res22: Long = 25
{code}
According to the comment in the source code, joinWith is expected to handle
this case, right?
{code:java}
def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
  // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
  // etc.
{code}
I find it odd that join and joinWith do not behave the same way here.
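For comparison, here is a minimal sketch of the alias approach that the warning message hints at ("Perhaps you need to use aliases"). It is not a fix for the reported behaviour, only an illustration under assumptions: the object name `SelfJoinWithAliases`, the helper `counts`, and the aliases `l` and `r` are hypothetical names chosen for this example, and the local `SparkSession` is created only so the sketch is self-contained (in spark-shell, `spark` already exists).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only.
object SelfJoinWithAliases {

  // Returns (count from join, count from joinWith) for an aliased self-join.
  def counts(spark: SparkSession): (Long, Long) = {
    val df = spark.range(0, 5)

    // Give each side its own alias so the condition no longer compares
    // the very same attribute with itself (id#N = id#N).
    val left = df.as("l")
    val right = df.as("r")
    val cond = col("l.id") === col("r.id")

    val joinCount = left.join(right, cond).count()
    val joinWithCount = left.joinWith(right, cond).count()
    (joinCount, joinWithCount)
  }

  def main(args: Array[String]): Unit = {
    // Local session so the sketch runs outside spark-shell.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-35652-sketch")
      .getOrCreate()
    try {
      val (joinCount, joinWithCount) = counts(spark)
      println(s"join: $joinCount, joinWith: $joinWithCount")
    } finally {
      spark.stop()
    }
  }
}
```

With explicit aliases the condition refers to two distinct sides, so I would expect both calls to return 5 rows. The discrepancy reported above only concerns the un-aliased form `df("id") === df("id")`.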