Github user yucai commented on the issue:
https://github.com/apache/spark/pull/21564
@viirya I think `PartitioningCollection` should be considered. For example, in
the case below:
```
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.codegen.wholeStage", false)
val df1 = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j").as("t1")
val df2 = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("m", "n").as("t2")
val d = df1.join(df2, $"t1.i" === $"t2.m")
d.cache
val d1 = d.as("t3")
val d2 = d.as("t4")
d1.join(d2, $"t3.i" === $"t4.i").explain
```
```
SortMergeJoin [i#5], [i#54], Inner
:- InMemoryTableScan [i#5, j#6, m#15, n#16]
: +- InMemoryRelation [i#5, j#6, m#15, n#16], CachedRDDBuilder
: +- SortMergeJoin [i#5], [m#15], Inner
: :- Sort [i#5 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(i#5, 10)
: : +- LocalTableScan [i#5, j#6]
: +- Sort [m#15 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(m#15, 10)
: +- LocalTableScan [m#15, n#16]
+- Sort [i#54 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i#54, 10)
+- InMemoryTableScan [i#54, j#55, m#58, n#59]
+- InMemoryRelation [i#54, j#55, m#58, n#59], CachedRDDBuilder
+- SortMergeJoin [i#5], [m#15], Inner
:- Sort [i#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(i#5, 10)
: +- LocalTableScan [i#5, j#6]
+- Sort [m#15 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(m#15, 10)
+- LocalTableScan [m#15, n#16]
```
`Exchange hashpartitioning(i#54, 10)` is an extra shuffle: the cached join's output is already hash-partitioned on the join key, but its output partitioning is a `PartitioningCollection`, and after the cache is re-aliased (`i#5` -> `i#54`) that partitioning is not rewritten to the new attributes, so Spark inserts the exchange anyway.
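To illustrate the idea, here is a minimal toy model (not Spark's actual classes; the `Int`-based expression IDs and the simplified `satisfies` check are assumptions for the sketch). A `PartitioningCollection` satisfies a required clustering if any member partitioning does, so once the attribute IDs are re-aliased without rewriting the collection, the check fails and a shuffle is added:

```scala
// Toy model: a hash partitioning satisfies a required clustering when all of
// its partition expressions appear in the required clustering keys.
case class HashPartitioning(clusterExprs: Seq[Int]) {
  def satisfies(required: Seq[Int]): Boolean = clusterExprs.forall(required.contains)
}

// A collection satisfies the requirement if ANY member partitioning does.
case class PartitioningCollection(partitionings: Seq[HashPartitioning]) {
  def satisfies(required: Seq[Int]): Boolean = partitionings.exists(_.satisfies(required))
}

object Demo extends App {
  // Cached join output: partitioned by i#5 and, equivalently, m#15.
  val cached = PartitioningCollection(
    Seq(HashPartitioning(Seq(5)), HashPartitioning(Seq(15))))

  println(cached.satisfies(Seq(5)))  // true: original attribute still matches
  println(cached.satisfies(Seq(54))) // false: re-aliased i#54 is not rewritten,
                                     // so an extra Exchange gets planned
}
```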