JoshRosen commented on code in PR #43938:
URL: https://github.com/apache/spark/pull/43938#discussion_r1402767552


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1755,4 +1755,27 @@ class JoinSuite extends QueryTest with 
SharedSparkSession with AdaptiveSparkPlan
       cached.unpersist()
     }
   }
+
+  test("SPARK-46037: When Left Join build Left, ShuffledHashJoinExec may " +

Review Comment:
   I wonder whether the scope of this bug is broader in scope than just left 
outer joins:
   
   For inner joins, we can build either side but the current code assumes that 
the build side is always the right side. According to our docs at 
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html:
   
   > SHUFFLE_HASH
   > Suggests that Spark use shuffle hash join. If both sides have the shuffle 
hash hints, Spark chooses the smaller side (based on stats) as the build side.
   
   which I interpret to mean that we could force a build on the right hand side 
by putting the hint on the right table of an inner join.
   
   ---
   
   The build side selection for shuffle hash joins is done at 
   
   
https://github.com/apache/spark/blob/2966c791cb9048e86407b03053a16033a56f422e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L313-L342
   
   and if no hint is given then I think control will eventually flow to 
   
   
https://github.com/apache/spark/blob/2966c791cb9048e86407b03053a16033a56f422e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L487-L503
   
   and end up calling
   
   
https://github.com/apache/spark/blob/2966c791cb9048e86407b03053a16033a56f422e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L354-L356
   
   and if neither plan has stats then it looks like we implicitly prefer to 
build the right side.
   
   ----
   
   Given all of this, I think we probably also have a correctness bug for inner 
joins but that whether or not it is actually hit is a function of plan stats.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to