wangyum commented on code in PR #33522:
URL: https://github.com/apache/spark/pull/33522#discussion_r898557548


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1057,7 +1057,7 @@ class JoinSuite extends QueryTest with SharedSparkSession 
with AdaptiveSparkPlan
     val pythonEvals = collect(joinNode.get) {
       case p: BatchEvalPythonExec => p
     }
-    assert(pythonEvals.size == 2)
+    assert(pythonEvals.size == 4)

Review Comment:
   I think pulling the complex join keys out into a Project has 3 advantages:
   1. It reduces the number of times the complex join key expression is evaluated from 3 to 2 for SMJ.
   2. It infers additional filters, which can sometimes avoid data skew. For example: 
https://github.com/apache/spark/pull/28642
   3. It keeps other rules from having to handle this case as well. For example: 
https://github.com/apache/spark/blob/dee7396204e2f6e7346e220867953fc74cd4253d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushPartialAggregationThroughJoin.scala#L325-L327
   
   It has two disadvantages:
   1. It increases the number of complex join key evaluations from 1 to 2 for BHJ.
   2. It may increase the shuffle data size, for example when the join key is 
`concat(col1, col2, col3, col4 ...)`.
   
   Personally, I think this rule is valuable. We have been using this rule for 
half a year.
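
   To make the SMJ trade-off concrete, here is a hedged sketch in plain Scala (no Spark dependency): it models how each physical stage that needs the join key (exchange partitioning, sort, merge comparison) evaluates the key expression per row, versus projecting the key once up front. The names `PullOutJoinKeySketch`, `complexKey`, and the counting helpers are invented for this illustration; they are not Spark APIs.

```scala
object PullOutJoinKeySketch {

  // Without pull-out: the key expression runs once per row in each of the
  // three stages that consume it (partition, sort, merge compare).
  def countEvalsWithoutPullOut(rows: Seq[(String, String)]): Int = {
    var evals = 0
    // Stand-in for a complex key expression like concat(col1, col2)
    def complexKey(r: (String, String)): String = { evals += 1; r._1 + r._2 }

    val partitioned = rows.groupBy(r => complexKey(r).hashCode & 1)   // run 1: partition
    val sorted = partitioned.values.flatten.toSeq
      .map(r => (complexKey(r), r)).sortBy(_._1).map(_._2)            // run 2: sort
    sorted.foreach(r => complexKey(r))                                // run 3: merge compare
    evals
  }

  // With pull-out: a Project computes the key once per row; the later
  // stages reuse the pre-computed column instead of re-evaluating.
  def countEvalsWithPullOut(rows: Seq[(String, String)]): Int = {
    var evals = 0
    def complexKey(r: (String, String)): String = { evals += 1; r._1 + r._2 }

    val projected = rows.map(r => (complexKey(r), r))                 // single run
    projected.groupBy(_._1.hashCode & 1)
      .values.flatten.toSeq.sortBy(_._1)                              // reuses the column
    evals
  }
}
```

   With 3 input rows the sketch evaluates the key 9 times (3 rows x 3 stages) without pull-out and 3 times with it, which is the shape of advantage 1 above; the flip side is that the projected key column itself is what can inflate the shuffle size in disadvantage 2.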



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

