wangyum commented on code in PR #33522:
URL: https://github.com/apache/spark/pull/33522#discussion_r898557548
##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1057,7 +1057,7 @@ class JoinSuite extends QueryTest with SharedSparkSession
with AdaptiveSparkPlan
val pythonEvals = collect(joinNode.get) {
case p: BatchEvalPythonExec => p
}
- assert(pythonEvals.size == 2)
+ assert(pythonEvals.size == 4)
Review Comment:
I think pulling out the complex join key has 3 advantages:
1. Reduces the number of complex join key evaluations from 3 to 2 for SMJ.
2. Infers additional filters, which can sometimes avoid data skew. For example:
https://github.com/apache/spark/pull/28642
3. Avoids other rules having to handle this case themselves. For example:
https://github.com/apache/spark/blob/dee7396204e2f6e7346e220867953fc74cd4253d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushPartialAggregationThroughJoin.scala#L325-L327
It has two disadvantages:
1. Increases the number of complex join key evaluations from 1 to 2 for BHJ.
2. It may increase the shuffle data size, for example when the join key is
`concat(col1, col2, col3, col4 ...)`.
Personally, I think this rule is valuable. We have been using this rule for
half a year.
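To illustrate the trade-off above, here is a minimal plain-Scala sketch (not Spark code; `Row`, the sample data, and the `complexKey` evaluation counter are illustrative assumptions) of what pulling out a complex join key buys: the key is projected once per row up front instead of being recomputed at every join comparison.

```scala
// Sketch of "pulling out" a complex join key, counting key evaluations.
object PullOutSketch {
  case class Row(c1: String, c2: String, v: Int)

  private var evals = 0
  // Stand-in for a complex key expression such as concat(c1, c2).
  private def complexKey(r: Row): String = { evals += 1; r.c1 + r.c2 }

  // Returns (evaluations without pull-out, evaluations with pull-out, matches).
  def run(): (Int, Int, Int) = {
    val rowsA = Seq(Row("a", "1", 10), Row("b", "2", 20))
    val rowsB = Seq(Row("a", "1", 30), Row("c", "3", 40))

    // Without pull-out: the key is recomputed at every comparison.
    evals = 0
    val naive = for (a <- rowsA; b <- rowsB if complexKey(a) == complexKey(b))
      yield (a, b)
    val naiveEvals = evals

    // With pull-out: project the key once per row, then join on the
    // projected column.
    evals = 0
    val keyedA = rowsA.map(r => (complexKey(r), r))
    val keyedB = rowsB.map(r => (complexKey(r), r))
    val pulled = for ((ka, a) <- keyedA; (kb, b) <- keyedB if ka == kb)
      yield (a, b)
    (naiveEvals, evals, pulled.size)
  }
}
```

With 2 rows on each side, the naive join evaluates the key 8 times (2 per comparison, 4 comparisons) while the pulled-out version evaluates it 4 times (once per row), at the cost of carrying the materialized key through the shuffle.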
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]