wangyum commented on code in PR #33522:
URL: https://github.com/apache/spark/pull/33522#discussion_r898557548
##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1057,7 +1057,7 @@ class JoinSuite extends QueryTest with SharedSparkSession
with AdaptiveSparkPlan
val pythonEvals = collect(joinNode.get) {
case p: BatchEvalPythonExec => p
}
- assert(pythonEvals.size == 2)
+ assert(pythonEvals.size == 4)
Review Comment:
I think pulling out the complex join key has 3 advantages:
1. Reduces the number of complex join key evaluations from 3 to 2 for SMJ.
2. Infers additional filters, which can sometimes avoid data skew. For example:
https://github.com/apache/spark/pull/28642
3. Avoids other rules having to handle this case themselves. For example:
https://github.com/apache/spark/blob/dee7396204e2f6e7346e220867953fc74cd4253d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushPartialAggregationThroughJoin.scala#L325-L327
It has two disadvantages:
1. Increases the number of complex join key evaluations from 1 to 2 for BHJ.
2. It may increase the shuffle data size, for example when the join key is
`concat(col1, col2, col3, col4 ...)`.
Personally, I think this rule is valuable. We have been using this rule for
half a year.
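To illustrate the trade-off above, here is a minimal plain-Scala sketch (not Spark code; `Row`, the sample data, and the `complexKey` evaluation counter are illustrative assumptions) of what pulling out a complex join key buys: the key is projected once per row up front instead of being recomputed at every join comparison.

```scala
// Sketch of "pulling out" a complex join key, counting key evaluations.
object PullOutSketch {
  case class Row(c1: String, c2: String, v: Int)

  private var evals = 0
  // Stand-in for a complex key expression such as concat(c1, c2).
  private def complexKey(r: Row): String = { evals += 1; r.c1 + r.c2 }

  // Returns (evaluations without pull-out, evaluations with pull-out, matches).
  def run(): (Int, Int, Int) = {
    val rowsA = Seq(Row("a", "1", 10), Row("b", "2", 20))
    val rowsB = Seq(Row("a", "1", 30), Row("c", "3", 40))

    // Without pull-out: the key is recomputed at every comparison.
    evals = 0
    val naive = for (a <- rowsA; b <- rowsB if complexKey(a) == complexKey(b))
      yield (a, b)
    val naiveEvals = evals

    // With pull-out: project the key once per row, then join on the
    // projected column.
    evals = 0
    val keyedA = rowsA.map(r => (complexKey(r), r))
    val keyedB = rowsB.map(r => (complexKey(r), r))
    val pulled = for ((ka, a) <- keyedA; (kb, b) <- keyedB if ka == kb)
      yield (a, b)
    (naiveEvals, evals, pulled.size)
  }
}
```

With 2 rows on each side, the naive join evaluates the key 8 times (2 per comparison, 4 comparisons) while the pulled-out version evaluates it 4 times (once per row), at the cost of carrying the materialized key through the shuffle.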
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]