c21 commented on a change in pull request #31193:
URL: https://github.com/apache/spark/pull/31193#discussion_r572724880
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala
##########
@@ -53,22 +50,6 @@ object PushDownLeftSemiAntiJoin extends Rule[LogicalPlan]
p.copy(child = Join(gChild, rightOp, joinType, newJoinCond, hint))
}
- // LeftSemi/LeftAnti over Aggregate, only push down if join can be planned as broadcast join.
- case join @ Join(agg: Aggregate, rightOp, LeftSemiOrAnti(_), _, _)
-   if agg.aggregateExpressions.forall(_.deterministic) && agg.groupingExpressions.nonEmpty &&
Review comment:
* aggregation reduces data
* left semi / left anti broadcast join reduces data
I feel these two heuristics are both legit, and with the statistics currently
available in the logical plan, it's hard to always make the right decision. If
we really need this, maybe add a disabled-by-default config for pushing these
joins down through aggregates? (By default we would still follow the existing
behavior.)
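To make the suggestion concrete, here is a minimal, self-contained Scala sketch of gating the pushdown decision behind a disabled-by-default flag. The `Plan` classes, the flag, and the decision function are all hypothetical stand-ins, not Spark's actual `SQLConf` or Catalyst API; they only illustrate the shape of the proposed guard.

```scala
// Hypothetical sketch of the proposed config gate. The types below are toy
// stand-ins for Catalyst plan nodes; the flag name is illustrative only.
sealed trait Plan
case class Aggregate(groupingNonEmpty: Boolean, deterministic: Boolean) extends Plan
case class Join(left: Plan, canBroadcastRight: Boolean) extends Plan

object PushdownDecision {
  // Stand-in for a disabled-by-default SQLConf entry, e.g. something like
  // "spark.sql.optimizer.pushLeftSemiAntiThroughAggregate.enabled" (assumed name).
  val defaultEnabled: Boolean = false

  def shouldPushDown(join: Join, enabled: Boolean): Boolean = join.left match {
    case Aggregate(groupingNonEmpty, deterministic) =>
      // Push down only when the flag is set AND the existing conditions hold:
      // deterministic aggregate expressions, non-empty grouping keys, and a
      // join that can be planned as a broadcast join.
      enabled && deterministic && groupingNonEmpty && join.canBroadcastRight
    case _ => false
  }
}
```

With `enabled = false` (the default) the rule never fires for this pattern, preserving the existing behavior; users who know the aggregate does not reduce data much can opt in.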
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]