cloud-fan commented on a change in pull request #31984:
URL: https://github.com/apache/spark/pull/31984#discussion_r607642852
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala
##########
@@ -247,13 +246,15 @@ object PartitionPruning extends Rule[LogicalPlan] with
PredicateHelper with Join
// otherwise the pruning will not trigger
var partScan = getPartitionTableScan(l, left)
if (partScan.isDefined && canPruneLeft(joinType) &&
- hasPartitionPruningFilter(right)) {
+ hasPartitionPruningFilter(right) &&
+ (canBroadcastBySize(right, conf) ||
hintToBroadcastRight(hint))) {
newLeft = insertPredicate(l, newLeft, r, right, rightKeys,
partScan.get,
canBuildBroadcastRight(joinType))
} else {
partScan = getPartitionTableScan(r, right)
if (partScan.isDefined && canPruneRight(joinType) &&
- hasPartitionPruningFilter(left) ) {
+ hasPartitionPruningFilter(left) &&
+ (canBroadcastBySize(left, conf) ||
hintToBroadcastLeft(hint))) {
Review comment:
If `broadcastOnly=true`, we can't skip DPP here and need to wait until
the planner phase to see if it's a broadcast join.
If `broadcastOnly=false`, and `hasBenefits=true`, we always do DPP (this PR
changes this behavior). If `hasBenefits=false`, we can't skip DPP here and need
to wait until the planner phase to see if it's a broadcast join.
That said, the shortcut here is to skip DPP earlier if we know for sure that
the join can't be broadcast join. I feel it's more robust to wait for the
planner and see if it' a broadcast join. Do we really need the shortcut?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]