cloud-fan commented on code in PR #56603:
URL: https://github.com/apache/spark/pull/56603#discussion_r3440726378
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -110,8 +110,14 @@ object PartitionPruning extends Rule[LogicalPlan] with
PredicateHelper with Join
require(filteringKeys.size == 1, "DPP Filters should only have a single
broadcasting key " +
"since there are no usage for multiple broadcasting keys at the moment.")
val indices = Seq(joinKeys.indexOf(filteringKeys.head))
- lazy val hasBenefit = pruningHasBenefit(
- pruningKey, partScan, filteringKeys.head, filteringPlan)
+ // A filtering side that is eligible only because it is already
materialized (a LocalRelation
+ // or a checkpoint-derived LogicalRDD) carries no selective predicate.
pruningHasBenefit's
+ // fallback filtering ratio is justified only "when CBO stats are missing,
but there is a
+ // predicate that is likely to be selective", which does not hold here, so
it would assume a
+ // benefit that may not exist. Without an estimated benefit such a side is
only injected when it
+ // can reuse a broadcast (onlyInBroadcast), never as a standalone
always-applied subquery.
+ lazy val hasBenefit = hasSelectivePredicate(filteringPlan) &&
Review Comment:
Good catch, fixed in 5a5535d. You're right that the previous guard
short-circuited before `pruningHasBenefit` could consult column statistics,
dropping a genuinely beneficial standalone DPP for a checkpoint-derived
`LogicalRDD` that retains NDV stats.
I moved the guard into `pruningHasBenefit`: a statistics-based filtering
ratio is now always honored, and only the configured **fallback** ratio is
gated on `hasSelectivePredicate`. So a materialized side with no usable stats
reports no benefit (broadcast-reuse only), while one with retained NDV stats
keeps standalone DPP.
I also added the regression test you asked for — `a materialized filtering
side keeps statistics-backed standalone DPP`: a checkpointed `LogicalRDD` with
retained NDV (key-side NDV = 1), `fallbackFilterRatio = 0`, `reuseBroadcastOnly
= false`, and `exchange.reuse = false` so no broadcast can be reused. It
asserts a standalone DPP subquery is injected. I verified it fails on the
previous revision (`false did not equal true`) and passes now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]