cloud-fan commented on code in PR #56603:
URL: https://github.com/apache/spark/pull/56603#discussion_r3440726378


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
##########
@@ -110,8 +110,14 @@ object PartitionPruning extends Rule[LogicalPlan] with 
PredicateHelper with Join
     require(filteringKeys.size == 1, "DPP Filters should only have a single 
broadcasting key " +
       "since there are no usage for multiple broadcasting keys at the moment.")
     val indices = Seq(joinKeys.indexOf(filteringKeys.head))
-    lazy val hasBenefit = pruningHasBenefit(
-      pruningKey, partScan, filteringKeys.head, filteringPlan)
+    // A filtering side that is eligible only because it is already 
materialized (a LocalRelation
+    // or a checkpoint-derived LogicalRDD) carries no selective predicate. 
pruningHasBenefit's
+    // fallback filtering ratio is justified only "when CBO stats are missing, 
but there is a
+    // predicate that is likely to be selective", which does not hold here, so 
it would assume a
+    // benefit that may not exist. Without an estimated benefit such a side is 
only injected when it
+    // can reuse a broadcast (onlyInBroadcast), never as a standalone 
always-applied subquery.
+    lazy val hasBenefit = hasSelectivePredicate(filteringPlan) &&

Review Comment:
   Good catch, fixed in 5a5535d. You're right that the previous guard 
short-circuited before `pruningHasBenefit` could consult column statistics, 
dropping a genuinely beneficial standalone DPP for a checkpoint-derived 
`LogicalRDD` that retains NDV stats.
   
   I moved the guard into `pruningHasBenefit`: a statistics-based filtering 
ratio is now always honored, and only the configured **fallback** ratio is 
gated on `hasSelectivePredicate`. So a materialized side with no usable stats 
reports no benefit (broadcast-reuse only), while one with retained NDV stats 
keeps standalone DPP.
   
   I also added the regression test you asked for — `a materialized filtering 
side keeps statistics-backed standalone DPP`: a checkpointed `LogicalRDD` with 
retained NDV (key-side NDV = 1), `fallbackFilterRatio = 0`, `reuseBroadcastOnly 
= false`, and `exchange.reuse = false` so no broadcast can be reused. It 
asserts a standalone DPP subquery is injected. I verified it fails on the 
previous revision (`false did not equal true`) and passes now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to