cloud-fan opened a new pull request, #56603:
URL: https://github.com/apache/spark/pull/56603

   ### What changes were proposed in this pull request?
   
   Follow-up to https://github.com/apache/spark/pull/56071 (SPARK-54593), which 
enabled dynamic partition pruning (DPP) for already-materialized filtering 
sides (a `LocalRelation` or a `checkpoint()` / `localCheckpoint()`-derived 
`LogicalRDD`).
   
   This PR makes `PartitionPruning` consider a materialized filtering side for 
DPP only when the filter can reuse a broadcast (`onlyInBroadcast`), instead of 
also injecting it as a standalone, always-applied subquery. Concretely, 
`insertPredicate` now treats the side as beneficial only when it carries a 
selective predicate; an already-materialized side (which has no predicate) no 
longer gets `hasBenefit = true` from the fallback filtering ratio.
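
The gating described above can be sketched as a small standalone model. This is not the Spark source: `FilteringSide`, `DppDecision`, `DppGateSketch`, and `estimateSaysBeneficial` are hypothetical names, and the rule `onlyInBroadcast = reuseBroadcastOnly || !hasBenefit` is an assumption about how the flag is derived, used here only to illustrate the intended behavior.

```scala
// Simplified model for illustration only; these are not Spark classes.
final case class FilteringSide(hasSelectivePredicate: Boolean)
final case class DppDecision(hasBenefit: Boolean, onlyInBroadcast: Boolean)

object DppGateSketch {
  // estimateSaysBeneficial stands in for the outcome of the statistics-based
  // estimate (including the fallback ratio); it is a hypothetical input.
  def decide(
      side: FilteringSide,
      estimateSaysBeneficial: Boolean,
      reuseBroadcastOnly: Boolean): DppDecision = {
    // A side without a selective predicate never counts as beneficial,
    // whatever the estimate says.
    val hasBenefit = side.hasSelectivePredicate && estimateSaysBeneficial
    // Assumption: without an estimated benefit, the filter may only be
    // applied when it can reuse a broadcast.
    DppDecision(hasBenefit, onlyInBroadcast = reuseBroadcastOnly || !hasBenefit)
  }
}
```

Under this model, `DppGateSketch.decide(FilteringSide(hasSelectivePredicate = false), estimateSaysBeneficial = true, reuseBroadcastOnly = false)` yields `hasBenefit = false` and `onlyInBroadcast = true`, so a materialized side is never planned as a standalone subquery.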
   
   ### Why are the changes needed?
   
   `pruningHasBenefit` estimates the filtering ratio from column statistics, 
falling back to 
`spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio` (default 
`0.5`) — a value documented as the ratio to use "when CBO stats are missing, 
but there is a predicate that is likely to be selective". A materialized 
filtering side carries no such predicate and typically has no column 
statistics, so `pruningHasBenefit` falls back to assuming it is 50%-selective 
and returns `true` regardless of the side's actual selectivity.
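
To make the fallback problem concrete, here is a minimal standalone sketch. It assumes the benefit check compares the estimated pruned bytes (filter ratio times probe-side size) against the overhead of the filtering side; that comparison, and the names `FallbackRatioSketch`, `filterRatio`, `looksBeneficial`, `probeSizeBytes`, and `filteringOverheadBytes`, are assumptions made for illustration rather than the actual `pruningHasBenefit` code.

```scala
// Hypothetical model of a statistics-based benefit estimate; not Spark code.
object FallbackRatioSketch {
  // Default of spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio,
  // as stated in the description above.
  val FallbackFilterRatio: Double = 0.5

  // Use the ratio derived from column statistics when present, otherwise fall back.
  def filterRatio(statsRatio: Option[Double]): Double =
    statsRatio.getOrElse(FallbackFilterRatio)

  // Assumption: pruning is judged beneficial when the estimated pruned bytes
  // exceed the cost of evaluating the filtering side.
  def looksBeneficial(
      statsRatio: Option[Double],
      probeSizeBytes: Long,
      filteringOverheadBytes: Long): Boolean =
    filterRatio(statsRatio) * probeSizeBytes > filteringOverheadBytes
}
```

With a 10 GiB probe side and a 1 MiB filtering side, `looksBeneficial(None, ...)` is `true` because the missing statistics are replaced by `0.5`, while `looksBeneficial(Some(0.0), ...)` is `false`, which is what a side covering every partition would actually warrant.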
   
   Consequently, with 
`spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false`, a 
materialized filtering side that covers all (or most) of the probe table's 
partitions — and therefore prunes nothing — is still injected as an 
always-applied DPP subquery that re-executes the filtering side and evaluates a 
partition filter for no benefit. Before SPARK-54593 such a side was not 
DPP-eligible at all, so this is a regression within unreleased master. When the 
side can reuse a broadcast the cost is negligible (the intended use of the 
feature), so this PR keeps that path and only avoids the standalone-subquery 
case.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. It only avoids planning a no-benefit dynamic partition pruning subquery 
for a materialized filtering side; query results are unchanged.
   
   ### How was this patch tested?
   
   New unit test in `DynamicPartitionPruningSuite` ("a materialized filtering 
side is not injected as a standalone DPP subquery without an estimated pruning 
benefit"): with broadcast joins disabled and `reuseBroadcastOnly=false`, a 
materialized filtering side that covers every partition no longer triggers a 
DPP subquery. Existing positive tests for materialized filtering sides (which 
use broadcast reuse) continue to pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic Claude Opus)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

