cloud-fan opened a new pull request, #56603: URL: https://github.com/apache/spark/pull/56603
### What changes were proposed in this pull request?

Follow-up to https://github.com/apache/spark/pull/56071 (SPARK-54593), which enabled dynamic partition pruning (DPP) for already-materialized filtering sides (a `LocalRelation` or a `checkpoint()` / `localCheckpoint()`-derived `LogicalRDD`).

This PR makes `PartitionPruning` consider a materialized filtering side for DPP only when the filter can reuse a broadcast (`onlyInBroadcast`), instead of also injecting it as a standalone, always-applied subquery. Concretely, `insertPredicate` now treats the side as beneficial only when it carries a selective predicate; an already-materialized side (which has no predicate) no longer gets `hasBenefit = true` from the fallback filtering ratio.

### Why are the changes needed?

`pruningHasBenefit` estimates the filtering ratio from column statistics, falling back to `spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio` (default `0.5`), a value documented as the ratio to use "when CBO stats are missing, but there is a predicate that is likely to be selective". A materialized filtering side carries no such predicate and typically has no column statistics, so `pruningHasBenefit` falls back to assuming it is 50%-selective and returns `true` regardless of the side's actual selectivity.

Consequently, with `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false`, a materialized filtering side that covers all (or most) of the probe table's partitions, and therefore prunes nothing, is still injected as an always-applied DPP subquery that re-executes the filtering side and evaluates a partition filter for no benefit. Before SPARK-54593 such a side was not DPP-eligible at all, so this is a regression within unreleased master.

When the side can reuse a broadcast the cost is negligible (the intended use of the feature), so this PR keeps that path and only avoids the standalone-subquery case.

### Does this PR introduce _any_ user-facing change?

No.
It only avoids planning a no-benefit dynamic partition pruning subquery for a materialized build side; query results are unchanged.

### How was this patch tested?

New unit test in `DynamicPartitionPruningSuite` ("a materialized filtering side is not injected as a standalone DPP subquery without an estimated pruning benefit"): with broadcast joins disabled and `reuseBroadcastOnly=false`, a materialized filtering side that covers every partition no longer triggers a DPP subquery. Existing positive tests for materialized filtering sides (which use broadcast reuse) continue to pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
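
For readers following the reasoning under "Why are the changes needed?", the sketch below models the decision in plain Python. It is a simplified illustration, not Spark's code: `FilteringSide`, `has_benefit_before`, `has_benefit_after`, and `plan_dpp` are hypothetical names, and the `ratio > 0.0` check is a placeholder for Spark's actual cost comparison, which the description does not spell out. Only the `0.5` fallback and the "requires a selective predicate" rule come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Default of spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio,
# as stated in the PR description.
FALLBACK_FILTER_RATIO = 0.5


@dataclass
class FilteringSide:
    """Hypothetical stand-in for the filtering side of a join."""
    has_selective_predicate: bool
    stats_filter_ratio: Optional[float] = None  # None: no column statistics


def has_benefit_before(side: FilteringSide) -> bool:
    # Behavior before this PR, as described: missing statistics fall back
    # to 0.5, so the side looks beneficial even without any predicate.
    # The "> 0.0" comparison is a placeholder for the real cost check.
    ratio = side.stats_filter_ratio
    if ratio is None:
        ratio = FALLBACK_FILTER_RATIO
    return ratio > 0.0


def has_benefit_after(side: FilteringSide) -> bool:
    # Behavior after this PR, as described: a side without a selective
    # predicate (e.g. an already-materialized one) is never beneficial.
    return side.has_selective_predicate and has_benefit_before(side)


def plan_dpp(has_benefit: bool, reuse_broadcast_only: bool) -> str:
    # Simplified: a standalone subquery is planned only when pruning is
    # estimated to be beneficial and reuseBroadcastOnly is false.
    if has_benefit and not reuse_broadcast_only:
        return "standalone-subquery"
    return "only-in-broadcast"


# A materialized side: no predicate and no column statistics.
materialized = FilteringSide(has_selective_predicate=False)
print(plan_dpp(has_benefit_before(materialized), reuse_broadcast_only=False))
print(plan_dpp(has_benefit_after(materialized), reuse_broadcast_only=False))
```

Under these assumptions the first call models the regression (a standalone subquery for a side that may prune nothing) and the second models the fixed behavior, where the side is only used when a broadcast can be reused.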
