[PR] [SPARK-54593][SQL][FOLLOWUP] Do not narrow materialized-input DPP eligibility based on operators above the materialized input [spark]

via GitHub Sun, 21 Jun 2026 00:08:35 -0700


cloud-fan opened a new pull request, #56636:
URL: https://github.com/apache/spark/pull/56636


   ### What changes were proposed in this pull request?
   
   Follow-up to #56535 (SPARK-54593). That PR narrowed materialized-input DPP 
eligibility from "the filtering side contains a materialized input" to a 
structural allowlist (`isRepeatableMaterializedPlan`: a materialized leaf 
composed only through deterministic 
`Project`/`Filter`/`Union`/`SubqueryAlias`). This reverts that narrowing: 
eligibility again only checks that the side contains an already-materialized 
input (a `LocalRelation`, or a checkpoint-derived `LogicalRDD`). The 
materialization guard from #56535 -- `isCheckpointedInput` requiring 
`rdd.isCheckpointed`, so a lazy checkpoint isn't treated as materialized -- is 
**kept**.
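
As a rough illustration of the two eligibility rules described above, here is a minimal toy model in Python. It is not Spark's code: the `Node` class, its fields, and the function names `contains_materialized_input` and `is_materialized_leaf` are hypothetical stand-ins, and the allowlist logic is only a reading of this description, not a copy of `isRepeatableMaterializedPlan`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a logical plan node (hypothetical, not Spark's LogicalPlan)."""
    op: str
    children: List["Node"] = field(default_factory=list)
    deterministic: bool = True
    # Only meaningful for op == "LogicalRDD"; models rdd.isCheckpointed.
    rdd_is_checkpointed: bool = False


def is_materialized_leaf(node: Node) -> bool:
    """A LocalRelation, or a LogicalRDD whose checkpoint has actually been written."""
    if node.op == "LocalRelation":
        return True
    return node.op == "LogicalRDD" and node.rdd_is_checkpointed


def contains_materialized_input(plan: Node) -> bool:
    """Rule restored by this PR: some leaf below is already materialized."""
    return is_materialized_leaf(plan) or any(
        contains_materialized_input(child) for child in plan.children
    )


# Operators the reverted allowlist accepted above a materialized leaf.
ALLOWLIST = {"Project", "Filter", "Union", "SubqueryAlias"}


def is_repeatable_materialized_plan(plan: Node) -> bool:
    """Reverted narrowing: only deterministic allowlisted operators over materialized leaves."""
    if is_materialized_leaf(plan):
        return True
    return (
        plan.op in ALLOWLIST
        and plan.deterministic
        and bool(plan.children)
        and all(is_repeatable_materialized_plan(child) for child in plan.children)
    )


# A deterministic aggregate over a materialized input: rejected by the allowlist,
# accepted by the restored rule.
aggregate_side = Node("Aggregate", [Node("LocalRelation")])

# A lazy checkpoint that has not been written yet: rejected by both rules.
lazy_side = Node("Project", [Node("LogicalRDD", rdd_is_checkpointed=False)])
```

In this model, `aggregate_side` is the over-rejection case, while `lazy_side` stays ineligible under both rules because the materialization guard is part of the leaf check rather than the allowlist.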
   
   ### Why are the changes needed?
   
   The allowlist tried to ensure the operators *above* the materialized leaf 
are repeatable. But that is the **general DPP re-evaluation concern**, not 
specific to materialized inputs: DPP duplicates the filtering side on every 
eligibility path, so a non-deterministic operator (a `mapPartitions` closure, a 
UDF over a non-deterministic source) is non-repeatable on the 
selective-predicate path too -- and Spark cannot decide a plan's repeatability 
in general (opaque RDD/closure non-determinism is invisible to Catalyst). So 
the allowlist (a) does not solve the general problem and (b) over-rejects 
legitimate deterministic materialized sides (e.g. an aggregate, or any 
non-allowlisted operator, over a materialized input) that re-evaluate fine.
   
   The one genuinely materialized-input-specific hazard -- a lazy checkpoint 
that has not been materialized yet -- is handled by `isCheckpointedInput` 
requiring `rdd.isCheckpointed`, which is retained.
   
   A non-repeatable plan above a materialized input (e.g. 
`checkpoint.mapPartitions(counter)`) can again be DPP-eligible. That is the 
same pre-existing, universal DPP re-evaluation limitation that the 
selective-predicate path already has; if we want to address it, it should be a 
uniform DPP-wide change, not a materialized-input-only narrowing.
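
The re-evaluation concern can be sketched without Spark at all. The plain-Python toy below assumes the filtering side is evaluated twice (once for the join, once to build the pruning filter) and compares a stateful operator with a deterministic one over the same materialized rows. All names here (`make_counter_op`, `make_deterministic_op`, `evaluate_twice`) are hypothetical and only mimic the `checkpoint.mapPartitions(counter)` shape.

```python
import itertools

# Stand-in for an already-materialized input (e.g. a checkpointed dataset).
materialized_rows = [1, 2, 3]


def make_counter_op():
    """Stateful operator: tags each row with a counter that survives across evaluations."""
    counter = itertools.count()

    def op(rows):
        return [(next(counter), row) for row in rows]

    return op


def make_deterministic_op():
    """Deterministic operator: output depends only on the input rows."""
    return lambda rows: [(row % 2, row) for row in rows]


def evaluate_twice(op, rows):
    """Model DPP duplicating the filtering side: the same operator runs twice."""
    return op(rows), op(rows)


counter_first, counter_second = evaluate_twice(make_counter_op(), materialized_rows)
det_first, det_second = evaluate_twice(make_deterministic_op(), materialized_rows)

# The stateful operator disagrees with itself even though its input is materialized;
# the deterministic one does not.
print(counter_first == counter_second)  # False
print(det_first == det_second)          # True
```

The input being materialized does not help here: the divergence comes from the operator above it, which is why the description treats this as a DPP-wide limitation rather than something specific to materialized inputs.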
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. DPP is an optimization; results are unchanged for repeatable filtering 
sides, which is the supported case.
   
   ### How was this patch tested?
   
   Existing `DynamicPartitionPruning*Suite`s. Removes the two tests added in 
#56535 that asserted the reverted operators-above narrowing.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic Claude Opus)
   

