szehon-ho opened a new pull request, #56195:
URL: https://github.com/apache/spark/pull/56195

   ### What changes were proposed in this pull request?
   
   In `PushDownUtils.pushFilters`, for scans implementing 
`SupportsPushDownV2Filters` with iterative pushdown 
(`supportsIterativePushdown() == true`), a second pass derives 
`PartitionPredicate`s from filters left over after the first pass and pushes 
them down.
   
   Previously, the candidate filters for this second pass were taken from the 
predicates **returned** by `pushPredicates()` (the post-scan filters). Per the 
`SupportsPushDownV2Filters` contract, that return value contains both:
   - non-pushable predicates, and
   - pushable predicates that were accepted but still need post-scan evaluation 
(partial pushdown, e.g. a Parquet row group filter).
   
   The latter are also reported by `pushedPredicates()`. Using the returned 
predicates as candidates therefore re-derived `PartitionPredicate`s from 
filters that were **already pushed** in the first pass, pushing the same filter 
down twice.
   
   This PR changes the second-pass candidate selection to only use filters that 
were **not** already pushed down in the first pass (i.e. not in 
`pushedPredicates()`). Filters that were pushed but still need post-scan 
evaluation remain in the post-scan set, but are no longer re-derived as 
`PartitionPredicate`s. This mirrors the existing runtime-filter path 
(`pushRuntimeFilters`), which also excludes predicates that were already pushed.
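
To illustrate the change in candidate selection, here is a minimal, self-contained sketch. It is not the code in `PushDownUtils`: predicates are modeled as plain strings, and the class and method names (`SecondPassCandidates`, `candidatesBefore`, `candidatesAfter`) are hypothetical. It only shows the set difference between the predicates returned by `pushPredicates()` and those reported by `pushedPredicates()`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; predicates are modeled as strings, not Spark Predicate objects.
class SecondPassCandidates {

    // Old behavior: every returned (post-scan) predicate was a second-pass candidate.
    static List<String> candidatesBefore(List<String> returned) {
        return new ArrayList<>(returned);
    }

    // New behavior: drop anything the scan already reported as pushed.
    static List<String> candidatesAfter(List<String> returned, List<String> pushed) {
        Set<String> alreadyPushed = new HashSet<>(pushed);
        List<String> candidates = new ArrayList<>();
        for (String predicate : returned) {
            if (!alreadyPushed.contains(predicate)) {
                candidates.add(predicate);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // "part = 1" was accepted in the first pass but also returned for post-scan evaluation.
        List<String> returned = List.of("part = 1", "data > 5");
        List<String> pushed = List.of("part = 1");

        System.out.println(candidatesBefore(returned));        // [part = 1, data > 5]
        System.out.println(candidatesAfter(returned, pushed)); // [data > 5]
    }
}
```

In this sketch the post-scan set itself is untouched; only the input to the second pass shrinks, so a partially pushed filter is still evaluated after the scan.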
   
   ### Why are the changes needed?
   
   The previous behavior pushed the same filter to the data source twice (once 
as the original predicate in the first pass, and again as a 
`PartitionPredicate` in the second pass) whenever a data source partially 
pushes a partition filter (accepts it but also returns it for post-scan 
evaluation). This is redundant work and inconsistent with the documented 
contract and the runtime-filter path.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This affects the in-progress iterative `PartitionPredicate` pushdown 
path (SPARK-55596) and is not part of a released Spark version.
   
   ### How was this patch tested?
   
   Added unit tests to `DataSourceV2EnhancedPartitionFilterSuite` (case 9, plus 
a nested-partition variant) covering a partition filter that is accepted and 
also returned in the first pass; each test asserts that it is pruned in the first 
pass, kept as a post-scan filter, and not re-pushed as a `PartitionPredicate` 
in the second pass. A new `return-accepted-partition-predicates` property was 
added to `InMemoryEnhancedPartitionFilterTable` to simulate partial pushdown. 
All 28 tests in the suite pass.
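
The scenario these tests cover can be approximated with a toy scan. This is a sketch under stated assumptions, not the actual `InMemoryEnhancedPartitionFilterTable`: the class `ToyPartialPushdownScan`, its `returnAccepted` flag (standing in for the `return-accepted-partition-predicates` property), and the string-based predicates are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scan builder that partially pushes partition filters.
class ToyPartialPushdownScan {
    private final boolean returnAccepted;
    private final List<String> pushed = new ArrayList<>();

    ToyPartialPushdownScan(boolean returnAccepted) {
        this.returnAccepted = returnAccepted;
    }

    // Accepts predicates on the toy partition column "part" and returns the post-scan set.
    // With returnAccepted set, accepted predicates are also returned (partial pushdown).
    List<String> pushPredicates(List<String> predicates) {
        List<String> postScan = new ArrayList<>();
        for (String predicate : predicates) {
            boolean accepted = predicate.startsWith("part ");
            if (accepted) {
                pushed.add(predicate);
            }
            if (!accepted || returnAccepted) {
                postScan.add(predicate);
            }
        }
        return postScan;
    }

    List<String> pushedPredicates() {
        return List.copyOf(pushed);
    }

    // Second-pass candidates: post-scan predicates minus those already pushed.
    static List<String> secondPassCandidates(List<String> postScan, List<String> pushed) {
        List<String> candidates = new ArrayList<>(postScan);
        candidates.removeAll(pushed);
        return candidates;
    }

    public static void main(String[] args) {
        ToyPartialPushdownScan scan = new ToyPartialPushdownScan(true);
        List<String> postScan = scan.pushPredicates(List.of("part = 1", "data > 5"));

        System.out.println(scan.pushedPredicates()); // [part = 1]
        System.out.println(postScan);                // [part = 1, data > 5]
        System.out.println(secondPassCandidates(postScan, scan.pushedPredicates())); // [data > 5]
    }
}
```

The three printed lists correspond to the three assertions described above: the partition filter is pushed in the first pass, kept as a post-scan filter, and absent from the second-pass candidates.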
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Cursor (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

