szehon-ho opened a new pull request, #56195: URL: https://github.com/apache/spark/pull/56195
### What changes were proposed in this pull request?

In `PushDownUtils.pushFilters`, for scans implementing `SupportsPushDownV2Filters` with iterative pushdown (`supportsIterativePushdown() == true`), a second pass derives `PartitionPredicate`s from filters left over after the first pass and pushes them down.

Previously, the candidate filters for this second pass were taken from the predicates **returned** by `pushPredicates()` (the post-scan filters). Per the `SupportsPushDownV2Filters` contract, that return value contains both:

- non-pushable predicates, and
- pushable predicates that were accepted but still need post-scan evaluation (partial pushdown, e.g. a Parquet row group filter).

The latter are reported by `pushedPredicates()`. Using the returned predicates as candidates therefore re-derived `PartitionPredicate`s from filters that were **already pushed** in the first pass, pushing the same filter down twice.

This PR changes the second-pass candidate selection to only use filters that were **not** already pushed down in the first pass (i.e. not in `pushedPredicates()`). Filters that were pushed but still need post-scan evaluation remain in the post-scan set, but are no longer re-derived as `PartitionPredicate`s.

This mirrors the existing runtime-filter path (`pushRuntimeFilters`), which already excludes already-pushed predicates.

### Why are the changes needed?

The previous behavior pushed the same filter to the data source twice (once as the original predicate in the first pass, and again as a `PartitionPredicate` in the second pass) whenever a data source partially pushes a partition filter (accepts it but also returns it for post-scan evaluation). This is redundant work and inconsistent with the documented contract and the runtime-filter path.

### Does this PR introduce _any_ user-facing change?

No. This affects the in-progress iterative `PartitionPredicate` pushdown path (SPARK-55596) and is not part of a released Spark version.
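
As a rough illustration of the candidate-selection change described under "What changes were proposed", the sketch below contrasts the old and new second-pass inputs. It is not the code in `PushDownUtils`: predicates are modeled as plain strings, and the class and method names (`SecondPassCandidates`, `candidatesBefore`, `candidatesAfter`) are hypothetical, chosen only for this example. It also assumes that "already pushed" can be decided by simple equality, which may differ from how the real implementation compares predicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SecondPassCandidates {

    // Old behavior (as described in the PR): every predicate returned by
    // pushPredicates() was treated as a second-pass candidate.
    static List<String> candidatesBefore(List<String> postScan) {
        return new ArrayList<>(postScan);
    }

    // New behavior (as described in the PR): predicates that the first pass
    // already pushed are excluded from the second-pass candidates.
    // Assumption: membership is checked by plain equality on the stand-in strings.
    static List<String> candidatesAfter(List<String> postScan, List<String> pushed) {
        Set<String> alreadyPushed = new HashSet<>(pushed);
        List<String> candidates = new ArrayList<>();
        for (String predicate : postScan) {
            if (!alreadyPushed.contains(predicate)) {
                candidates.add(predicate);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // "part = 1" is partially pushed: accepted by the source, yet also
        // returned for post-scan evaluation. The second predicate is a
        // made-up example of a filter the source did not accept.
        List<String> postScan = Arrays.asList("part = 1", "length(name) > 3");
        List<String> pushed = Arrays.asList("part = 1");

        System.out.println(candidatesBefore(postScan));        // [part = 1, length(name) > 3]
        System.out.println(candidatesAfter(postScan, pushed)); // [length(name) > 3]
    }
}
```

In this toy model the post-scan list itself is left unchanged, which matches the PR's statement that partially pushed filters stay in the post-scan set; only the input to the second pass shrinks.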

### How was this patch tested?

Added unit tests to `DataSourceV2EnhancedPartitionFilterSuite` (case 9, plus a nested-partition variant) covering a partition filter that is accepted and also returned in the first pass; the test asserts it is pruned in the first pass, kept as a post-scan filter, and not re-pushed as a `PartitionPredicate` in the second pass.

A new `return-accepted-partition-predicates` property was added to `InMemoryEnhancedPartitionFilterTable` to simulate partial pushdown. All 28 tests in the suite pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
