adriangb opened a new pull request, #21620:
URL: https://github.com/apache/datafusion/pull/21620

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   [#21160](https://github.com/apache/datafusion/pull/21160) added 
`datafusion.explain.analyze_categories`, which lets `EXPLAIN ANALYZE` emit only 
deterministic metric categories (e.g. `'rows'`). That removed a long-standing 
blocker on porting tests out of 
`datafusion/core/tests/physical_optimizer/filter_pushdown.rs`: previously these 
tests had to assert on execution state via `insta` snapshots over hand-wired 
`ExecutionPlan` trees and mock `TestSource` data, which made them expensive to 
read, expensive to update, and impossible to exercise through the user-facing SQL path.
   
   With `analyze_categories = 'rows'`, the `predicate=DynamicFilter [ ... ]` 
text on a parquet scan is stable across runs, so the same invariants can now be 
expressed as plain `EXPLAIN ANALYZE` SQL in sqllogictest, where they are easier 
to read, easier to update, and exercise the full SQL → logical optimizer → 
physical optimizer → execution pipeline rather than a single optimizer rule in 
isolation.
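   
   As a rough illustration of the pattern described above (not copied from the ported files), a TopK case might be set up along these lines. The table name, column names, and scratch path are hypothetical, and the plan output is deliberately omitted because in the real `.slt` files it is generated with `--complete`:
   
   ```sql
   -- Hypothetical scratch file and table; names are illustrative only.
   COPY (SELECT * FROM (VALUES (1, 'x'), (2, 'y'), (3, 'z')) AS v(a, b))
   TO 'test_files/scratch/example_topk/t.parquet' STORED AS PARQUET;
   
   CREATE EXTERNAL TABLE t_topk STORED AS PARQUET
   LOCATION 'test_files/scratch/example_topk/t.parquet';
   
   -- Emit only the deterministic 'rows' metric category (option added in #21160).
   SET datafusion.explain.analyze_categories = 'rows';
   
   -- ORDER BY + LIMIT is planned as a TopK, the shape whose dynamic filter
   -- the ported tests assert on via the scan's predicate text.
   EXPLAIN ANALYZE SELECT a FROM t_topk ORDER BY a LIMIT 1;
   ```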
   
   ## What changes are included in this PR?
   
   24 end-to-end filter-pushdown tests are ported from `filter_pushdown.rs` 
to sqllogictest and deleted from the Rust file. The helpers 
`run_aggregate_dyn_filter_case` and `run_projection_dyn_filter_case` (and their 
supporting structs) are deleted along with the tests that used them. The 24 
synchronous `#[test]` optimizer-rule-in-isolation tests are untouched: they stay 
in Rust because they specifically exercise `FilterPushdown::new()` / 
`OptimizationTest` over a hand-built plan.
   
   ### `datafusion/sqllogictest/test_files/push_down_filter_parquet.slt`
   
   New tests covering:
   
   - TopK dynamic filter pushdown integration (100k-row parquet, 
`max_row_group_size = 128`, asserting on `pushdown_rows_matched = 128` / 
`pushdown_rows_pruned = 99.87 K`)
   - TopK single-column and multi-column (compound-sort) dynamic filter shapes
   - HashJoin CollectLeft dynamic filter with `struct(a, b) IN (SET) ([...])` 
content
   - Nested hash joins propagating filters to both inner scans
   - Parent `WHERE` filter splitting across the two sides of a HashJoin
   - TopK above HashJoin, with both dynamic filters ANDed on the probe scan
   - Dynamic filter flowing through a `GROUP BY` sitting between a HashJoin and 
the probe scan
   - TopK projection rewrite — reorder, prune, expression, alias shadowing
   - NULL-bearing build-side join keys
   - `LEFT JOIN` and `LEFT SEMI JOIN` dynamic filter pushdown
   - HashTable strategy (`hash_lookup`) via `hash_join_inlist_pushdown_max_size 
= 1`, on both string and integer multi-column keys
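   
   The two settings named in the list above could be applied in a session roughly as follows. This is a sketch: the bullets give only the short option names, so the fully qualified keys shown here are assumptions and should be checked against the DataFusion configuration reference.
   
   ```sql
   -- Parquet writer setting: small row groups, so a large file splits into
   -- many units that can be pruned independently. Set before writing the file.
   SET datafusion.execution.parquet.max_row_group_size = 128;
   
   -- Assumed full key for `hash_join_inlist_pushdown_max_size`. Per the list
   -- above, a value of 1 is what selects the HashTable (`hash_lookup`) strategy.
   SET datafusion.optimizer.hash_join_inlist_pushdown_max_size = 1;
   ```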
   
   ### `datafusion/sqllogictest/test_files/push_down_filter_regression.slt`
   
   New tests covering:
   
   - Aggregate dynamic filter baseline: `MIN(a)`, `MAX(a)`, `MIN(a), MAX(a)`, 
`MIN(a), MAX(b)`, mixed `MIN/MAX` with an unsupported expression input, 
all-NULL input (filter stays `true`), `MIN(a+1)` (no filter emitted)
   - `WHERE` filter on a grouping column pushes through `AggregateExec`
   - `HAVING count(b) > 5` filter stays above the aggregate
   - End-to-end aggregate dynamic filter actually pruning a multi-file parquet 
scan
   
   The aggregate baseline tests run under `analyze_level = summary` + 
`analyze_categories = 'none'` so that metrics render empty and only the 
`predicate=DynamicFilter [ ... ]` content remains — the filter text is 
deterministic even though the pruning counts are subject to parallel-execution 
scheduling.
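   
   A minimal sketch of that configuration, assuming the settings live under the `datafusion.explain` namespace like `analyze_categories` does (the table and path are hypothetical, and no plan output is shown):
   
   ```sql
   -- Hypothetical scratch file and table; names are illustrative only.
   COPY (SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30)) AS v(a, b))
   TO 'test_files/scratch/example_agg/t.parquet' STORED AS PARQUET;
   
   CREATE EXTERNAL TABLE t_agg STORED AS PARQUET
   LOCATION 'test_files/scratch/example_agg/t.parquet';
   
   -- Summary-level output with no metric categories, so only the plan text
   -- (including the predicate on the scan) is left to compare.
   SET datafusion.explain.analyze_level = 'summary';
   SET datafusion.explain.analyze_categories = 'none';
   
   -- One of the baseline aggregate shapes listed above.
   EXPLAIN ANALYZE SELECT MIN(a) FROM t_agg;
   ```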
   
   ### What stayed in Rust
   
   Ten async tests now carry a short `// Not portable to sqllogictest: …` 
header explaining why. In short, they either:
   
   - Hand-wire `PartitionMode::Partitioned` or a `RepartitionExec` boundary 
that SQL never constructs for the sizes of data these tests use
   - Assert via debug-only APIs 
(`HashJoinExec::dynamic_filter_for_test().is_used()`, 
`ExecutionPlan::apply_expressions()` + 
`downcast_ref::<DynamicFilterPhysicalExpr>`) that are not observable from SQL
   - Target the specific stacked-`FilterExec` shape (#20109 regression) that 
the logical optimizer collapses before physical planning
   
   ## Are these changes tested?
   
   Yes — the ported tests _are_ the tests. Each ported slt case was generated 
with `cargo test -p datafusion-sqllogictest --test sqllogictests -- <file> 
--complete`, then re-run twice back-to-back without `--complete` to confirm 
determinism. The remaining Rust `filter_pushdown` tests continue to pass 
(`cargo test -p datafusion --test core_integration filter_pushdown` → 47 
passed, 0 failed). `cargo clippy -p datafusion --tests -- -D warnings` and 
`cargo fmt --all` are clean.
   
   ## Test plan
   
   - [x] `cargo test -p datafusion-sqllogictest --test sqllogictests -- 
push_down_filter`
   - [x] `cargo test -p datafusion --test core_integration filter_pushdown`
   - [x] `cargo clippy -p datafusion --tests -- -D warnings`
   - [x] `cargo fmt --all`
   
   ## Are there any user-facing changes?
   
   No. This is a test-only refactor.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

