darmie opened a new pull request, #20417:
URL: https://github.com/apache/datafusion/pull/20417

   ## Which issue does this PR close?
   
   - Closes part of #20324 (addresses the "filter columns ⊆ projection columns" 
category of regressions).
   - Related: #20325 (Q10 investigation)
   
   ## Rationale for this change
   
   When `pushdown_filters = true` and all predicate columns are already in the 
output projection, the arrow-rs `RowFilter` (late materialization) machinery 
provides **zero I/O benefit**, because those columns must be decoded for the 
projection anyway. Yet the RowFilter adds substantial CPU overhead from 
`CachedArrayReader`, `ReadPlanBuilder::with_predicate`, and 
`ParquetDecoderState::try_next_batch` (~1100 extra CPU samples in the Q10 
flamegraph). This causes regressions on 15 of the 43 ClickBench queries.
   
   See [profiling 
details](https://github.com/apache/datafusion/issues/20324#issuecomment-3917344339).
   
   ## What changes are included in this PR?
   
   In `opener.rs`, before calling `build_row_filter()`, check whether all 
predicate column indices are a subset of the projection column indices. If so:
   - Skip `build_row_filter()` entirely (no RowFilter overhead)
   - Apply the predicate as a vectorized batch filter post-decode using 
`batch_filter()`
   - Filter out empty batches from the stream
   
   If not a subset (i.e., there are non-projected columns that could be 
skipped), proceed with the RowFilter path as before.
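
The decision described above can be sketched in isolation. This is a minimal illustration, not the code in `opener.rs`: the `FilterPath` enum and both function names are hypothetical, and column indices are modeled as plain `usize` slices.

```rust
use std::collections::HashSet;

/// Which filtering path would be taken (illustrative names, not DataFusion types).
#[derive(Debug, PartialEq, Eq)]
enum FilterPath {
    /// Decode the projected columns, then filter whole batches.
    PostDecodeBatchFilter,
    /// Late materialization through a row filter.
    RowFilter,
}

/// Returns true when every predicate column index is also a projected column.
fn predicate_cols_subset_of_projection(
    predicate_cols: &[usize],
    projection_cols: &[usize],
) -> bool {
    let projected: HashSet<usize> = projection_cols.iter().copied().collect();
    predicate_cols.iter().all(|idx| projected.contains(idx))
}

/// Picks the post-decode batch filter only when the row filter could not
/// skip decoding any column, i.e. when the predicate columns are a subset
/// of the projection.
fn choose_filter_path(predicate_cols: &[usize], projection_cols: &[usize]) -> FilterPath {
    if predicate_cols_subset_of_projection(predicate_cols, projection_cols) {
        FilterPath::PostDecodeBatchFilter
    } else {
        FilterPath::RowFilter
    }
}
```

For example, a predicate on column 1 with projection `[0, 1, 2]` selects the batch filter path, while a predicate on column 3 with projection `[0, 1]` keeps the RowFilter path.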
   
   ClickBench results on key regression queries (pushdown ON, fix vs baseline):
   - **Q19**: 0.46x vs baseline (fully fixed; faster than pushdown OFF)
   - **Q26**: 0.53x vs baseline (fully fixed)
   - **Q10, Q11, Q25**: 12-19% improvement vs baseline
   
   ## Are these changes tested?
   
   Yes. Added `test_skip_row_filter_when_filter_cols_subset_of_projection` 
which validates:
   1. Batch filter path (filter cols ⊆ projection): correct row counts and 
values
   2. RowFilter path (filter cols ⊄ projection): correct filtered values
   3. Batch filter with no matches: 0 rows, 0 batches (empty batches filtered)
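
The behaviour checked in cases 1 and 3 can be modeled with a toy example. This sketch assumes a single-column batch represented as `Vec<i64>`; the `Batch` alias and `filter_batches` are hypothetical stand-ins, not the `batch_filter()` function or the test added in this PR.

```rust
/// Toy stand-in for a decoded record batch: one column of i64 values.
type Batch = Vec<i64>;

/// Applies the predicate to every decoded batch, then drops batches that
/// end up with no rows, so a predicate with no matches yields zero batches.
fn filter_batches<P>(batches: Vec<Batch>, predicate: P) -> Vec<Batch>
where
    P: Fn(i64) -> bool,
{
    batches
        .into_iter()
        // Keep only the rows that satisfy the predicate.
        .map(|batch| batch.into_iter().filter(|v| predicate(*v)).collect::<Batch>())
        // Remove batches left empty by the filter.
        .filter(|batch| !batch.is_empty())
        .collect()
}
```

Dropping empty batches after filtering is what makes the no-match case report 0 batches rather than several zero-row batches.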
   
   All existing tests pass (81 tests in `datafusion-datasource-parquet`).
   
   ## Are there any user-facing changes?
   
   No. Behavior is identical: queries return the same results. Performance 
improves for queries where all filter columns are already in the projection 
and `pushdown_filters = true`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

