darmie opened a new pull request, #20417: URL: https://github.com/apache/datafusion/pull/20417
## Which issue does this PR close?

- Closes part of #20324 (addresses the "filter columns ⊆ projection columns" category of regressions).
- Related: #20325 (Q10 investigation)

## Rationale for this change

When `pushdown_filters = true` and all predicate columns are already in the output projection, the arrow-rs `RowFilter` (late materialization) machinery provides **zero I/O benefit**, because those columns must be decoded for the projection anyway. Yet the RowFilter adds substantial CPU overhead from `CachedArrayReader`, `ReadPlanBuilder::with_predicate`, and `ParquetDecoderState::try_next_batch` (~1100 extra CPU samples on the Q10 flamegraph). This causes regressions on 15 of the 43 ClickBench queries. See [profiling details](https://github.com/apache/datafusion/issues/20324#issuecomment-3917344339).

## What changes are included in this PR?

In `opener.rs`, before calling `build_row_filter()`, check whether all predicate column indices are a subset of the projection column indices. If so:

- Skip `build_row_filter()` entirely (no RowFilter overhead)
- Apply the predicate as a vectorized batch filter post-decode using `batch_filter()`
- Filter out empty batches from the stream

If not a subset (i.e., there are non-projected columns that could be skipped), proceed with the RowFilter path as before.

ClickBench results on key regression queries (pushdown ON, fix vs baseline):

- **Q19**: 0.46x vs baseline (fully fixed, faster than pushdown OFF)
- **Q26**: 0.53x vs baseline (fully fixed)
- **Q10, Q11, Q25**: 12-19% improvement vs baseline

## Are these changes tested?

Yes. Added `test_skip_row_filter_when_filter_cols_subset_of_projection`, which validates:

1. Batch filter path (filter cols ⊆ projection): correct row counts and values
2. RowFilter path (filter cols ⊄ projection): correct filtered values
3. Batch filter with no matches: 0 rows, 0 batches (empty batches filtered)

All existing tests pass (81 tests in `datafusion-datasource-parquet`).

## Are there any user-facing changes?

No. Behavior is identical: queries return the same results. Performance improves for queries where filter columns overlap with projection columns when `pushdown_filters = true`.
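
The subset check described under "What changes are included in this PR?" can be illustrated with a minimal, standard-library-only sketch. This is not the code in `opener.rs`: the names `FilterStrategy`, `choose_filter_strategy`, and `filter_batches` are hypothetical, and plain `Vec<i64>` values stand in for Arrow record batches, so it only shows the shape of the decision and of the post-decode filtering, under those assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical enum for illustration; the PR itself does not define it.
#[derive(Debug, PartialEq, Eq)]
enum FilterStrategy {
    /// Decode the projected columns, then filter each decoded batch.
    PostDecodeBatchFilter,
    /// Build a RowFilter so non-projected predicate columns can be skipped.
    RowFilter,
}

/// Pick the post-decode path only when every predicate column index is
/// also a projection column index (predicate cols ⊆ projection cols).
/// Note: an empty predicate column list is trivially a subset.
fn choose_filter_strategy(predicate_cols: &[usize], projection_cols: &[usize]) -> FilterStrategy {
    let projected: HashSet<usize> = projection_cols.iter().copied().collect();
    if predicate_cols.iter().all(|c| projected.contains(c)) {
        FilterStrategy::PostDecodeBatchFilter
    } else {
        FilterStrategy::RowFilter
    }
}

/// Stand-in for the post-decode path: apply the predicate to every batch
/// and drop batches that end up empty. Each batch is modeled as a Vec<i64>
/// (an assumption made to keep the sketch free of Arrow dependencies).
fn filter_batches(batches: Vec<Vec<i64>>, keep: impl Fn(&i64) -> bool) -> Vec<Vec<i64>> {
    batches
        .into_iter()
        .map(|batch| batch.into_iter().filter(|v| keep(v)).collect::<Vec<_>>())
        .filter(|batch| !batch.is_empty())
        .collect()
}
```

With predicate columns `[1, 2]` and projection columns `[0, 1, 2]`, the sketch selects the post-decode path; with predicate columns `[1, 3]`, column 3 is not projected, so it falls back to the RowFilter path.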
