darmie commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3917344339
I profiled a large category of regressions and have a fix for them. Sharing findings below.

### Filter columns ⊆ projection columns: zero I/O benefit from RowFilter

15 of the regressing ClickBench queries (Q10-Q22, Q25, Q27) filter on a column that is also in the `SELECT` projection. When all filter columns are already projected, the RowFilter provides no I/O savings because those columns must be decoded regardless. The overhead is pure loss.

Flamegraph for Q10 (`WHERE MobilePhoneModel <> '' ... SELECT MobilePhoneModel`):

- Pushdown OFF: https://gist.github.com/darmie/2f59c391fddbfd9709d2d2fc162ff764#file-flamegraph_q10_off-svg
- Pushdown ON: https://gist.github.com/darmie/2f59c391fddbfd9709d2d2fc162ff764#file-flamegraph_q10_on-svg

Three hot functions appear in the pushdown ON path, totaling ~1100 extra CPU samples:

- `try_next_batch` (406): state machine orchestration
- `ReadPlanBuilder::with_predicate` (370): per-batch mask construction
- `CachedArrayReader::fetch_batch` (296): predicate column caching

Meanwhile, only ~72 samples are saved from removing `FilterExec`. EXPLAIN ANALYZE confirms `bytes_scanned` is identical in both modes.

This is related to @Dandandan's observation that the last RowFilter never saves I/O when `#row_filters >= #projected_columns`. My fix is the simpler structural check: skip `build_row_filter()` entirely when `predicate_col_indices ⊆ projection_col_indices`, and apply the predicate as a vectorized batch filter post-decode instead.

Results on key queries (pushdown ON with fix):

- **Q19, Q26**: fully fixed, no regression vs pushdown OFF
- **Q10, Q11, Q25**: 12-19% improvement vs baseline ON; a residual regression remains from batch_filter overhead vs bare FilterExec

I will open a PR with this fix soon.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
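For readers following the thread, the structural check described in the comment (skip the row filter when every predicate column is already projected) might look roughly like the sketch below. This is only an illustration under assumptions, not the code from the forthcoming PR: `FilterStrategy` and `choose_filter_strategy` are hypothetical names, and the column index slices stand in for whatever DataFusion's Parquet opener actually tracks.

```rust
use std::collections::HashSet;

/// Hypothetical enum (not a DataFusion type): where the predicate is evaluated.
#[derive(Debug, PartialEq, Eq)]
enum FilterStrategy {
    /// Build a RowFilter and evaluate the predicate during decode.
    RowFilter,
    /// Decode the projected columns first, then filter each batch.
    PostDecodeBatchFilter,
}

/// Hypothetical helper: returns `PostDecodeBatchFilter` when every predicate
/// column is also a projected column, since a RowFilter cannot skip decoding
/// any column in that case. A predicate that references no columns is
/// trivially a subset and also takes the post-decode path.
fn choose_filter_strategy(
    predicate_col_indices: &[usize],
    projection_col_indices: &[usize],
) -> FilterStrategy {
    let projected: HashSet<usize> = projection_col_indices.iter().copied().collect();
    let all_projected = predicate_col_indices
        .iter()
        .all(|idx| projected.contains(idx));
    if all_projected {
        FilterStrategy::PostDecodeBatchFilter
    } else {
        FilterStrategy::RowFilter
    }
}

fn main() {
    // Q10-like shape: the filter column (index 3) is also selected.
    println!("{:?}", choose_filter_strategy(&[3], &[3, 7]));
    // The filter column (index 5) is not projected, so a RowFilter may still
    // avoid decoding the projected columns for rows that fail the predicate.
    println!("{:?}", choose_filter_strategy(&[5], &[3, 7]));
}
```

The subset test is deliberately coarser than a per-filter cost model: it only identifies the case where the row filter has no I/O to save.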
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
For additional commands, e-mail: [email protected]
