darmie commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3917344339

   I profiled a large category of regressions and have a fix for them. Sharing 
findings below.
   
   ### Filter columns ⊆ projection columns: zero I/O benefit from RowFilter
   
   15 of the regressing ClickBench queries (Q10-Q22, Q25, Q27) filter on a 
column that is also in the `SELECT` projection. When all filter columns are 
already projected, the RowFilter provides no I/O savings because those columns 
must be decoded regardless. The overhead is pure loss.
   
   Flamegraph for Q10 (`WHERE MobilePhoneModel <> '' ... SELECT 
MobilePhoneModel`):
   
   Pushdown OFF: 
https://gist.github.com/darmie/2f59c391fddbfd9709d2d2fc162ff764#file-flamegraph_q10_off-svg
   Pushdown ON: 
https://gist.github.com/darmie/2f59c391fddbfd9709d2d2fc162ff764#file-flamegraph_q10_on-svg
   
   Three hot functions appear in the pushdown ON path totaling ~1100 extra CPU 
samples:
   - `try_next_batch` (406) — state machine orchestration
   - `ReadPlanBuilder::with_predicate` (370) — per-batch mask construction
   - `CachedArrayReader::fetch_batch` (296) — predicate column caching
   
   Meanwhile, only ~72 samples are saved by removing `FilterExec`, so the net 
cost is roughly 1,000 samples. EXPLAIN ANALYZE confirms `bytes_scanned` is 
identical in both modes.
   
   This is related to @Dandandan's observation that the last RowFilter never 
saves I/O when `#row_filters >= #projected_columns`. My fix is the simpler 
structural check: skip `build_row_filter()` entirely when 
`predicate_col_indices ⊆ projection_col_indices`, and apply the predicate as a 
vectorized batch filter post-decode instead.
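   
   A minimal sketch of the structural check described above, using only the 
standard library. The names `PredicatePlacement`, `is_subset`, and 
`choose_placement` are hypothetical and only illustrate the decision; they are 
not DataFusion or parquet APIs, and the actual change may be structured 
differently.

```rust
use std::collections::HashSet;

/// Where the predicate gets evaluated (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq)]
enum PredicatePlacement {
    /// Evaluate during decode via a row filter; this can pay off when some
    /// predicate column is not projected.
    RowFilter,
    /// Decode the projected columns first, then filter the whole batch.
    PostDecodeBatchFilter,
}

/// Returns true when every predicate column index is also a projected column index.
fn is_subset(predicate_col_indices: &[usize], projection_col_indices: &[usize]) -> bool {
    let projected: HashSet<usize> = projection_col_indices.iter().copied().collect();
    predicate_col_indices.iter().all(|idx| projected.contains(idx))
}

/// Skip the row filter when the predicate columns are all projected anyway,
/// since those columns must be decoded regardless.
fn choose_placement(
    predicate_col_indices: &[usize],
    projection_col_indices: &[usize],
) -> PredicatePlacement {
    if is_subset(predicate_col_indices, projection_col_indices) {
        PredicatePlacement::PostDecodeBatchFilter
    } else {
        PredicatePlacement::RowFilter
    }
}
```

   For a Q10-like shape, where the single filter column is also projected, 
`choose_placement(&[3], &[3, 7])` selects `PostDecodeBatchFilter`; if the 
predicate also touches an unprojected column, as in 
`choose_placement(&[3, 9], &[3, 7])`, the row filter is kept. The column 
indices here are made up for the example.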
   
   Results on key queries (pushdown ON with fix):
   - **Q19, Q26**: fully fixed — no regression vs pushdown OFF
   - **Q10, Q11, Q25**: 12-19% improvement vs baseline ON; a residual 
regression remains from `batch_filter` overhead compared with bare `FilterExec`
   
   I will open a PR with this fix soon.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

