Re: [PR] feat(parquet): runtime row-group early stop via TopK dynamic filter [datafusion]

via GitHub Tue, 16 Jun 2026 07:08:56 -0700


zhuqi-lucas commented on PR #22450:
URL: https://github.com/apache/datafusion/pull/22450#issuecomment-4719660563


   Update on this — I prototyped the per-RG toggle: push `fully_matched` 
through `PreparedAccessPlan` into the per-RG `rg_plan` entry, then at each 
row-group boundary call 
`decoder.into_builder().with_row_filter(empty_or_real).build()` to flip the 
filter. Limit budget carries across the rebuild correctly. But it exposed a 
latent invariant in `push_decoder.rs` worth flagging.
   
   The current `rg_plan` assumes a 1:1 correspondence between 
`try_next_reader()` results and the head of the plan — we `pop_front()` once 
per reader. That holds when every RG in the plan actually yields a reader. It 
does **not** hold when page-index pruning eliminates every page of an RG: 
arrow-rs's `try_next_reader` silently advances past the empty RG and returns 
the reader for the next one. Concretely, on `limit_pruning.slt`'s `species > 
'M' AND s >= 50 ORDER BY species LIMIT 3`:
   
   - Initial plan `[RG1 (not fm), RG2 (fm), RG3 (not fm)]`, filter installed.
   - arrow-rs page-index-prunes RG 1 entirely (`page_index_pages_pruned=2`).
   - `try_next_reader` returns the reader for **RG 2**; my code pops RG 1, 
leaving plan `[RG2, RG3]`.
   - Next boundary sees `front = RG2 (fm)`, triggers a toggle to empty filter, 
and rebuilds with `with_row_groups([2, 3])` — the new decoder reads RG 2 
**again**.
   - Scan output: 6 rows instead of 3 (final query result still correct because 
TopK ranks them, but the metric is off).
   
   Without the toggle, this off-by-one was harmless — we never used 
`rg_plan.front()` to redirect the decoder, so the wrong-head condition was 
invisible. The toggle made it observable.
   
   Fixing it cleanly needs one of:
   
   1. An arrow-rs hook to query the decoder's *actual* next-RG index after 
`try_next_reader` (doesn't exist today).
   2. Pre-computing which RGs page-index will fully eliminate at file open and 
dropping them from `rg_plan` (replicates arrow-rs internals — fragile).
   3. Tracking position by RG metadata identity rather than queue order.
   
   I'm leaving the per-RG toggle out of this PR and will pick it up as a 
follow-up. The practical perf gap is narrow: arrow-rs's page-level 
\`page_index_pages_skipped_by_fully_matched\` already skips per-row filter eval 
on fully-matched RGs whenever the page index is populated — i.e. for any RG 
large enough to span multiple data pages, which is the typical production 
layout. The case this PR doesn't (yet) recover is small RGs where no page index 
helps, plus predicate columns that the page index doesn't cover. Happy to file 
an issue and tackle it after this lands.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(parquet): runtime row-group early stop via TopK dynamic filter [datafusion]

Reply via email to