Re: [PR] feat: globally reorder files and row groups by statistics for TopK queries [datafusion]

via GitHub Sat, 16 May 2026 01:06:25 -0700


zhuqi-lucas commented on PR #21956:
URL: https://github.com/apache/datafusion/pull/21956#issuecomment-4466266314


   > [#21956 
(comment)](https://github.com/apache/datafusion/pull/21956#issuecomment-4458821315)
   > 
   > @zhuqi-lucas these benches are good but not all that impressive. I'm not 
sure if it's just that they are already trivially fast (<10ms) or if we need to 
tweak them.
   
   Yeah @adriangb , this is expected:
   
   
    The wins are bounded by what `Inexact` + `TopK` (and reverse-`TopK`) can do 
— there's an inherent ceiling, and this PR pushes against it rather than 
breaking past it.
   
     Concretely: in the `Inexact` path, `LIMIT N` doesn't propagate as a static 
stop signal to the source. The dynamic-filter pushdown can stats-prune 
*subsequent* row groups once the heap-driven threshold tightens,
     which already does a lot of work — but inside the row group the source is 
currently reading, the sort column still has to be **fully decoded** so the 
filter can be evaluated row by row. **Phase 3's stats reorder this PR**
     makes that threshold tighten on the first RG instead of after a few, but 
the per-RG cost floor stays. That's the ceiling.
   
     Internally we still keep our own RG-level `Exact` reverse running in 
production — it's net faster end-to-end on this shape because deleting 
`SortExec` turns `LIMIT` into a static fetch on the source and removes
     the heap / per-row filter / final-ordering overhead. We can't land 
RG-level `Exact` upstream though: parquet doesn't allow partial row-group 
reads, so it inherits the same ~128 MB peak that closed #18817.
   
     The fix is **page-level `Exact` reverse via apache/arrow-rs#9937**. Seek 
to the last page via `OffsetIndex`, decode forward, reverse the batch, emit — 
peak memory drops to one page (~1 MB), and the static-fetch
     early termination is preserved at page granularity. Once that primitive 
ships, the DataFusion-side integration can replace the `Inexact` + `TopK` shape 
for `ORDER BY ts DESC LIMIT N` outright, and the ceiling
     we're hitting today goes away — without the RG-level memory regression.
   
     Separately for the forward `Inexact` path (**Phase 3 stats reorder, not 
reverse-specific this PR**): if we get a **row-group-granular morsel API** 
later — the shared work queue (#21351) currently holds `PartitionedFile`, but
      if it held `(file, rg_idx)` instead, an idle partition could release its 
current file's remaining RGs back to the pool once a global "K satisfied" 
signal fires — Phase 3's reorder path itself would get true
     cross-partition early termination too. That removes the 
"currently-decoding RG must finish" floor on the `Inexact` side as well. Not in 
this PR, clean follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: globally reorder files and row groups by statistics for TopK queries [datafusion]

Reply via email to