zhuqi-lucas commented on PR #21956: URL: https://github.com/apache/datafusion/pull/21956#issuecomment-4466266314
> [#21956 (comment)](https://github.com/apache/datafusion/pull/21956#issuecomment-4458821315) > > @zhuqi-lucas these benches are good but not all that impressive. I'm not sure if it's just that they are already trivially fast (<10ms) or if we need to tweak them. Yeah @adriangb , this is expected: The wins are bounded by what `Inexact` + `TopK` (and reverse-`TopK`) can do — there's an inherent ceiling, and this PR pushes against it rather than breaking past it. Concretely: in the `Inexact` path, `LIMIT N` doesn't propagate as a static stop signal to the source. The dynamic-filter pushdown can stats-prune *subsequent* row groups once the heap-driven threshold tightens, which already does a lot of work — but inside the row group the source is currently reading, the sort column still has to be **fully decoded** so the filter can be evaluated row by row. **Phase 3's stats reorder this PR** makes that threshold tighten on the first RG instead of after a few, but the per-RG cost floor stays. That's the ceiling. Internally we still keep our own RG-level `Exact` reverse running in production — it's net faster end-to-end on this shape because deleting `SortExec` turns `LIMIT` into a static fetch on the source and removes the heap / per-row filter / final-ordering overhead. We can't land RG-level `Exact` upstream though: parquet doesn't allow partial row-group reads, so it inherits the same ~128 MB peak that closed #18817. The fix is **page-level `Exact` reverse via apache/arrow-rs#9937**. Seek to the last page via `OffsetIndex`, decode forward, reverse the batch, emit — peak memory drops to one page (~1 MB), and the static-fetch early termination is preserved at page granularity. Once that primitive ships, the DataFusion-side integration can replace the `Inexact` + `TopK` shape for `ORDER BY ts DESC LIMIT N` outright, and the ceiling we're hitting today goes away — without the RG-level memory regression. Separately for the forward `Inexact` path (**Phase 3 stats reorder, not reverse-specific this PR**): if we get a **row-group-granular morsel API** later — the shared work queue (#21351) currently holds `PartitionedFile`, but if it held `(file, rg_idx)` instead, an idle partition could release its current file's remaining RGs back to the pool once a global "K satisfied" signal fires — Phase 3's reorder path itself would get true cross-partition early termination too. That removes the "currently-decoding RG must finish" floor on the `Inexact` side as well. Not in this PR, clean follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
