Re: [PR] feat: globally reorder files and row groups by statistics for TopK queries [datafusion]

via GitHub Thu, 30 Apr 2026 03:17:37 -0700


zhuqi-lucas commented on PR #21956:
URL: https://github.com/apache/datafusion/pull/21956#issuecomment-4351600543


   The benchmark results are expected: row group (RG) reorder alone doesn't 
skip any row groups. It only changes the read order so that TopK's dynamic 
filter threshold converges faster.
   
   The significant speedup (2-3x on `sort_pushdown_inexact`) comes from **stats 
init + cumulative RG prune**, which will be in the follow-up PR. Those 
optimizations depend on RG reorder as a foundation:
   
   1. **RG reorder**: put best RGs first (this PR)
   2. **Stats init**: initialize TopK threshold from RG statistics before 
reading → prune RGs upfront (next PR)
   3. **Cumulative prune**: after reorder, truncate remaining RGs once enough 
rows are collected (next PR)
   
   Without reorder, cumulative prune might truncate the wrong RGs. Reorder 
ensures the best RGs come first, making truncation safe and effective.
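
   As a rough illustration of how the three steps fit together, here is a 
minimal, self-contained sketch for `ORDER BY col ASC LIMIT k`. It is not the 
code from this PR or the follow-up: `RgStats`, `reorder_for_asc`, 
`initial_threshold` and `cumulative_prune` are hypothetical names, the sort 
column is assumed to be a non-null `i64`, and the explicit `min > bound` 
overlap check in step 3 is an assumption added here so that the truncation is 
provably safe.

```rust
/// Hypothetical per-row-group statistics for the sort column
/// (an illustration only, not DataFusion's or parquet's types).
#[derive(Debug, Clone, PartialEq)]
struct RgStats {
    id: usize,
    min: i64,
    max: i64,
    num_rows: usize,
}

/// Step 1 (RG reorder): for an ascending TopK, read the row groups
/// with the smallest `min` first.
fn reorder_for_asc(mut rgs: Vec<RgStats>) -> Vec<RgStats> {
    rgs.sort_by_key(|rg| (rg.min, rg.max));
    rgs
}

/// Step 2 (stats init): an upper bound on the k-th smallest value,
/// derived from statistics alone. Walking row groups by ascending `max`,
/// once they hold at least k rows the k-th smallest value cannot exceed
/// the last `max` seen. Any row group with `min` above it can be pruned.
fn initial_threshold(rgs: &[RgStats], k: usize) -> Option<i64> {
    let mut by_max: Vec<&RgStats> = rgs.iter().collect();
    by_max.sort_by_key(|rg| rg.max);
    let mut rows = 0usize;
    for rg in by_max {
        rows += rg.num_rows;
        if rows >= k {
            return Some(rg.max);
        }
    }
    None // fewer than k rows in total: nothing can be pruned
}

/// Step 3 (cumulative prune): `ordered` must already be sorted by `min`
/// ascending. Keep a prefix holding at least k rows, then truncate at the
/// first row group whose `min` exceeds every value in that prefix.
/// Because of the ordering, all later row groups are excluded as well.
fn cumulative_prune(ordered: &[RgStats], k: usize) -> &[RgStats] {
    let mut rows = 0usize;
    let mut bound = i64::MIN; // running max over the kept prefix
    for (i, rg) in ordered.iter().enumerate() {
        if rows >= k && rg.min > bound {
            return &ordered[..i];
        }
        rows += rg.num_rows;
        bound = bound.max(rg.max);
    }
    ordered
}
```

   In this sketch the truncation is only valid on the reordered list: the 
argument that every later row group also has `min > bound` relies on the 
ascending `min` order. Applied to row groups in arbitrary file order, the same 
cut could drop a row group that still holds qualifying rows.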


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

