Re: [PR] feat: reorder row groups by statistics during sort pushdown [datafusion]

via GitHub Fri, 17 Apr 2026 01:50:22 -0700


zhuqi-lucas commented on PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#issuecomment-4266618248


   > Based on this PR and morsel improvements, I am also thinking we could 
initialize the TopK statistics from column stats (at least for single columns) 
and make the initial threshold much tighter based on min/max statistics (at 
file / rowgroup/page level):
   > 
   > * We have a file/rowgroup with more than K (from TopK) rows
   > * We have a single sort column (directly after scan)
   > * We can initialize/update the TopK using max (or min) statistics
   > * Also, if the new bound is smaller / bigger than the current TopK, we 
could update it to the tighter bound
   > 
   > I think this might help make the initial threshold much tighter, instead of 
having to read the first row groups with an uninitialized TopK.
   
   Great idea @Dandandan, this eliminates the cold-start problem completely! 
With stats-based initialization, the threshold is already tight before any 
data is read, so combined with RG reorder and the page index, a LIMIT 10 on a 
6M-row file could read just a single page.
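   
   A minimal sketch of the stats-based initialization, to make the idea 
concrete. It is not DataFusion's API: `RowGroupStats`, `initial_threshold` and 
`surviving_row_groups` are hypothetical names, and it assumes a single `i64` 
sort column, `ORDER BY col ASC LIMIT k`, exact min/max statistics, and no 
nulls. Any row group with at least `k` rows guarantees that the k-th smallest 
value is no larger than that row group's max, so the smallest such max is a 
valid starting threshold.

```rust
/// Hypothetical, simplified per-row-group statistics for one sort column.
/// Assumes exact min/max values and no nulls.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
    num_rows: usize,
}

/// Upper bound on the k-th smallest value for `ORDER BY col ASC LIMIT k`.
///
/// A row group with at least `k` rows contains `k` values that are all
/// `<= max`, so the final TopK threshold cannot exceed that `max`.
/// The tightest bound is the smallest such `max`. Returns `None` when no
/// single row group has `k` rows, in which case nothing can be concluded.
fn initial_threshold(stats: &[RowGroupStats], k: usize) -> Option<i64> {
    stats
        .iter()
        .filter(|s| s.num_rows >= k)
        .map(|s| s.max)
        .min()
}

/// Indices of row groups that may still contribute to the TopK result.
///
/// A row group whose `min` is strictly greater than the threshold cannot
/// hold any of the `k` smallest values and can be skipped without reading it.
fn surviving_row_groups(stats: &[RowGroupStats], k: usize) -> Vec<usize> {
    match initial_threshold(stats, k) {
        Some(threshold) => stats
            .iter()
            .enumerate()
            .filter(|(_, s)| s.min <= threshold)
            .map(|(i, _)| i)
            .collect(),
        // No bound available: keep every row group.
        None => (0..stats.len()).collect(),
    }
}
```

   The same check could be applied at file or page level with the matching 
statistics. For a descending sort the roles flip: take the largest `min` among 
units with at least `k` rows and skip units whose `max` is below it.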
   
   I see you already created #21691 to track this. This composes nicely with 
the existing optimization chain (#21580 → #21399 → #21691). Would love to help 
with the implementation!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


