Re: [PR] feat: reorder row groups by statistics during sort pushdown [datafusion]

via GitHub Fri, 17 Apr 2026 01:38:26 -0700


Dandandan commented on PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#issuecomment-4266553697


   Based on this PR and the morsel improvements, I am also thinking we could 
initialize the TopK threshold from column statistics (at least for a single 
sort column) and make the initial threshold much tighter using min/max 
statistics (at the file / row group / page level):
   
   * We have a file / row group with more than K (from TopK) rows
   * We have a single ORDER BY column (directly after the scan)
   * We can initialize/update the TopK threshold using the max (or min) 
statistics
   * Also, if the new bound is smaller / bigger (i.e. tighter) than the current 
TopK threshold, we could update it to the tighter bound
   
   I think this might help make the initial threshold much tighter, instead of 
having to read all of the first row groups with an uninitialized TopK.
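
   A rough sketch of the idea, to make the conditions above concrete. This is 
not DataFusion's API: `ColumnStats`, `SortOrder`, `initial_threshold` and 
`can_skip` are hypothetical names, the sort column is assumed to be an `i64`, 
and nulls are assumed to be excluded from the row count. For `ORDER BY c ASC 
LIMIT k`, any row group with at least `k` non-null rows guarantees that the 
k-th smallest value is no larger than that row group's max, so the smallest 
such max is a valid starting bound (and symmetrically with min for `DESC`):

```rust
/// Hypothetical, simplified statistics for one file / row group / page.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    /// Number of non-null rows of the sort column (assumed to be known).
    non_null_rows: usize,
    min: i64,
    max: i64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SortOrder {
    Ascending,
    Descending,
}

/// Derive an initial TopK bound from statistics alone.
/// Returns `None` when no container holds at least `k` non-null rows.
fn initial_threshold(stats: &[ColumnStats], k: usize, order: SortOrder) -> Option<i64> {
    // Only containers with at least k rows can guarantee k qualifying rows.
    let candidates = stats.iter().filter(|s| k > 0 && s.non_null_rows >= k);
    match order {
        // ASC: the k-th smallest value is <= max of any candidate; keep the tightest.
        SortOrder::Ascending => candidates.map(|s| s.max).min(),
        // DESC: the k-th largest value is >= min of any candidate; keep the tightest.
        SortOrder::Descending => candidates.map(|s| s.min).max(),
    }
}

/// A container can be skipped when all of its values are strictly worse than the bound.
/// Values equal to the bound are kept, since they may still be part of the result.
fn can_skip(s: &ColumnStats, threshold: i64, order: SortOrder) -> bool {
    match order {
        SortOrder::Ascending => s.min > threshold,
        SortOrder::Descending => s.max < threshold,
    }
}
```

   With such a bound available before any data is read, row groups whose 
min (or max) falls on the wrong side of it could be pruned up front, and the 
bound could still be tightened later as actual rows arrive.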


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


