Dandandan commented on PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#issuecomment-4266553697

   Based on this PR and the morsel improvements, I am also thinking we could 
initialize the TopK threshold from column statistics (at least for a single 
sort column) and make the initial threshold much tighter using min/max 
statistics (at file / row group / page level):
   
   * A file / row group has more than K rows (K being the TopK limit)
   * There is a single ORDER BY column (directly after the scan)
   * We can then initialize/update the TopK threshold using the max (or min) 
statistics
   * If the new bound is tighter than the current TopK threshold (smaller for 
ascending order, larger for descending), we could update to the tighter bound
   
   I think this might help make the initial threshold much tighter, instead of 
having to read all of the first row groups with an uninitialized TopK.
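
   A minimal sketch of the idea in Rust, not tied to any actual DataFusion 
API. `ColumnStats`, `SortOrder` and `threshold_from_stats` are hypothetical 
names invented for illustration. It assumes an `i64` sort column, statistics 
that are valid bounds, and NULLs that never rank among the top K (so only 
non-null rows are counted). The sketch uses `>= k`, since a container with at 
least K non-null rows is already enough for the bound to hold:

```rust
/// Hypothetical min/max statistics for one column of one container
/// (file, row group, or page). Not a DataFusion type.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    non_null_rows: usize,
    min: i64,
    max: i64,
}

/// Direction of the single ORDER BY column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortOrder {
    Ascending,
    Descending,
}

/// Derive a TopK threshold from statistics alone, before reading any rows.
///
/// For ascending order the threshold is an upper bound on the K-th smallest
/// value: a container with at least `k` non-null rows holds `k` values that
/// are all <= its `max`. For descending order the same argument gives `min`
/// as a lower bound on the K-th largest value.
/// Rows equal to the threshold may still be needed, so a filter built from
/// it must be non-strict (`<=` or `>=`).
fn threshold_from_stats(
    stats: &[ColumnStats],
    k: usize,
    order: SortOrder,
    current: Option<i64>,
) -> Option<i64> {
    let mut threshold = current;
    for s in stats {
        // Containers with fewer than `k` rows cannot bound the K-th value.
        if k == 0 || s.non_null_rows < k {
            continue;
        }
        let candidate = match order {
            SortOrder::Ascending => s.max,
            SortOrder::Descending => s.min,
        };
        // Only ever move to a tighter bound.
        threshold = Some(match (threshold, order) {
            (None, _) => candidate,
            (Some(t), SortOrder::Ascending) => t.min(candidate),
            (Some(t), SortOrder::Descending) => t.max(candidate),
        });
    }
    threshold
}
```

   Passing the existing threshold as `current` covers the update case: the 
result is only replaced when a container's statistics give a tighter bound.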


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

