Dandandan commented on PR #21580: URL: https://github.com/apache/datafusion/pull/21580#issuecomment-4266553697
Based on this PR and the morsel improvements, I am also thinking we could initialize the TopK statistics from column stats (at least for single columns) and make the initial threshold much tighter using min/max statistics (at the file / row group / page level):

* We have a file / row group with more than K (from TopK) rows
* We have a single ORDER BY column (directly after the scan)
* We can initialize/update the TopK using max (or min) statistics
* Also, if the new bound is tighter (smaller / bigger, depending on sort direction) than the current TopK threshold, we could update it to the tighter bound

I think this might help make the initial threshold much tighter, instead of having to read all of the first row groups with an uninitialized TopK.
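
To make the conditions above concrete, here is a rough sketch of how a bound could be derived from statistics alone. It is not DataFusion code: `RowGroupStats`, `tighten`, and `initial_threshold` are hypothetical names, the column is assumed to be a single `i64` sort key read directly after the scan with no filter in between, and the statistics are assumed to be exact. The reasoning is that a row group with at least K non-null values (which covers the "more than K rows" case) and minimum `m` guarantees K values `>= m`, so for `ORDER BY col DESC LIMIT K` nothing below `m` can reach the result. For ascending order the same argument applies to the maximum.

```rust
/// Hypothetical, simplified per-row-group statistics for one i64 sort column.
/// These are illustrative types, not DataFusion or parquet structures.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    /// Number of non-null values in the sort column (assumed exact).
    non_null_rows: usize,
    min: i64,
    max: i64,
}

/// Keep whichever bound is tighter for the given sort direction:
/// a larger lower bound for DESC, a smaller upper bound for ASC.
fn tighten(current: Option<i64>, candidate: i64, descending: bool) -> Option<i64> {
    match current {
        None => Some(candidate),
        Some(c) if descending => Some(c.max(candidate)),
        Some(c) => Some(c.min(candidate)),
    }
}

/// Derive an initial TopK threshold without reading any rows.
/// Only row groups holding at least `k` non-null values can contribute,
/// because only those guarantee `k` values on the right side of the bound.
fn initial_threshold(k: usize, groups: &[RowGroupStats], descending: bool) -> Option<i64> {
    groups
        .iter()
        .filter(|g| k > 0 && g.non_null_rows >= k)
        .fold(None, |bound, g| {
            // DESC: at least k values are >= g.min. ASC: at least k values are <= g.max.
            let candidate = if descending { g.min } else { g.max };
            tighten(bound, candidate, descending)
        })
}
```

For instance, with K = 100 and three row groups holding (1000 rows, min 10, max 500), (5 rows, min 900, max 950), and (2000 rows, min 40, max 800), the second group is ignored because it has fewer than K rows. The descending bound is 40 and the ascending bound is 500. If no row group qualifies, the function returns `None` and the TopK would start uninitialized as it does today.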
