zhuqi-lucas commented on PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#issuecomment-4266618248
> Based on this PR and morsel improvements, I am also thinking we could
> initialize the TopK statistics from column stats (at least for single columns)
> and make the initial threshold much tighter based on min/max statistics (at
> file / rowgroup/page level):
>
> * We have a file/rowgroup with more than K (from TopK) amount of rows
> * We have a single sort column (directly after scan)
> * We can initialize/update the TopK using max (or min) statistics
> * Also, if the new bound is smaller / bigger than the current TopK, we
>   could update it to the tighter bound
>
> This I think might help making initial threshold much tighter instead of
> having to read all the first row groups using not-initialized TopK.
Great idea @Dandandan, this eliminates the cold-start problem completely!
With stats-based initialization, the threshold is already tight before any
data is read, so combined with row-group (RG) reordering and the page index,
a LIMIT 10 on a 6M-row file could read just a single page.
I see you already created #21691 to track this. This composes nicely with
the existing optimization chain (#21580 → #21399 → #21691). Would love to help
with the implementation!
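
To make the quoted idea concrete, here is a minimal sketch of the stats-based initialization for `ORDER BY col ASC LIMIT k` on a single sort column. It is only an illustration under stated assumptions: `RowGroupStats`, `initial_threshold_asc`, and `can_skip_asc` are hypothetical names, not DataFusion APIs, the column is taken to be `i64`, and nulls are ignored apart from counting only non-null values.

```rust
/// Hypothetical per-row-group statistics for the single sort column.
/// This is not a DataFusion or parquet type; it only models min/max/row count.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
    /// Number of non-null values in the row group (assumed to be known).
    non_null_rows: usize,
}

/// For `ORDER BY col ASC LIMIT k`, derive an initial threshold from statistics.
///
/// A row group holding at least `k` non-null values guarantees that at least
/// `k` rows are `<= max` of that row group, so the k-th smallest value overall
/// cannot exceed that `max`. The tightest such bound is the smallest `max`
/// among the qualifying row groups. Returns `None` if no row group qualifies.
fn initial_threshold_asc(stats: &[RowGroupStats], k: usize) -> Option<i64> {
    stats
        .iter()
        .filter(|s| s.non_null_rows >= k)
        .map(|s| s.max)
        .min()
}

/// A row group can be skipped when every value in it is strictly greater than
/// the threshold. Ties (`min == threshold`) are kept to stay conservative.
fn can_skip_asc(rg: &RowGroupStats, threshold: Option<i64>) -> bool {
    matches!(threshold, Some(t) if rg.min > t)
}
```

With row groups `[500, 900]`, `[0, 100]`, and `[101, 400]` of 1000 rows each and `k = 10`, the initial threshold in this sketch is 100, so the first and third row groups can be skipped before any data is read. The descending case would mirror this with the largest `min` as a lower bound.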
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]