adriangb commented on PR #21580:
URL: https://github.com/apache/datafusion/pull/21580#issuecomment-4268423915

   > Another idea (maybe even more impactful/generally applicable than the pure 
- stats):
   > 
   > Currently many queries go like Scan -> Filter -> Aggregate(Partial) -> 
Aggregate(Final) -> TopK
   > 
   > Needs to first go through the the full aggregation before it reaches Topk 
(so it's not useful to push down to scan as we usually scan the entire dataset 
before emitting the partial aggregate)
   > 
   > We can update bounds from aggregate, once we detect at least K distinct 
rows in a Partial aggregate.
   > 
   > In practice it would mean after the first batch / batches arrive we 
already have e.g. 10 rows, so can already prune / filter with the much tighter 
bound.
   
   I wonder if we could do `Scan -> Filter -> Aggregate(Partial) -> TopK 
(partition) -> Aggregate(Final) -> TopK (final)`?
   
   The point is that if the partial agg starts emitting partial groups the 
partition topk can start pushing down bounds based on those groups.
   
   I looked into connecting the partial agg to topk and it was messy, they kind 
of need to know about each other, it feels like it's basically a fused operator.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to