alamb commented on issue #5133:
URL: 
https://github.com/apache/arrow-datafusion/issues/5133#issuecomment-1422957942

   > PTAL. Detail design for fully stream aggregation will be given in the next 
following days.
   
   Thank you @xiaoyong-z  -- I read your addition and left some comments. 
Overall I think it is a great idea. 
   
   Here is one possibly approach to implementation (perhaps what you had in 
mind):
   1. Implement `StreamAggregate` that handles pre-sorted data (where the data 
is already sorted according to the grouping keys/ partition keys). 
   2.  Remove  `AggregateStream` 
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/no_grouping.rs
  and replace its use in the optimizer with the new StreamAggregate operator
   3. Update the optimizer to recognize when the input to a GroupByHash is 
sorted appropriately and switch to using the AggregateStream operator.
   
   I think that would get us pretty far. 
   
   I am not sure about the idea of "sort the data first and then run the stream 
aggregator" -- as I mentioned in the document I think it is unlikely that 
approach will be better in terms of overall memory usage or performance. 
   
   When we want to support spilling group by (external group by) that is when 
sort might be beneficial. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to