alamb commented on issue #5133: URL: https://github.com/apache/arrow-datafusion/issues/5133#issuecomment-1422957942
> PTAL. Detail design for fully stream aggregation will be given in the next following days.

Thank you @xiaoyong-z -- I read your addition and left some comments. Overall I think it is a great idea.

Here is one possible approach to implementation (perhaps what you had in mind):

1. Implement `StreamAggregate` that handles pre-sorted data (where the data is already sorted according to the grouping keys / partition keys).
2. Remove `AggregateStream` https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/no_grouping.rs and replace its use in the optimizer with the new `StreamAggregate` operator.
3. Update the optimizer to recognize when the input to a GroupByHash is sorted appropriately and switch to using the `StreamAggregate` operator.

I think that would get us pretty far.

I am not sure about the idea of "sort the data first and then run the stream aggregator" -- as I mentioned in the document, I think it is unlikely that approach will be better in terms of overall memory usage or performance. When we want to support spilling group by (external group by), that is when sort might be beneficial.
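To make the pre-sorted case concrete, here is a minimal sketch of the core loop a streaming aggregate could use. It is plain Rust over an iterator of `(key, value)` pairs, not DataFusion code: the function name `stream_sum` and the choice of a sum over `i64` are assumptions for illustration only, and a real operator would work on Arrow record batches and arbitrary accumulators.

```rust
/// Hypothetical helper (not part of DataFusion): sums values per group,
/// assuming rows arrive sorted, or at least clustered, by the group key.
/// Only one group's state is held at a time; a finished group is emitted
/// as soon as the key changes.
fn stream_sum<K: PartialEq>(rows: impl IntoIterator<Item = (K, i64)>) -> Vec<(K, i64)> {
    let mut out = Vec::new();
    // State for the group currently being accumulated, if any.
    let mut current: Option<(K, i64)> = None;

    for (key, value) in rows {
        // Same key as the open group: fold the value in and move on.
        if let Some((k, sum)) = current.as_mut() {
            if *k == key {
                *sum += value;
                continue;
            }
        }
        // Key changed (or first row): emit the finished group, open a new one.
        if let Some(done) = current.take() {
            out.push(done);
        }
        current = Some((key, value));
    }

    // Flush the last open group.
    if let Some(done) = current {
        out.push(done);
    }
    out
}
```

With sorted input such as `[("a", 1), ("a", 2), ("b", 5)]` this yields one row per key. If the input is not sorted, the same key can be emitted more than once, which is why the optimizer would need to confirm the input ordering before choosing an operator like this over the hash-based one.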
