[GitHub] [arrow-datafusion] alamb commented on issue #5133: Support StreamAggregation

via GitHub Wed, 08 Feb 2023 09:08:17 -0800


alamb commented on issue #5133:
URL: 
https://github.com/apache/arrow-datafusion/issues/5133#issuecomment-1422957942

> PTAL. Detail design for fully stream aggregation will be given in the next
> following days.

Thank you @xiaoyong-z -- I read your addition and left some comments.
Overall I think it is a great idea.

Here is one possible approach to implementation (perhaps what you had in
mind):
1. Implement `StreamAggregate` that handles pre-sorted data (where the data
is already sorted according to the grouping keys/ partition keys).
2. Remove `AggregateStream`
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/no_grouping.rs
and replace its use in the optimizer with the new StreamAggregate operator
3. Update the optimizer to recognize when the input to a GroupByHash is
sorted appropriately and switch to using the `StreamAggregate` operator.
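
To illustrate the idea behind step 1, here is a minimal sketch in plain Rust. It is not DataFusion's operator or API: `stream_sum`, the `(key, value)` tuple input, and the choice of `SUM` as the aggregate are all assumptions made for the example. The point is only that, when rows arrive sorted by the grouping key, a group can be emitted as soon as the key changes, so only one group's state is held at a time.

```rust
/// Sum `value` per `key` over rows that are ALREADY sorted by `key`.
/// Hypothetical helper for illustration only; not part of DataFusion.
pub fn stream_sum<K, I>(sorted_rows: I) -> Vec<(K, i64)>
where
    K: PartialEq,
    I: IntoIterator<Item = (K, i64)>,
{
    let mut out = Vec::new();
    // State for the single group that is currently open.
    let mut current: Option<(K, i64)> = None;

    for (key, value) in sorted_rows {
        // Does this row belong to the group that is currently open?
        let same_group = matches!(&current, Some((open_key, _)) if *open_key == key);
        if same_group {
            // Same key: fold the value into the running sum.
            if let Some((_, sum)) = current.as_mut() {
                *sum += value;
            }
        } else if let Some(done) = current.replace((key, value)) {
            // Key changed: the previous group is complete, so emit it.
            out.push(done);
        }
    }

    // Flush the last open group, if any.
    if let Some(done) = current {
        out.push(done);
    }
    out
}
```

Calling `stream_sum` on `[("a", 1), ("a", 2), ("b", 5)]` yields one row per key in input order. If the input is not sorted, the same key is emitted more than once, which is why the optimizer has to verify the sort order before choosing such an operator.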

I think that would get us pretty far.

I am not sure about the idea of "sort the data first and then run the stream
aggregator" -- as I mentioned in the document I think it is unlikely that
approach will be better in terms of overall memory usage or performance.

Sorting might become beneficial once we want to support a spilling group by
(external group by).

