[ https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947901#comment-16947901 ]
Kyle McCarthy commented on ARROW-6659:
--------------------------------------
Do you have a specific solution in mind? I was thinking that this could be done
by pulling some of the logic out of the partitions method in
HashAggregateExec, but it could probably also work with generics.
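
One way the generics idea could look, as a minimal sketch rather than
DataFusion's actual API: the aggregate operator is parameterized over a merge
strategy, so another project could supply its own. The `Merge` trait,
`LocalMerge`, the simplified `HashAggregate` struct, and the sum-by-key
aggregate are all hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Partial aggregate state: group key -> running sum (illustrative only).
type Partial = BTreeMap<String, i64>;

/// Hypothetical merge strategy; not part of DataFusion.
trait Merge {
    /// Combine per-partition partial results into one stream of rows.
    fn merge(&self, partials: Vec<Partial>) -> Vec<(String, i64)>;
}

/// In-memory merge: concatenate the rows of every partial result.
struct LocalMerge;

impl Merge for LocalMerge {
    fn merge(&self, partials: Vec<Partial>) -> Vec<(String, i64)> {
        partials.into_iter().flatten().collect()
    }
}

/// Simplified stand-in for HashAggregateExec, generic over the merge step.
struct HashAggregate<M: Merge> {
    merge: M,
}

/// Sum values by key; used for both the partial and the final aggregate.
fn aggregate(rows: Vec<(String, i64)>) -> Partial {
    let mut acc = Partial::new();
    for (key, value) in rows {
        *acc.entry(key).or_insert(0) += value;
    }
    acc
}

impl<M: Merge> HashAggregate<M> {
    fn sum_by_key(&self, partitions: Vec<Vec<(String, i64)>>) -> Partial {
        // Step 1: aggregate each input partition independently.
        let partials: Vec<Partial> = partitions.into_iter().map(aggregate).collect();
        // Step 2: pluggable merge (a cluster-aware implementation could go here).
        let merged = self.merge.merge(partials);
        // Step 3: final reduce over the merged rows.
        aggregate(merged)
    }
}
```

A distributed variant would implement `Merge` differently and leave the two
aggregate steps untouched. The trade-off is that the merge choice is fixed at
the type level instead of being visible in the physical plan.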
> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -------------------------------------------------------------------------
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for
> the initial per-partition aggregate, then explicitly calls MergeExec, and
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For
> example, it is not possible to provide a different MergeExec implementation
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the
> physical plan contains explicit steps such as:
>
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - HashAggregate // aggregate per partition
> {code}
> This would then make it easier to customize the plan in other projects, to
> support distributed execution:
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - DistributedExec
>       - HashAggregate // aggregate per partition
> {code}
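
The planner-based design described in the issue could be sketched roughly as
follows. This is an illustration under stated assumptions, not DataFusion
code: the `Plan` enum, `AggMode`, `plan_aggregate`, `distribute`, and `render`
are hypothetical names, and the `Scan` leaf is added only so the tree has an
input.

```rust
/// Hypothetical, simplified physical plan node (not DataFusion's real types).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { partitions: usize },
    HashAggregate { mode: AggMode, input: Box<Plan> },
    Merge { input: Box<Plan> },
    Distributed { input: Box<Plan> },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum AggMode {
    Partial,
    Final,
}

/// Planner step: emit partial aggregate, merge, and final aggregate as
/// explicit nodes instead of hiding them inside one operator.
fn plan_aggregate(input: Plan) -> Plan {
    let partial = Plan::HashAggregate { mode: AggMode::Partial, input: Box::new(input) };
    let merged = Plan::Merge { input: Box::new(partial) };
    Plan::HashAggregate { mode: AggMode::Final, input: Box::new(merged) }
}

/// Rewrite another project might apply: put a Distributed node under each Merge.
fn distribute(plan: Plan) -> Plan {
    match plan {
        Plan::Merge { input } => Plan::Merge {
            input: Box::new(Plan::Distributed { input: Box::new(distribute(*input)) }),
        },
        Plan::HashAggregate { mode, input } => {
            Plan::HashAggregate { mode, input: Box::new(distribute(*input)) }
        }
        Plan::Distributed { input } => Plan::Distributed { input: Box::new(distribute(*input)) },
        scan @ Plan::Scan { .. } => scan,
    }
}

/// Print the plan as an indented list, one node per line.
fn render(plan: &Plan, depth: usize, out: &mut String) {
    let indent = "  ".repeat(depth);
    let (label, child) = match plan {
        Plan::Scan { partitions } => (format!("Scan ({} partitions)", partitions), None),
        Plan::HashAggregate { mode: AggMode::Final, input } => {
            ("HashAggregate // final aggregate".to_string(), Some(input))
        }
        Plan::HashAggregate { mode: AggMode::Partial, input } => {
            ("HashAggregate // aggregate per partition".to_string(), Some(input))
        }
        Plan::Merge { input } => ("MergeExec".to_string(), Some(input)),
        Plan::Distributed { input } => ("DistributedExec".to_string(), Some(input)),
    };
    out.push_str(&format!("{}- {}\n", indent, label));
    if let Some(child) = child {
        render(child, depth + 1, out);
    }
}
```

Because the merge is an ordinary node in the tree, a downstream project can
customize execution with a plan rewrite such as `distribute`, without touching
the aggregate operator itself.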
--
This message was sent by Atlassian Jira
(v8.3.4#803005)