Andy Grove commented on ARROW-6659:

Yes, basically what I am suggesting is that the HashAggregateExec never merges 
and just aggregates each partition.

When we create the physical plan from the logical plan, we should have the 
logic there to know that if we are aggregating a partitioned data source then 
we need to merge and then aggregate again ... so yes, moving logic from 
HashAggregateExec to create_physical_plan, as you said.

Currently the logical plan doesn't know about partitioning though ... so we'd 
need to add that info.

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -------------------------------------------------------------------------
>                 Key: ARROW-6659
>                 URL: https://issues.apache.org/jira/browse/ARROW-6659
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Major
> HashAggregateExec current creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>    - MergeExec
>       - DistributedExec
>          - HashAggregate // aggregate per partition{code}

This message was sent by Atlassian Jira

Reply via email to