[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

Andy Grove (Jira) Thu, 10 Oct 2019 19:48:17 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949075#comment-16949075
 ]


Andy Grove commented on ARROW-6659:
-----------------------------------

Yes, I was thinking that the LogicalPlan would have a new Merge variant and 
also would know about partitioning. Initially we can just add a partition_count 
attribute to each variant in the enum, Later we might want to add extra 
information about partitions such as size, because eventually we may want to 
explore cost-based optimizations.

However, I like what you are doing here as well since it also meets my main 
goal of completely removing any merging from HashAggregateExec .... maybe we 
should do your approach first and then see if there is a need to move this to 
the logical plan in the future? Nice work :)

 

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -------------------------------------------------------------------------
>
>                 Key: ARROW-6659
>                 URL: https://issues.apache.org/jira/browse/ARROW-6659
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Kyle McCarthy
>            Priority: Major
>
> HashAggregateExec current creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>    - MergeExec
>       - DistributedExec
>          - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

Reply via email to