[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

Andy Grove (Jira) Sun, 13 Oct 2019 09:54:24 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950387#comment-16950387
 ]


Andy Grove commented on ARROW-6659:
-----------------------------------

Hi [~kylemccarthy] ... I think you should create a PR based on the change you 
already made first, to move the logic from HashAggregate into the physical 
query planner, then we can look at the next step of moving the logic to the 
logical plan which is a much larger change.

To answer your questions though, partition count is basically how many 
partitions the input data source has, which usually means how many files are 
there in the directory. Batch count is unrelated and is just how many rows are 
read at a time.

The example logical plan you provided is correct.

So the LogicalPlan for a CSV data source that is a directory containing 4 files 
would be 4.

The LogicalPlan that aggregates that would also have 4 partitions (same as the 
input).

LogicalPlan::Merge always has partition count 1.

Hope that helps.

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -------------------------------------------------------------------------
>
>                 Key: ARROW-6659
>                 URL: https://issues.apache.org/jira/browse/ARROW-6659
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Kyle McCarthy
>            Priority: Major
>
> HashAggregateExec current creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>    - MergeExec
>       - DistributedExec
>          - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

Reply via email to