[ 
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256640#comment-17256640
 ] 

Andy Grove commented on ARROW-11058:
------------------------------------

I can see your argument [~jorgecarleitao]

I can see how we could parallelize operations across batches within one or more 
partition in some cases rather than repartitioning to increase parallelism.

If an operator requires its input to be in a specific order we would need to 
fall back to single-threaded behavior per partition though (SortMergeJoin, 
SortAggregate, etc).

It is possible I am just too familiar with how this is normally done in other 
query engines but I see partitions as the unit of parallelism and batches are 
just there so we can do vectorized processing and we need to manage the size of 
the batches for efficient processing (this is more important on GPU than CPU 
though).

I think if/when we get to distributed queries in DataFusion, the partition 
model is even more important, especially when reading Parquet files that are 
partitioned based on keys, so we keep processing of related data together in 
the same partitions/nodes.

I'd be interested to hear what others have to say on all of this, of course.

 

 

 

> [Rust] [DataFusion] Implement "coalesce batches" operator
> ---------------------------------------------------------
>
>                 Key: ARROW-11058
>                 URL: https://issues.apache.org/jira/browse/ARROW-11058
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches 
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with 
> one of these so that small batches can be recombined into larger batches to 
> improve the efficiency of upstream operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to