[
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256640#comment-17256640
]
Andy Grove commented on ARROW-11058:
------------------------------------
I can see your argument, [~jorgecarleitao].
I can see how, in some cases, we could parallelize operations across batches
within one or more partitions rather than repartitioning to increase
parallelism. If an operator requires its input to be in a specific order, we
would need to fall back to single-threaded behavior per partition, though
(SortMergeJoin, SortAggregate, etc.).
It is possible I am just too familiar with how this is normally done in other
query engines, but I see partitions as the unit of parallelism. Batches are
just there so we can do vectorized processing, and we need to manage the size
of the batches for efficient processing (this is more important on GPU than
CPU, though).
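As a rough illustration of that model (not DataFusion code), the sketch below
runs one thread per partition and processes the batches inside each partition
sequentially and in order. The Batch and Partition types and the
sum_per_partition function are hypothetical stand-ins; a real engine would
operate on Arrow RecordBatch values and would normally use a task scheduler
rather than raw threads.

```rust
use std::thread;

// Hypothetical stand-ins for illustration only: a batch is one column of
// values, and a partition is an ordered sequence of batches.
type Batch = Vec<i64>;
type Partition = Vec<Batch>;

/// Sums each partition on its own thread. Partitions are the unit of
/// parallelism; batches within a partition are consumed in order on a
/// single thread, which is what an order-sensitive operator would need.
fn sum_per_partition(partitions: &[Partition]) -> Vec<i64> {
    thread::scope(|s| {
        // Spawn one scoped worker per partition.
        let handles: Vec<_> = partitions
            .iter()
            .map(|p| {
                s.spawn(move || {
                    // Each batch is processed as a whole, batch after batch.
                    p.iter().map(|b| b.iter().sum::<i64>()).sum::<i64>()
                })
            })
            .collect();

        // Collect results in partition order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<i64>>()
    })
}
```

Parallelizing across batches within a partition would instead hand individual
batches to different workers, which only works when the operator does not
depend on batch order.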
I think if/when we get to distributed queries in DataFusion, the partition
model is even more important, especially when reading Parquet files that are
partitioned based on keys, so we keep processing of related data together in
the same partitions/nodes.
I'd be interested to hear what others have to say on all of this, of course.
> [Rust] [DataFusion] Implement "coalesce batches" operator
> ---------------------------------------------------------
>
> Key: ARROW-11058
> URL: https://issues.apache.org/jira/browse/ARROW-11058
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with
> one of these so that small batches can be recombined into larger batches to
> improve the efficiency of upstream operators.
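
The coalescing idea described above can be sketched roughly as follows. This
is a minimal illustration under stated assumptions, not the operator's actual
implementation: Batch is a hypothetical single-column stand-in for an Arrow
RecordBatch, and Coalescer, target_rows and coalesce_all are names invented
for this example.

```rust
// Hypothetical stand-in for illustration only: one column of values.
type Batch = Vec<i64>;

/// Buffers small input batches and emits one combined batch once at least
/// `target_rows` rows have been buffered.
struct Coalescer {
    target_rows: usize,
    buffer: Vec<Batch>,
    buffered_rows: usize,
}

impl Coalescer {
    fn new(target_rows: usize) -> Self {
        Coalescer { target_rows, buffer: Vec::new(), buffered_rows: 0 }
    }

    /// Accepts one input batch. Returns a combined batch when the buffered
    /// row count reaches the target, otherwise None. Empty batches are dropped.
    fn push(&mut self, batch: Batch) -> Option<Batch> {
        if batch.is_empty() {
            return None;
        }
        self.buffered_rows += batch.len();
        self.buffer.push(batch);
        if self.buffered_rows >= self.target_rows {
            self.flush()
        } else {
            None
        }
    }

    /// Emits whatever is buffered, preserving row order. Call at end of input.
    fn flush(&mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            return None;
        }
        let mut out = Vec::with_capacity(self.buffered_rows);
        for b in self.buffer.drain(..) {
            out.extend(b);
        }
        self.buffered_rows = 0;
        Some(out)
    }
}

/// Coalesces a whole sequence of batches, e.g. the output of a filter.
fn coalesce_all(input: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut coalescer = Coalescer::new(target_rows);
    let mut out = Vec::new();
    for batch in input {
        if let Some(combined) = coalescer.push(batch) {
            out.push(combined);
        }
    }
    // Emit the final partial batch, if any.
    if let Some(rest) = coalescer.flush() {
        out.push(rest);
    }
    out
}
```

In this sketch a combined batch may exceed target_rows, because batches are
never split; only the final batch can be smaller than the target.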