[
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256096#comment-17256096
]
Andy Grove commented on ARROW-11058:
------------------------------------
[~jorgecarleitao]
I think that the PR [https://github.com/apache/arrow/pull/9043] will help
explain this, but Apache Spark actually does do something very similar. Spark
has partitions which are the unit of parallelism (either on threads or
executors) and each partition is an iterator[T].
Spark supports row-based (Iterator[Row]) and column-based operators
(Iterator[ColumnarBatch]) operators out-the-box although most of the built-in
operators are row-based. Spark will insert transitions as required to convert
between row and column-based operators.
Because filters can produce empty batches or batches with a single or small
number of rows, we lose some efficiency both with SIMD and also just due to
per-batch overheads in particular kernels (as we have seen with
MutableArrayData).
Small batches can also be inefficient when writing out to Parquet because we
lose the benefits of compression to some degree, so this is another use case
where we would want to coalesce them.
Coalescing batches is especially important for GPU if we ever add support for
that because the cost of an operation on GPU is the same (once data is loaded)
regardless of how many items it is operating on, so it is beneficial to operate
on as much data in parallel as possible.
> [Rust] [DataFusion] Implement "coalesce batches" operator
> ---------------------------------------------------------
>
> Key: ARROW-11058
> URL: https://issues.apache.org/jira/browse/ARROW-11058
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with
> one of these so that small batches can be recombined into larger batches to
> improve the efficiency of upstream operators.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)