[ 
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256096#comment-17256096
 ] 

Andy Grove commented on ARROW-11058:
------------------------------------

[~jorgecarleitao]

I think that the PR [https://github.com/apache/arrow/pull/9043] will help 
explain this, but Apache Spark actually does do something very similar. Spark 
has partitions which are the unit of parallelism (either on threads or 
executors) and each partition is an iterator[T].

Spark supports row-based (Iterator[Row]) and column-based operators 
(Iterator[ColumnarBatch]) operators out-the-box although most of the built-in 
operators are row-based. Spark will insert transitions as required to convert 
between row and column-based operators.

Because filters can produce empty batches or batches with a single or small 
number of rows, we lose some efficiency both with SIMD and also just due to 
per-batch overheads in particular kernels (as we have seen with 
MutableArrayData). 

Small batches can also be inefficient when writing out to Parquet because we 
lose the benefits of compression to some degree, so this is another use case 
where we would want to coalesce them.

Coalescing batches is especially important for GPU if we ever add support for 
that because the cost of an operation on GPU is the same (once data is loaded) 
regardless of how many items it is operating on, so it is beneficial to operate 
on as much data in parallel as possible.

> [Rust] [DataFusion] Implement "coalesce batches" operator
> ---------------------------------------------------------
>
>                 Key: ARROW-11058
>                 URL: https://issues.apache.org/jira/browse/ARROW-11058
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches 
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with 
> one of these so that small batches can be recombined into larger batches to 
> improve the efficiency of upstream operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to