[ https://issues.apache.org/jira/browse/ARROW-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518666#comment-17518666 ]
Weston Pace commented on ARROW-16138:
-------------------------------------
Some suggestions I have heard:
* We could separate dispatch from execution, allowing us to "prepare" an
expression once (do all the dispatch work up front) and then execute the
prepared expression many times (see the first sketch below).
* We could avoid allocating temporary buffers during scalar execution. For
example, if the expression is x > 0 && x < 20 then we know we will need two
temporary boolean arrays in addition to the input array (x) and the output
array (boolean). If we define a "max batch size" then we can preallocate
those two temporary arrays once and reuse them for every execution (see the
second sketch below).
* In the execution engine we may be able to preallocate the output buffer in
some cases. For example, if we have filter -> project then we know the
filter's output buffer is only ever used as input to the project. At plan
creation time we could allocate that filter -> project buffer once (per
thread) and then reuse it for every execution. This would require some way
to pass a preallocated output array to ExecuteScalarExpression (see the
third sketch below).
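
First sketch, for separating dispatch from execution. PreparedExpression and
its methods are hypothetical names for illustration, not existing Arrow APIs;
arrow::compute::Expression, ExecBatch, and Datum are real types.

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/expression.h>

// Hypothetical type (not an existing Arrow API): performs all dispatch
// work once so that repeated executions can skip it entirely.
class PreparedExpression {
 public:
  // Resolves kernels and output types up front against a fixed input
  // schema.
  static arrow::Result<PreparedExpression> Prepare(
      const arrow::compute::Expression& expr,
      const arrow::Schema& input_schema);

  // Can be called many times; pays no per-call dispatch cost.
  arrow::Result<arrow::Datum> Execute(
      const arrow::compute::ExecBatch& batch) const;
};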
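
Second sketch, for the preallocated temporaries. ExprScratch and
kMaxBatchSize are invented for illustration; arrow::AllocateBitmap and
ARROW_ASSIGN_OR_RAISE are real.

#include <cstdint>
#include <memory>

#include <arrow/api.h>

// Hypothetical per-thread scratch space, allocated once and reused for
// every batch, eliminating per-execution allocation.
struct ExprScratch {
  static constexpr int64_t kMaxBatchSize = 16 * 1024;  // rows

  std::shared_ptr<arrow::Buffer> gt_zero;    // temporary for x > 0
  std::shared_ptr<arrow::Buffer> lt_twenty;  // temporary for x < 20

  static arrow::Result<ExprScratch> Make(arrow::MemoryPool* pool) {
    ExprScratch scratch;
    // Boolean arrays are bitmaps (one bit per row), so each temporary
    // for a 16k-row max batch costs only 2 KiB.
    ARROW_ASSIGN_OR_RAISE(scratch.gt_zero,
                          arrow::AllocateBitmap(kMaxBatchSize, pool));
    ARROW_ASSIGN_OR_RAISE(scratch.lt_twenty,
                          arrow::AllocateBitmap(kMaxBatchSize, pool));
    return scratch;
  }
};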
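
Third sketch, for passing a preallocated output array to
ExecuteScalarExpression. This overload does not exist today (the current
function returns a newly allocated Datum); the signature below is just one
possible shape.

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/expression.h>

namespace cp = arrow::compute;

// Hypothetical: the engine allocates `out` once per thread at plan
// creation time (sized for the max batch size) and hands it to every
// filter -> project execution instead of allocating a fresh output on
// each call.
arrow::Status ExecuteScalarExpressionInto(
    const cp::Expression& expr, const cp::ExecBatch& input,
    arrow::ArrayData* out, cp::ExecContext* exec_context = nullptr);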
> [C++] Improve performance of ExecuteScalarExpression
> ----------------------------------------------------
>
> Key: ARROW-16138
> URL: https://issues.apache.org/jira/browse/ARROW-16138
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> One of the things we want to be able to do in the streaming execution engine
> is process data in small, L2-sized batches. Based on the literature we would
> like to use batches somewhere in the range of 1k to 16k rows. In ARROW-16014
> we created a benchmark to measure the performance of ExecuteScalarExpression
> as the size of our batches got smaller (see the sketch after the list
> below). There are two things we observed:
> * Something is causing thread contention. We should be able to get pretty
> close to perfect linear speedup when we are evaluating scalar expressions and
> the batch size fits entirely into L2. We are not seeing that.
> * The overhead of ExecuteScalarExpression is too high when processing small
> batches. Even when the expression is doing real work (e.g., copies and
> comparisons), the execution time starts to be dominated by overhead once we
> get down to 10k-row batches.
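>
> For reference, the shape of that measurement is roughly the sketch below.
> This is illustrative only, not the actual ARROW-16014 benchmark: it assumes
> a batch with an integer column "x", and the expression and iteration count
> are made up. (The expression header lives at arrow/compute/expression.h in
> newer Arrow versions.)
>
> #include <chrono>
> #include <iostream>
>
> #include <arrow/api.h>
> #include <arrow/compute/api.h>
> #include <arrow/compute/exec/expression.h>
>
> namespace cp = arrow::compute;
>
> // Times repeated evaluation of x > 0 && x < 20 over one small batch.
> // At small batch sizes, the per-call setup inside
> // ExecuteScalarExpression dominates this loop.
> arrow::Status TimeExpression(
>     const std::shared_ptr<arrow::RecordBatch>& batch) {
>   cp::Expression expr = cp::and_(
>       cp::call("greater", {cp::field_ref("x"), cp::literal(0)}),
>       cp::call("less", {cp::field_ref("x"), cp::literal(20)}));
>   ARROW_ASSIGN_OR_RAISE(expr, expr.Bind(*batch->schema()));
>
>   cp::ExecBatch input(*batch);
>   constexpr int kIters = 1000;
>   auto start = std::chrono::steady_clock::now();
>   for (int i = 0; i < kIters; ++i) {
>     ARROW_ASSIGN_OR_RAISE(arrow::Datum result,
>                           cp::ExecuteScalarExpression(expr, input));
>   }
>   auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
>                 std::chrono::steady_clock::now() - start)
>                 .count();
>   std::cout << (ns / kIters) << " ns/iteration" << std::endl;
>   return arrow::Status::OK();
> }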