[ https://issues.apache.org/jira/browse/ARROW-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519080#comment-17519080 ]

Weston Pace commented on ARROW-16138:
-------------------------------------

> Have we profiled to see where the overhead is? (Though I suppose it may not 
> matter, if we just want to get rid of it all.)

No, but I do think profiling would be a good idea.  Even if we find the 
bottleneck is in some "dispatch" phase that we can get rid of, it would be good 
to prove that first before we start throwing solutions at it.  Mostly I was 
jotting these ideas down before I forget them.  [~zagto] is planning on looking 
into this further.
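
For concreteness, this is roughly the shape such a micro-benchmark could take 
(a sketch only, not the ARROW-16014 benchmark itself; it assumes Google 
Benchmark and the arrow/compute/exec/expression.h header path, which may 
differ between Arrow versions):

#include <cstdint>
#include <cstdlib>
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/exec/expression.h>
#include <benchmark/benchmark.h>

namespace cp = arrow::compute;

// Time ExecuteScalarExpression over a range of batch sizes; fixed per-call
// overhead shows up as items/second falling off at the small end.
static void BM_ExecuteScalarExpression(benchmark::State& state) {
  const int64_t batch_size = state.range(0);

  arrow::Int64Builder builder;
  for (int64_t i = 0; i < batch_size; ++i) {
    if (!builder.Append(i).ok()) std::abort();
  }
  std::shared_ptr<arrow::Array> values = builder.Finish().ValueOrDie();

  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  // Bind once, outside the timed loop, so only the per-batch path is measured.
  cp::Expression expr =
      cp::call("add", {cp::field_ref("x"), cp::literal(std::int64_t{1})});
  cp::Expression bound = expr.Bind(*schema).ValueOrDie();

  cp::ExecBatch batch({arrow::Datum(values)}, batch_size);
  for (auto _ : state) {
    arrow::Datum out = cp::ExecuteScalarExpression(bound, batch).ValueOrDie();
    benchmark::DoNotOptimize(out);
  }
  state.SetItemsProcessed(state.iterations() * batch_size);
}
BENCHMARK(BM_ExecuteScalarExpression)->RangeMultiplier(2)->Range(1 << 10, 1 << 17);

BENCHMARK_MAIN();

Running this under perf (rather than just timing it) would then show whether 
the cycles at small batch sizes land in the kernel or around it.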

> We may need to do some work to enable more kernels to be able to take 
> advantage of preallocated buffers.
> Not all currently do and it's not necessarily clear which are which (so even 
> if you could preallocate the output
> array in ExecuteScalarExpression, the kernel might discard it anyways).

Good point.  I also think some kernels will never support preallocation.  For 
example, if we are dealing with variable-length arrays like strings, we won't 
necessarily know a "max buffer size" even if we know a "max batch size".
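
To illustrate with a hypothetical helper (MaxDataBufferBytes is not an Arrow 
API): a bound on the row count gives a bound on the data buffer only for 
fixed-width types:

#include <cstdint>
#include <optional>

#include <arrow/type.h>

// Hypothetical helper: the largest data buffer that `max_rows` rows could
// need, when such a bound exists at all.
std::optional<int64_t> MaxDataBufferBytes(const arrow::DataType& type,
                                          int64_t max_rows) {
  if (const auto* fixed = dynamic_cast<const arrow::FixedWidthType*>(&type)) {
    // Fixed-width: int64 costs 8 bytes/row, boolean 1 bit/row (round up).
    return (static_cast<int64_t>(fixed->bit_width()) * max_rows + 7) / 8;
  }
  // Variable-length (utf8, binary, list, ...): the values buffer grows with
  // the data itself, so a max batch size implies no max buffer size.
  return std::nullopt;
}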

> For the first suggestion: what is dispatch referring to here? Resolving the 
> kernel? I thought binding an expression also resolved the kernel, I may be 
> wrong

The benchmark was running a bound expression.  However, I will admit that I 
have almost no idea how this process works :).  It's possible that there is 
nothing wrong with the dispatch mechanism itself and the problem is something 
related to the individual kernel execution.  We did try several different 
expressions in the benchmark.
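
For reference, this is where the two phases sit in the API as I understand it 
(a sketch; the schema and batch are assumed to exist, and the header path may 
vary by version):

#include <cstdint>

#include <arrow/api.h>
#include <arrow/compute/exec/expression.h>

namespace cp = arrow::compute;

// Bind happens once per expression; ExecuteScalarExpression is the per-batch
// call the benchmark times, so any per-batch "dispatch" cost must live there.
arrow::Result<arrow::Datum> AddOne(const arrow::Schema& schema,
                                   const cp::ExecBatch& batch) {
  cp::Expression expr =
      cp::call("add", {cp::field_ref("x"), cp::literal(std::int64_t{1})});
  // One-time: type checking and kernel selection against the schema.
  ARROW_ASSIGN_OR_RAISE(cp::Expression bound, expr.Bind(schema));
  // Per-batch: this is the hot path whose overhead is under discussion.
  return cp::ExecuteScalarExpression(bound, batch);
}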



> [C++] Improve performance of ExecuteScalarExpression
> ----------------------------------------------------
>
>                 Key: ARROW-16138
>                 URL: https://issues.apache.org/jira/browse/ARROW-16138
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> One of the things we want to be able to do in the streaming execution engine 
> is process data in small, L2-sized batches.  Based on the literature we might 
> like to use batches somewhere in the range of 1k to 16k rows.  In ARROW-16014 
> we created a benchmark to measure the performance of ExecuteScalarExpression 
> as the size of our batches got smaller.  There are two things we observed:
>  * Something is causing thread contention.  We should be able to get pretty 
> close to perfect linear speedup when we are evaluating scalar expressions and 
> the batch size fits entirely into L2.  We are not seeing that.
>  * The overhead of ExecuteScalarExpression is too high when processing small 
> batches.  Even when the expression is doing real work (e.g. copies, 
> comparisons) the execution time starts to be dominated by overhead when we 
> have 10k-row batches.


