[jira] [Comment Edited] (ARROW-16178) [C++] Add a ThreadLocalState concept built on thread local

Weston Pace (Jira) Wed, 13 Apr 2022 11:32:08 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521868#comment-17521868
 ]


Weston Pace edited comment on ARROW-16178 at 4/13/22 6:31 PM:
--------------------------------------------------------------

Yes.  Reconciliation is not always needed but you are correct about the scratch 
spaces.  For example, let's assume we have an int32 array {{x}} and an 
expression {{x < 20 && x > 10}} and we are implementing a filter node.

This translates roughly to...

{noformat}
const auto& x = batch.get("x");
auto y = call("lt", {x, 20});               // y = x < 20
auto z = call("gt", {x, 10});               // z = x > 10
auto filter = call("and", {y, z});          // filter = y && z
auto out = call("take", {batch, filter});   // keep rows where filter is true
{noformat}

For each batch that passes through the node, we have to heap-allocate {{y}}, 
{{z}}, {{filter}}, and {{out}}.  However, if our max batch size is defined 
(let's say 10k rows) then we can preallocate {{y}}, {{z}} and {{filter}} at 
plan creation time.  In fact, since this node has only a single output we can 
even preallocate {{out}}.
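
To make the preallocation idea concrete, here is a minimal sketch in plain C++ that does not use the Arrow kernel API. {{FilterScratch}} and {{FilterBatch}} are hypothetical names invented for this illustration, and the byte-per-row masks are an assumption made for simplicity. The scratch space is sized once for the maximum batch size and reused for every batch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scratch space, sized once at plan creation time.
struct FilterScratch {
  explicit FilterScratch(std::size_t max_batch_size)
      : y(max_batch_size), z(max_batch_size), filter(max_batch_size) {
    out.reserve(max_batch_size);
  }
  std::vector<std::uint8_t> y;       // result of x < 20
  std::vector<std::uint8_t> z;       // result of x > 10
  std::vector<std::uint8_t> filter;  // result of y && z
  std::vector<std::int32_t> out;     // selected values
};

// Evaluates x < 20 && x > 10 over one batch, reusing the scratch buffers.
// Returns the number of selected rows; the rows are left in scratch.out.
std::size_t FilterBatch(const std::vector<std::int32_t>& x,
                        FilterScratch& scratch) {
  // The batch must not exceed the max batch size the scratch was built for.
  assert(x.size() <= scratch.y.size());
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) scratch.y[i] = x[i] < 20;
  for (std::size_t i = 0; i < n; ++i) scratch.z[i] = x[i] > 10;
  for (std::size_t i = 0; i < n; ++i) {
    scratch.filter[i] = scratch.y[i] && scratch.z[i];
  }
  scratch.out.clear();  // clear() keeps the capacity, so nothing is reallocated
  for (std::size_t i = 0; i < n; ++i) {
    if (scratch.filter[i]) scratch.out.push_back(x[i]);
  }
  return scratch.out.size();
}
```

With several threads running the node at once, each thread would need its own {{FilterScratch}}, which is where the thread local state comes in.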

At the moment, for the filter node, this discussion is largely theoretical.  
The kernel infrastructure isn't yet ready to receive a preallocated output 
buffer.  In the hash-join node, however (and possibly the aggregate node), this 
sort of pattern is actively being used, and there is a need for a more efficient 
way of accessing the thread-local data.
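
As a rough illustration of the approach in the issue description (one state per executor thread, looked up by a thread index held in thread-local storage), here is a standalone sketch. It is not the code from the pull request: {{CurrentThreadIndex}}, {{ThreadLocalStateSketch}} and {{RunWorkers}} are hypothetical names, and plain {{std::thread}} workers stand in for the executor.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical: index of the calling thread within its executor, or -1 if the
// thread has not been assigned one. A real executor would set this once when
// it starts each worker thread.
int& CurrentThreadIndex() {
  thread_local int index = -1;
  return index;
}

// One State per executor thread, created up front and never resized.
template <typename State>
class ThreadLocalStateSketch {
 public:
  explicit ThreadLocalStateSketch(std::size_t num_threads)
      : states_(num_threads) {}

  // Returns the calling thread's state without locking or hashing.
  State& Get() {
    const int index = CurrentThreadIndex();
    assert(index >= 0 && static_cast<std::size_t>(index) < states_.size());
    return states_[static_cast<std::size_t>(index)];
  }

  // Gives access to all states, e.g. to reconcile them after the run.
  std::vector<State>& states() { return states_; }

 private:
  std::vector<State> states_;
};

// Each worker increments only its own counter; the counters are summed
// (reconciled) after all workers have finished.
long RunWorkers(std::size_t num_threads, long per_thread) {
  ThreadLocalStateSketch<long> counters(num_threads);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_threads; ++i) {
    workers.emplace_back([&counters, i, per_thread] {
      CurrentThreadIndex() = static_cast<int>(i);  // stand-in for the executor
      for (long n = 0; n < per_thread; ++n) counters.Get() += 1;
    });
  }
  for (auto& worker : workers) worker.join();
  long total = 0;
  for (long count : counters.states()) total += count;
  return total;
}
```

Because the number of states is fixed at construction, this sketch has the same limitation the issue mentions: it would break if the executor's capacity changed during a run.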


was (Author: westonpace):
Yes.  Reconciliation is not always needed but you are correct about the scratch 
spaces.  For example, let's assume we have an int32 array {{x}} and an 
expression {{x < 20 && x > 10}} and we are implementing a filter node.

This translates roughly to...

{noformat}
const auto& x = batch.get("x");
auto y = call("lt", {x, 20});
auto z = call("gt", {x, 10});
auto filter = call("and", {y, z});
auto out = call("take", {batch, filter});
{noformat}

Each call we have to heap-allocate {{y}}, {{z}}, {{filter}}, and {{out}}.  
However, if our max batch size is defined (let's say 10k rows) then we can 
preallocate {{y}}, {{z}} and {{filter}} at plan creation time.  In fact, since 
this node has only a single output we can even preallocate {{out}}.

At the moment, for the filter node, this discussion is largely theoretical.  
The kernel infrastructure isn't yet ready to receive a preallocated output 
buffer.  In the hash-join node however (and possibly the aggregate node), this 
sort of pattern is actively being used and there is a need for a more efficient 
way of accessing the thread local data.

> [C++] Add a ThreadLocalState concept built on thread local
> ----------------------------------------------------------
>
>                 Key: ARROW-16178
>                 URL: https://issues.apache.org/jira/browse/ARROW-16178
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The ThreadLocalState is tied to an executor and, on creation, creates a state 
> for every thread in the executor.  In order to quickly access a particular 
> thread's state we need a way to get a thread index (the index of the thread 
> in the executor).  Historically we used ThreadIndexer and this JIRA 
> introduces a new approach using thread local.
> Similar to the ThreadIndexer this thread local state concept will fail when 
> the capacity is resized during a run.
> Similar to the ThreadIndexer this concept won't work too well for serial 
> execution until ARROW-15732 is resolved.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
