[
https://issues.apache.org/jira/browse/ARROW-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521868#comment-17521868
]
Weston Pace edited comment on ARROW-16178 at 4/13/22 6:31 PM:
--------------------------------------------------------------
Yes. Reconciliation is not always needed but you are correct about the scratch
spaces. For example, let's assume we have an int32 array {{x}} and an
expression {{x < 20 && x > 10}} and we are implementing a filter node.
This translates roughly to...
{noformat}
const auto& x = batch.get("x");
auto y = call("lt", {x, 20});
auto z = call("gt", {x, 10});
auto filter = call("and", {y, z});
auto out = call("take", {batch, filter});
{noformat}
For each batch that passes through the node we have to heap-allocate {{y}},
{{z}}, {{filter}}, and {{out}}. However, if our max batch size is defined
(let's say 10k rows) then we can preallocate {{y}}, {{z}} and {{filter}} at
plan creation time. In fact, since this node has only a single output we can
even preallocate {{out}}.
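
The preallocation idea above can be sketched in plain C++ without any Arrow types. Everything here is hypothetical and chosen only for illustration: the {{PreallocatedFilter}} class, its {{Process}} method, and the simplification of the batch to a single int32 column. It is not how Arrow's kernels are implemented, only a minimal picture of sizing {{y}}, {{z}}, {{filter}} and {{out}} once and reusing them for every batch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical filter node for the expression `x < 20 && x > 10`.
// None of these names are Arrow APIs; the batch is reduced to one int32 column.
class PreallocatedFilter {
 public:
  // All scratch buffers are sized once, at "plan creation" time.
  explicit PreallocatedFilter(std::size_t max_batch_size)
      : y_(max_batch_size),
        z_(max_batch_size),
        filter_(max_batch_size),
        out_(max_batch_size) {}

  // Filters one batch and returns the number of selected values, which are
  // stored in out()[0 .. count). No heap allocation happens in this method.
  std::size_t Process(const std::int32_t* x, std::size_t length) {
    assert(length <= y_.size());  // batches must respect the max batch size
    for (std::size_t i = 0; i < length; ++i) y_[i] = x[i] < 20;  // "lt"
    for (std::size_t i = 0; i < length; ++i) z_[i] = x[i] > 10;  // "gt"
    for (std::size_t i = 0; i < length; ++i) {
      filter_[i] = static_cast<std::uint8_t>(y_[i] && z_[i]);  // "and"
    }
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
      if (filter_[i]) out_[count++] = x[i];  // apply the mask
    }
    return count;
  }

  const std::int32_t* out() const { return out_.data(); }

 private:
  std::vector<std::uint8_t> y_, z_, filter_;  // reused scratch space
  std::vector<std::int32_t> out_;             // reused output buffer
};
```

Because the scratch buffers are mutated on every call, one such object cannot be shared by threads that process batches concurrently. Each thread needs its own copy, which is where per-thread state comes in.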
At the moment, for the filter node, this discussion is largely theoretical.
The kernel infrastructure isn't yet ready to receive a preallocated output
buffer. In the hash-join node, however (and possibly the aggregate node), this
sort of pattern is actively being used, and there is a need for a more efficient
way of accessing the thread-local data.
> [C++] Add a ThreadLocalState concept built on thread local
> ----------------------------------------------------------
>
> Key: ARROW-16178
> URL: https://issues.apache.org/jira/browse/ARROW-16178
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The ThreadLocalState is tied to an executor and, on creation, creates a state
> for every thread in the executor. In order to quickly access a particular
> thread's state we need a way to get a thread index (the index of the thread
> in the executor). Historically we used ThreadIndexer and this JIRA
> introduces a new approach using thread local.
> Similar to the ThreadIndexer this thread local state concept will fail when
> the capacity is resized during a run.
> Similar to the ThreadIndexer this concept won't work too well for serial
> execution until ARROW-15732 is resolved.
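
The description above can be sketched with standard C++ only. This is a rough illustration under stated assumptions, not the implementation from the pull request: {{CurrentThreadIndex}} and {{ThreadLocalStateSketch}} are hypothetical names, and the index is handed out process-wide in first-call order rather than being tied to a specific executor.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: returns a small, dense index that stays the same for
// the calling thread. The counter is process-wide (an assumption made for
// brevity), so any thread that calls this consumes an index.
inline std::size_t CurrentThreadIndex() {
  static std::atomic<std::size_t> next_index{0};
  thread_local const std::size_t index = next_index.fetch_add(1);
  return index;
}

// Hypothetical container holding one State per thread, created up front.
template <typename State>
class ThreadLocalStateSketch {
 public:
  // `capacity` stands in for the number of threads in the executor.
  explicit ThreadLocalStateSketch(std::size_t capacity) : states_(capacity) {
    assert(capacity > 0);
  }

  // Returns the calling thread's state. at() throws std::out_of_range if more
  // threads show up than the capacity fixed at construction, which mirrors
  // the "fails when the capacity is resized during a run" limitation.
  State& Get() { return states_.at(CurrentThreadIndex()); }

  const std::vector<State>& states() const { return states_; }

 private:
  std::vector<State> states_;
};
```

Each thread only touches its own element, so {{Get()}} needs no locking. Reading all the states back (for example to merge them) is only safe once the worker threads have finished.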