Clarifying how MapNode works

Li Jin Thu, 19 May 2022 07:37:58 -0700

Hello!

I have some questions about MapNode in Arrow compute and hope to get some
clarification if my understanding is correct.


In Arrow compute, map nodes are effectively "pure functions" that map one
input batch to one output batch. By pure function I mean that map nodes
don't manage extra thread/thread pool, nor does it utilize any existing
worker pool (even though the method in MapNode is called SubmitTask, it is
really doing the computation with the caller thread, instead of
"Submitting" it to some other thread). Whichever thread calls
MapNode::InputReceived, the map expression is executed by that thread, and
if there are multiple map node chained together, the thread that calls
InputReceived is effectively doing a

map_fn3(map_fn2(map_fn1(input_batch)))

Each map func is a kernel that operates on arrays and materializes
intermediate results. If there are multiple columns and only a small subset
is touched by the map nodes, effectively the rest of the columns are just
passed by pointers (i.e., the intermediate batches just copy the pointers
of the untouched columns and not the data itself). And perhaps those
untouched columns should not be loaded into CPU caches? (Because the kernel
is operating on them?).

Appreciate your help,
Li

Clarifying how MapNode works

Reply via email to