westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1120079201
The second thing you will often see mentioned is the "morsel / batch" model.
When reading data in, you generally want to read it in fairly large blocks
(counter-intuitively, these large blocks are referred to as "morsels"). This
can lead to an execution engine that is roughly:
```
parallel for morsel in data_source:
    for operator in pipeline:
        morsel = operator(morsel)
    send_to_sink(morsel)
```
However, since each operator makes its own pass over the same data, and
morsels are often bigger than the CPU caches, this can be inefficient: the
morsel is evicted from cache between operators, so every pass has to pull it
back in from main memory. Instead the ideal approach is:
```
for morsel in data_source:
    parallel for l2_batch in morsel:
        for operator in operators:
            l2_batch = operator(l2_batch)
        send_to_sink(l2_batch)
```
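To make the shape of that second loop concrete, here is a minimal runnable
Python sketch. Everything in it is illustrative assumptions on my part, not
the actual C++ engine: the operators, the morsel and batch sizes, and the use
of `ThreadPoolExecutor` for the "parallel for" are all stand-ins (and
CPython's GIL means this only models the scheduling shape, not real CPU
parallelism):
```
import concurrent.futures

# Hypothetical sizes for illustration; a real engine would derive the batch
# size from the L2 cache size and the row width.
MORSEL_SIZE = 100_000  # rows per morsel (the large block read from storage)
L2_BATCH_SIZE = 1_000  # rows per batch (small enough to stay cache-resident)

def data_source():
    # Stand-in for a file/stream reader: yields morsels of integer "rows".
    for start in range(0, 300_000, MORSEL_SIZE):
        yield list(range(start, start + MORSEL_SIZE))

# A toy operator pipeline: each operator maps a batch to a new batch.
def project(batch):
    return [x * 2 for x in batch]

def filter_op(batch):
    return [x for x in batch if x % 4 == 0]

operators = [project, filter_op]

def run_pipeline(batch):
    # Every operator runs over the same small batch while it is still hot in
    # cache, rather than each operator re-streaming the whole morsel.
    for operator in operators:
        batch = operator(batch)
    return batch  # in a real engine this would go to the sink

with concurrent.futures.ThreadPoolExecutor() as pool:
    for morsel in data_source():
        # Split the morsel into L2-sized batches and fan them out:
        # the "parallel for l2_batch in morsel" step.
        batches = [morsel[i:i + L2_BATCH_SIZE]
                   for i in range(0, len(morsel), L2_BATCH_SIZE)]
        for result in pool.map(run_pipeline, batches):
            pass  # send_to_sink(result)
```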
This is the model we are working towards in the current execution engine
(the hash join is pretty close; the projection and filter nodes still have
some work to do before they can handle small batches).