Weston Pace created ARROW-12523:
-----------------------------------
Summary: [C++] [Dataset] Remove buffering from AsyncScanner
Key: ARROW-12523
URL: https://issues.apache.org/jira/browse/ARROW-12523
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
The MakeEnumeratedGenerator operator holds back one block so that it can
mark the final block as "last" (i.e. when it receives an EOF it releases the
held block, marked as last, and then releases the EOF).
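To make the one-block lookahead concrete, here is a minimal synchronous sketch (C++17). It is not the actual MakeEnumeratedGenerator, which is asynchronous; the names Enumerated, LookaheadEnumerator and EnumerateAll are hypothetical and exist only for illustration. It shows why an item cannot be released until the item after it (or the EOF) has been pulled.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an enumerated item: value, position, "last" flag.
template <typename T>
struct Enumerated {
  T value;
  int index;
  bool last;
};

// Wraps a pull-style source (a callable returning std::nullopt at the end)
// and holds back one item so the final item can be flagged as last.
template <typename T, typename Source>
class LookaheadEnumerator {
 public:
  explicit LookaheadEnumerator(Source source) : source_(std::move(source)) {}

  std::optional<Enumerated<T>> Next() {
    if (!primed_) {
      // First call: fill the one-item buffer.
      held_ = source_();
      primed_ = true;
    }
    if (!held_.has_value()) return std::nullopt;  // source is exhausted
    // Pull one item ahead; if there is none, the held item is the last one.
    std::optional<T> next = source_();
    Enumerated<T> out{std::move(*held_), index_++, !next.has_value()};
    held_ = std::move(next);
    return out;
  }

 private:
  Source source_;
  std::optional<T> held_;
  bool primed_ = false;
  int index_ = 0;
};

// Illustration helper: enumerate a vector through the lookahead wrapper.
inline std::vector<Enumerated<int>> EnumerateAll(const std::vector<int>& input) {
  std::size_t pos = 0;
  auto source = [&]() -> std::optional<int> {
    if (pos == input.size()) return std::nullopt;
    return input[pos++];
  };
  LookaheadEnumerator<int, decltype(source)> enumerator(source);
  std::vector<Enumerated<int>> out;
  while (auto item = enumerator.Next()) out.push_back(*item);
  return out;
}
```

Note that every call to Next() pulls item N+1 from the source before handing out item N, which is the buffering this issue proposes to remove.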
However, this adds complexity (most evident in the tests for unordered
scan) and could disrupt cache locality. For example, a thread will receive
batch X, parse & decode batch X, then filter and project batch X-1.
We could push the responsibility of tagging the last batch/fragment into the
readers themselves or we could release an empty "last" batch which serves as a
token to the later resequencer (think of it as an end-of-fragment token in
addition to the end-of-scan token we already have).
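The second option (an empty "last" batch acting as an end-of-fragment token) might look roughly like the sketch below (C++17). All names here (TaggedBatch, ScanFragment, FragmentTracker) are hypothetical and are not part of the Arrow API. The sketch additionally assumes that the token carries the fragment's batch count in its batch_index, so a downstream consumer can tell when a fragment is complete even if batches arrive out of order.

```cpp
#include <functional>
#include <map>
#include <vector>

// Hypothetical tagged batch; `rows` is empty for an end-of-fragment token.
struct TaggedBatch {
  int fragment_index;
  int batch_index;
  bool end_of_fragment;
  std::vector<int> rows;
};

// Forwards each batch as soon as it is available (no lookahead buffer), then
// emits an empty token whose batch_index equals the number of data batches.
inline void ScanFragment(int fragment_index,
                         const std::vector<std::vector<int>>& batches,
                         const std::function<void(TaggedBatch)>& sink) {
  int batch_index = 0;
  for (const auto& rows : batches) {
    sink(TaggedBatch{fragment_index, batch_index++, false, rows});
  }
  sink(TaggedBatch{fragment_index, batch_index, true, {}});
}

// Downstream bookkeeping: a fragment is complete once its token has been
// seen and as many data batches have arrived as the token announced.
class FragmentTracker {
 public:
  void Observe(const TaggedBatch& batch) {
    State& state = states_[batch.fragment_index];
    if (batch.end_of_fragment) {
      state.expected = batch.batch_index;
      state.token_seen = true;
    } else {
      ++state.received;
    }
  }

  bool IsComplete(int fragment_index) const {
    auto it = states_.find(fragment_index);
    return it != states_.end() && it->second.token_seen &&
           it->second.received == it->second.expected;
  }

 private:
  struct State {
    int received = 0;
    int expected = 0;
    bool token_seen = false;
  };
  std::map<int, State> states_;
};
```

With this shape each data batch is released the moment it is read, so a thread can parse, filter and project the same batch X without waiting for batch X+1.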