Weston Pace created ARROW-11800:
-----------------------------------
Summary: [C++] Add unordered scan
Key: ARROW-11800
URL: https://issues.apache.org/jira/browse/ARROW-11800
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace

Currently Scan generates an ordered sequence of batches, which can delay work
that is already available. For example, consider reading four Parquet files in
parallel from S3. There is no good way to predict which read will finish first.
If file 2 finishes before file 1, we could start parsing the contents of file 2
immediately, but we currently do not.

Scan could then provide an option controlling whether to preserve ordering.
Cases that do not care about ordering (e.g. counting rows) could take
advantage of this to reduce memory pressure.
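
To illustrate the difference such an option would make, here is a minimal
sketch in Python using only the standard library. It is not Arrow code: the
names read_file, scan, preserve_order and the LATENCY table are all
hypothetical, and the sleeps merely stand in for S3 reads of varying duration.

```python
import concurrent.futures
import time

# Hypothetical per-file read latencies in seconds; file 1 is the slowest.
LATENCY = {1: 0.20, 2: 0.05, 3: 0.10, 4: 0.15}


def read_file(file_id):
    """Stand-in for a remote read: sleep, then return a fake batch."""
    time.sleep(LATENCY[file_id])
    return file_id, f"batch-from-file-{file_id}"


def scan(file_ids, preserve_order):
    """Yield (file_id, batch) pairs in file order or in completion order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(file_ids)) as pool:
        futures = [pool.submit(read_file, f) for f in file_ids]
        if preserve_order:
            # Ordered: wait for file 1 even if later files are already done.
            for fut in futures:
                yield fut.result()
        else:
            # Unordered: hand over each batch as soon as its read finishes.
            for fut in concurrent.futures.as_completed(futures):
                yield fut.result()


ordered = [fid for fid, _ in scan([1, 2, 3, 4], preserve_order=True)]
unordered = [fid for fid, _ in scan([1, 2, 3, 4], preserve_order=False)]
print("ordered:  ", ordered)
print("unordered:", unordered)
```

With these latencies the unordered variant will typically yield file 2 first,
so a consumer such as a row count can start on it while file 1 is still being
read. The exact completion order depends on thread scheduling.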

Note: This would be an optimization even for cases that do care about ordering.
We could still parse / project / etc. out of order and simply reorder at the
end. The only difference between unordered and ordered scans would then be the
memory pressure applied, since an ordered scan must hold finished results until
all earlier ones have arrived.
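
The "process out of order, reorder at the end" idea can be sketched the same
way. Again this is only an illustration in standard-library Python, not the
Arrow implementation. The names read_and_parse and scan_reordered are
hypothetical. The pending dict plays the role of the reorder buffer, and its
peak size is a rough proxy for the extra memory pressure of an ordered scan.

```python
import concurrent.futures
import time

# Hypothetical latencies in seconds; the first file (index 0) is the slowest.
LATENCY = {0: 0.20, 1: 0.05, 2: 0.10, 3: 0.15}


def read_and_parse(index):
    """Stand-in for reading and parsing one file."""
    time.sleep(LATENCY[index])
    return index, f"parsed-{index}"


def scan_reordered(n):
    """Process n files out of order, but emit the results in file order.

    Returns the ordered results and the peak number of results that sat
    in the reorder buffer at the same time.
    """
    pending = {}     # finished results, keyed by file index
    next_index = 0   # next index that may be emitted
    peak = 0
    out = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(read_and_parse, i) for i in range(n)]
        for fut in concurrent.futures.as_completed(futures):
            index, result = fut.result()
            pending[index] = result
            peak = max(peak, len(pending))
            # Flush every result that is now contiguous with the output.
            while next_index in pending:
                out.append(pending.pop(next_index))
                next_index += 1
    return out, peak


results, peak_buffered = scan_reordered(4)
print(results, "peak buffered:", peak_buffered)
```

All parsing work runs as soon as each read completes, but nothing can be
emitted until index 0 arrives, so the buffer grows in the meantime. An
unordered scan would skip the buffer entirely.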