Weston Pace created ARROW-13599:
-----------------------------------
Summary: [C++] [Dataset] Add optional scan type that tags batches
with locational information
Key: ARROW-13599
URL: https://issues.apache.org/jira/browse/ARROW-13599
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
Currently there are two types of scans:
* Ordered scan - Yields batches in order (includes batch index and fragment
index)
* Unordered scan - Yields batches in any order (no batch index or fragment
index)
A third type of scan (tagged scan? indexed scan?) could tag each batch with
its starting row number. Certain file formats (such as Parquet and IPC) should
be able to support this with performance similar to an unordered scan, since
the number of rows is available in the file metadata.
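As a rough sketch of why metadata is enough: if the row count of every batch
is known up front, the starting row of each batch is a running sum, so batches
can be tagged no matter what order they are later read in. The code below is
not Arrow API. FragmentMetadata and ComputeStartingRows are hypothetical names
used only for illustration, and the row counts stand in for whatever a Parquet
or IPC footer actually reports.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-batch row counts that a format such as
// Parquet or IPC records in its file metadata. Not an Arrow type.
struct FragmentMetadata {
  std::vector<int64_t> batch_row_counts;
};

// Returns starts[f][b]: the dataset-wide row number of the first row of
// batch b in fragment f. Only metadata is consulted, so no batch data has to
// be read, and the result does not depend on the order batches are scanned.
std::vector<std::vector<int64_t>> ComputeStartingRows(
    const std::vector<FragmentMetadata>& fragments) {
  std::vector<std::vector<int64_t>> starts;
  starts.reserve(fragments.size());
  int64_t next_row = 0;  // running total of rows seen so far
  for (const auto& fragment : fragments) {
    std::vector<int64_t> fragment_starts;
    fragment_starts.reserve(fragment.batch_row_counts.size());
    for (int64_t num_rows : fragment.batch_row_counts) {
      assert(num_rows >= 0);  // row counts from metadata are never negative
      fragment_starts.push_back(next_row);
      next_row += num_rows;
    }
    starts.push_back(std::move(fragment_starts));
  }
  return starts;
}
```

With these offsets computed once, an unordered scan could attach
starts[fragment_index][batch_index] to each batch as it arrives.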
Other file formats (such as CSV) could fall back to an ordered scan, or use a
two-pass approach: first count the newlines in the file, then scan the file
itself (not sure yet whether this makes sense).
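The first pass of the two-pass idea might look something like the sketch
below. CountNewlines is a hypothetical helper, not part of Arrow. It also
assumes that a newline always ends a row, which does not hold for CSV files
with newlines inside quoted fields, and it does not account for header rows or
a missing trailing newline.

```cpp
#include <algorithm>
#include <cstdint>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// First pass: count '\n' bytes in the stream without parsing any values.
// Assumption: every newline terminates exactly one row.
int64_t CountNewlines(std::istream& in) {
  return static_cast<int64_t>(std::count(std::istreambuf_iterator<char>(in),
                                         std::istreambuf_iterator<char>(),
                                         '\n'));
}

// Convenience overload for in-memory text, handy for small experiments.
int64_t CountNewlines(const std::string& text) {
  std::istringstream in(text);
  return CountNewlines(in);
}
```

The per-file counts could then be summed in file order to give each file a
starting row number before the second, real scan begins.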