Ziheng Wang created ARROW-17380:
-----------------------------------
Summary: Tag record batches with start_byte and end_byte
information
Key: ARROW-17380
URL: https://issues.apache.org/jira/browse/ARROW-17380
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Ziheng Wang
Assignee: Ziheng Wang
It would be useful for a record batch to carry information about where it came
from in the source dataset. This serves a few purposes:
* Rereading a particular record batch without rereading the entire fragment
* Tracking how much of a particular (file) dataset has been consumed
It could also help with debugging when a record batch causes an error
downstream.
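
As a rough illustration of the consumer side, the sketch below shows how a
start_byte/end_byte pair could be used to reread one batch and to report
progress. It is plain Python rather than Arrow code; the helper names
reread_range and progress are hypothetical and are not part of any Arrow API.

```python
import os


def reread_range(path: str, start_byte: int, end_byte: int) -> bytes:
    """Return only the bytes that produced one batch (hypothetical helper).

    Assumes end_byte is exclusive, i.e. one past the last byte of the batch.
    """
    with open(path, "rb") as f:
        f.seek(start_byte)
        return f.read(end_byte - start_byte)


def progress(path: str, end_byte: int) -> float:
    """Fraction of the file consumed once a batch ending at end_byte is done."""
    size = os.path.getsize(path)
    return end_byte / size if size else 1.0
```

With these, a downstream error could be reported together with the byte range
of the offending batch, and only that range would need to be reread.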
The plan is to add an attribute at this point in the code:
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L923]
The Scanner would attach it to each record batch as the batch is
generated.
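
To make the idea concrete, here is a minimal pure-Python sketch of a scanner
that tags each batch as it is produced. It does not use the Arrow Scanner; the
names TaggedBatch and scan_csv are hypothetical, end_byte is assumed to be
exclusive, and the CSV parsing is deliberately naive (no quoted fields).

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Iterator, List


@dataclass
class TaggedBatch:
    """A batch of rows plus the byte range it was read from (hypothetical)."""
    rows: List[List[str]]
    start_byte: int  # offset of the first byte of the batch in the source
    end_byte: int    # offset one past the last byte of the batch


def scan_csv(source: BinaryIO, rows_per_batch: int) -> Iterator[TaggedBatch]:
    """Yield batches of rows, each tagged with its source byte range."""
    source.readline()  # skip the header line
    start = source.tell()
    rows: List[List[str]] = []
    while True:
        line = source.readline()
        if not line:
            break
        # Naive split: assumes no quoted or escaped commas.
        rows.append(line.decode("utf-8").rstrip("\r\n").split(","))
        if len(rows) == rows_per_batch:
            end = source.tell()
            yield TaggedBatch(rows, start, end)
            rows, start = [], end
    if rows:  # final, possibly short, batch
        yield TaggedBatch(rows, start, source.tell())


data = b"a,b\n1,2\n3,4\n5,6\n"
batches = list(scan_csv(io.BytesIO(data), rows_per_batch=2))
```

Because the byte ranges are contiguous and aligned to line boundaries, slicing
the source with a batch's start_byte and end_byte returns exactly the lines
that batch was built from.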
This is useful for file-based formats like CSV. In Parquet it is less
necessary, since record batches (usually) correspond to row groups and row group
IDs can serve this function.