Ziheng Wang created ARROW-17380:
-----------------------------------

             Summary: Tag record batches with start_byte and end_byte 
information
                 Key: ARROW-17380
                 URL: https://issues.apache.org/jira/browse/ARROW-17380
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ziheng Wang
            Assignee: Ziheng Wang


It might be desirable for a record batch to carry information about where it 
came from in the source dataset. This can be used for a few purposes:
 * Rereading a particular record batch without rereading the entire fragment
 * Easily tracking how much of a particular (file) dataset has been 
consumed.
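
To make the two uses above concrete, here is a minimal consumer-side sketch in plain Python. It does not use any Arrow API: the helper names `reread_batch` and `progress`, and the `start_byte`/`end_byte` tag layout, are assumptions made only for illustration.

```python
import io


def reread_batch(source, start_byte, end_byte):
    """Return the raw bytes of one batch without reading the rest of the file."""
    source.seek(start_byte)
    return source.read(end_byte - start_byte)


def progress(end_byte, total_bytes):
    """Fraction of the file consumed once the batch ending at end_byte is done."""
    return end_byte / total_bytes


# A tiny in-memory CSV file: a 4-byte header followed by three 4-byte rows.
data = b"a,b\n1,2\n3,4\n5,6\n"
source = io.BytesIO(data)

# Hypothetical tag attached to the batch that holds the first two data rows.
tag = {"start_byte": 4, "end_byte": 12}

raw = reread_batch(source, tag["start_byte"], tag["end_byte"])
fraction = progress(tag["end_byte"], len(data))
```

With these tags, rereading a batch is a single seek plus a bounded read, and progress is just the last `end_byte` seen divided by the file size.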

It could also be useful for debugging if a record batch resulted in an error 
downstream. 

The plan is to add an attribute around here: 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L923]
 that the Scanner will tag onto each record batch as it is being 
generated.
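
As a rough illustration of the producer side, the sketch below tags each batch with its byte range while scanning a CSV-like file. It is plain Python rather than the C++ Scanner code, and the function name `scan_tagged_batches` and the metadata keys are hypothetical. It also assumes one row per line, so quoted fields containing newlines are not handled.

```python
import io


def scan_tagged_batches(source, rows_per_batch):
    """Yield (rows, metadata) pairs; metadata records the byte range of the rows."""
    source.readline()  # skip the header line
    start = source.tell()
    rows = []
    while True:
        line = source.readline()
        if line:
            rows.append(line)
        # Emit a batch when it is full, or when the file ends with rows pending.
        if rows and (len(rows) == rows_per_batch or not line):
            end = source.tell()
            yield rows, {"start_byte": start, "end_byte": end}
            start, rows = end, []
        if not line:
            break


data = b"a,b\n1,2\n3,4\n5,6\n"
batches = list(scan_tagged_batches(io.BytesIO(data), rows_per_batch=2))
tags = [metadata for _, metadata in batches]
```

Each batch's `end_byte` equals the next batch's `start_byte`, so the tags tile the data portion of the file without gaps or overlap.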

This is useful for file-based formats like CSV. In Parquet this is less 
necessary since record batches (usually) correspond to row groups and row group 
IDs can be used to serve this function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
