+1, I'm generally in favor of the idea. I would prefer `recordBatchNumRows` (or, less favorably, `recordBatchSize`). I don't think `recordBatchLengths` works because there are already places in the footer where "length" is interpreted as "number of bytes".
I'm not an expert on flatbuffers evolution, but I wonder if we want to create a new table (FlatBuffers structs can't be extended, so it would need to be a table) so that new statistics can be added in the future if we desire:

```
table RecordBatchStatistics {
  // Number of rows in the batch
  num_rows: long;
  // Open spot for extending with new statistics such as
  // min/max, cardinality, bloom filter, etc.
}
...
recordBatchStatistics: [RecordBatchStatistics];
```

On Sat, Mar 18, 2023 at 9:39 PM Steve Kim <chairm...@gmail.com> wrote:

> Hello everyone,
>
> I would like to be able to quickly seek to an arbitrary row in an Arrow
> file.
>
> With the current file format, reading the file footer alone is not enough
> to determine which record batch contains a given row index. The row counts
> of the record batches are found only in the metadata for each record
> batch, which is scattered at different offsets in the file. Multiple
> non-contiguous small reads can be costly (e.g., HTTP GET requests for
> byte ranges of an S3 object).
>
> This problem has been discussed in GitHub issues:
>
> https://github.com/apache/arrow/issues/18250
> https://github.com/apache/arrow/issues/24575
>
> To solve this problem, I propose a small backwards-compatible change to
> the file format. We can add a
>
> recordBatchLengths: [long];
>
> field to the Footer table
> (https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
> type of this new field match the length field in the RecordBatch table.
> This new field must come after the custom_metadata field in the Footer
> table to satisfy the constraints of FlatBuffers schema evolution. An
> Arrow file whose footer lacks the recordBatchLengths field would be read
> with a default value of null, which indicates that the row counts are not
> present.
>
> What do people think?
>
> Thanks,
> Steve
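For concreteness, here is how a reader could use the proposed footer-level row counts to map a global row index to a batch with no extra reads. This is just a sketch in plain Python; `record_batch_lengths` stands in for the proposed footer field and `locate_row` is a hypothetical helper, not an existing Arrow API:

```python
import bisect
from itertools import accumulate

def locate_row(record_batch_lengths, row_index):
    """Map a global row index to (batch_index, row_within_batch).

    record_batch_lengths: per-batch row counts in file order, as the
    proposed recordBatchLengths footer field would provide them.
    """
    # Exclusive prefix sums: starts[i] is the global index of the
    # first row in batch i; starts[-1] is the total row count.
    starts = [0] + list(accumulate(record_batch_lengths))
    total = starts[-1]
    if not 0 <= row_index < total:
        raise IndexError(f"row {row_index} out of range ({total} rows)")
    # Rightmost batch whose starting row is <= row_index.
    batch = bisect.bisect_right(starts, row_index) - 1
    return batch, row_index - starts[batch]
```

For example, with batches of 100, 250, and 50 rows, `locate_row([100, 250, 50], 120)` returns `(1, 20)`: the reader would then fetch only batch 1 and skip to its row 20, instead of scanning per-batch metadata scattered through the file.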