Hello everyone,

I would like to be able to quickly seek to an arbitrary row in an Arrow file.
With the current file format, reading the file footer alone is not enough to determine which record batch contains a given row index. The row counts of the record batches are only stored in the metadata of each record batch, and those metadata blocks are scattered at different offsets in the file. Issuing multiple small, non-contiguous reads to collect them can be costly (e.g., HTTP GET requests for byte ranges of an S3 object). This problem has been discussed in GitHub issues:

https://github.com/apache/arrow/issues/18250
https://github.com/apache/arrow/issues/24575

To solve this, I propose a small, backwards-compatible change to the file format: add a recordBatchLengths: [long]; field to the Footer table (https://github.com/apache/arrow/blob/main/format/File.fbs). The name and type of the new field match the length field of the RecordBatch table. To satisfy FlatBuffers schema-evolution rules (new fields may only be appended to the end of a table), the field must be placed after the existing custom_metadata field in the Footer table. An Arrow file whose footer lacks the recordBatchLengths field would be read with a default value of null, which indicates that the row counts are not present.
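Roughly, the Footer table would end up looking like the sketch below (I have abbreviated the existing fields; see File.fbs for the authoritative definitions):

    table Footer {
      version: org.apache.arrow.flatbuf.MetadataVersion;
      schema: org.apache.arrow.flatbuf.Schema;
      dictionaries: [ Block ];
      recordBatches: [ Block ];
      custom_metadata: [ KeyValue ];

      /// Proposed: row count of each record batch, in the same order as
      /// recordBatches. Null in files written before this change.
      recordBatchLengths: [ long ];
    }

With that field available, a reader can locate the batch containing a given row index from the footer alone, by taking a running sum of the per-batch row counts and finding the first batch whose cumulative count exceeds the index. A minimal sketch of the lookup, assuming the footer has already been parsed into a list of per-batch row counts (Python and the locate_row name are just for illustration, not an existing API):

    import bisect
    from itertools import accumulate

    def locate_row(batch_lengths, row_index):
        """Return (batch_index, offset_within_batch) for a global row index,
        given the per-batch row counts read from the footer."""
        cumulative = list(accumulate(batch_lengths))  # e.g. [100, 250, 400]
        if row_index < 0 or row_index >= cumulative[-1]:
            raise IndexError("row index out of range")
        batch = bisect.bisect_right(cumulative, row_index)
        start = cumulative[batch - 1] if batch > 0 else 0
        return batch, row_index - start

The reader then needs only a single additional read, for the one record batch that contains the requested row.

What do people think?

Thanks,
Steve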