Hello everyone,

I would like to be able to quickly seek to an arbitrary row in an Arrow
file.

With the current file format, reading the file footer alone is not enough to
determine the record batch that contains a given row index. The row counts
of the record batches are stored only in the metadata of each record batch,
and those metadata blocks sit at scattered offsets within the file. Multiple
non-contiguous small reads can be costly (e.g., HTTP GET requests for byte
ranges of an S3 object).
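
To make the cost concrete, here is a rough sketch of what a reader
effectively has to do today. read_batch_length is a hypothetical callback
standing in for a ranged read of one record batch's metadata block (at an
offset taken from the footer) plus a FlatBuffers decode of its length field,
so each call is a separate small read:

    def find_batch_for_row(row_index, batch_metadata_offsets, read_batch_length):
        """Return (batch index, row offset within that batch) for a global row index.

        batch_metadata_offsets comes from the footer; read_batch_length(offset)
        must fetch and decode one RecordBatch metadata block, e.g. one HTTP
        range request per batch when the file lives on S3.
        """
        rows_before = 0
        for batch_index, offset in enumerate(batch_metadata_offsets):
            length = read_batch_length(offset)  # one small, non-contiguous read
            if row_index < rows_before + length:
                return batch_index, row_index - rows_before
            rows_before += length
        raise IndexError(f"row {row_index} is out of range")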

This problem has been discussed in GitHub issues:

https://github.com/apache/arrow/issues/18250

https://github.com/apache/arrow/issues/24575

To solve this problem, I propose a small backwards-compatible change to the
file format. We can add a

    recordBatchLengths: [long];

field to the Footer table (
https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
type of this new field match those of the length field in the RecordBatch
table.
This new field must be placed after the custom_metadata field in the Footer
table to satisfy the constraints of FlatBuffers schema evolution (new fields
may only be appended to a table). An Arrow file whose footer lacks the
recordBatchLengths field would be read with a default value of null,
indicating that the row counts are not present.
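
To make the shape of the change concrete, here is a sketch of the amended
Footer table (existing fields as in File.fbs today, type names abbreviated):

    table Footer {
      version: MetadataVersion;
      schema: Schema;
      dictionaries: [ Block ];
      recordBatches: [ Block ];
      custom_metadata: [ KeyValue ];

      /// Proposed: number of rows in each record batch, parallel to
      /// recordBatches. Appended after custom_metadata so that existing
      /// readers ignore it and existing files read it back as null.
      recordBatchLengths: [ long ];
    }

With that field populated, seeking to a row becomes a prefix-sum search over
data that is already in the footer, with no extra reads. A rough Python
sketch of the lookup, where record_batch_lengths is just the list of
per-batch row counts taken from the footer:

    import bisect
    from itertools import accumulate

    def find_batch_for_row(row_index, record_batch_lengths):
        """Locate (batch index, row offset within batch) using only the footer."""
        ends = list(accumulate(record_batch_lengths))  # cumulative row counts
        if row_index < 0 or not ends or row_index >= ends[-1]:
            raise IndexError(f"row {row_index} is out of range")
        batch_index = bisect.bisect_right(ends, row_index)
        rows_before = ends[batch_index - 1] if batch_index > 0 else 0
        return batch_index, row_index - rows_before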

What do people think?

Thanks,
Steve
