IIRC, structs are immutable once defined; if you want the schema to evolve, then tables are necessary.
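
To illustrate, here is a rough, untested sketch of what a table-based variant might look like. The names `RecordBatchStatistics`, `num_rows` and `recordBatchStatistics` come from Weston's snippet quoted below; having one entry per record batch (so `num_rows` becomes a plain `long` rather than `[long]`) is my assumption. As far as I know a struct could not hold a `[long]` vector anyway, since FlatBuffers structs only take scalars and other structs.

```
// Sketch only, not the actual File.fbs. A table (unlike a struct) can gain
// new fields later, as long as they are appended after the existing ones.
table RecordBatchStatistics {
  // Number of rows in the batch
  num_rows: long;
  // Future statistics (min/max, cardinality, bloom filter, ...) would be
  // appended here.
}

table Footer {
  // ... existing Footer fields stay as they are ...

  // Assumed: one entry per record batch, in the same order as the batches.
  // Appended at the end of the table; absent in files from older writers.
  recordBatchStatistics: [RecordBatchStatistics];
}
```

Older readers would simply ignore the new field, and newer readers would see it as absent when opening older files.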

On Mon, Mar 20, 2023 at 8:22 AM Weston Pace <weston.p...@gmail.com> wrote:

> +1, I'm generally in favor of the idea. I would prefer
> `recordBatchNumRows` (or, less favorably, `recordBatchSize`). I don't
> think `recordBatchLengths` works because there are already places in the
> footer where "length" is interpreted as "number of bytes".
>
> I'm not an expert on flatbuffers evolution but I wonder if we want to
> create a new struct (table?) so that new statistics can be added in the
> future if we desire.
>
> ```
> struct RecordBatchStatistics {
>   // Number of rows in the batch
>   num_rows: [long];
>   // Open spot for extending with new statistics such as min/max,
>   // cardinality, bloom filter, etc.
> }
>
> ...
>
> recordBatchStatistics: [RecordBatchStatistics];
> ```
>
> On Sat, Mar 18, 2023 at 9:39 PM Steve Kim <chairm...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > I would like to be able to quickly seek to an arbitrary row in an Arrow
> > file.
> >
> > With the current file format, reading the file footer alone is not
> > enough to determine the record batch that contains a given row index.
> > The row counts of the record batches are only found in the metadata for
> > each record batch, which are scattered at different offsets in the file.
> > Multiple non-contiguous small reads can be costly (e.g., HTTP GET
> > requests to read byte ranges from a S3 object).
> >
> > This problem has been discussed in GitHub issues:
> >
> > https://github.com/apache/arrow/issues/18250
> >
> > https://github.com/apache/arrow/issues/24575
> >
> > To solve this problem, I propose a small backwards-compatible change to
> > the file format. We can add a
> >
> >     recordBatchLengths: [long];
> >
> > field to the Footer table
> > (https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
> > type of this new field match the length field in the RecordBatch table.
> >
> > This new field must be after the custom_metadata field in the Footer
> > table to satisfy constraints of FlatBuffers schema evolution. An Arrow
> > file whose footer lacks the recordBatchLengths field would be read with
> > a default value of null, which indicates that the row counts are not
> > present.
> >
> > What do people think?
> >
> > Thanks,
> > Steve