+1, I'm generally in favor of the idea.  I would prefer
`recordBatchNumRows` (or, less favorably, `recordBatchSize`).  I don't
think `recordBatchLengths` works because there are already places in the
footer where "length" is interpreted as "number of bytes".

I'm not an expert on FlatBuffers evolution, but I wonder if we want to
create a new table rather than a struct (structs can't gain new fields
later) so that new statistics can be added in the future if we desire.

```
table RecordBatchStatistics {
  // Number of rows in the batch
  num_rows: long;
  // Open spot for extending with new statistics such as min/max,
  // cardinality, bloom filter, etc.
}

...

recordBatchStatistics: [RecordBatchStatistics];
```
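
Whichever shape we pick, the payoff is the same: with per-batch row
counts in the footer, a reader can locate the batch that holds an
arbitrary row from the footer alone, using a running total over the
counts. A rough Python sketch of that lookup (assuming the counts have
already been decoded from the footer into a plain list; `row_counts`
and `locate_row` are just illustrative names, not an existing API):

```
import bisect
import itertools

def locate_row(row_counts, row_index):
    """Return (batch_index, row_within_batch) for a global row index.

    row_counts: per-batch row counts, in file order (e.g. decoded from
    the proposed footer field).
    """
    # ends[i] is the total number of rows in batches 0..i.
    ends = list(itertools.accumulate(row_counts))
    if not ends or not (0 <= row_index < ends[-1]):
        raise IndexError("row index out of range")
    # First batch whose cumulative end exceeds row_index.
    batch_index = bisect.bisect_right(ends, row_index)
    rows_before = ends[batch_index - 1] if batch_index > 0 else 0
    return batch_index, row_index - rows_before

# e.g. locate_row([3, 5, 2], 6) -> (1, 3)
```

The matching entry in recordBatches then gives the block offset to
read, so the whole seek costs only the footer read.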

On Sat, Mar 18, 2023 at 9:39 PM Steve Kim <chairm...@gmail.com> wrote:

> Hello everyone,
>
> I would like to be able to quickly seek to an arbitrary row in an Arrow
> file.
>
> With the current file format, reading the file footer alone is not enough
> to determine the record batch that contains a given row index. The row counts
> of the record batches are only found in the metadata for each record batch,
> which are scattered at different offsets in the file. Multiple
> non-contiguous small reads can be costly (e.g., HTTP GET requests to read
> byte ranges from an S3 object).
>
> This problem has been discussed in GitHub issues:
>
> https://github.com/apache/arrow/issues/18250
>
> https://github.com/apache/arrow/issues/24575
>
> To solve this problem, I propose a small backwards-compatible change to the
> file format. We can add a
>
>     recordBatchLengths: [long];
>
> field to the Footer table (
> https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
> type of this new field match the length field in the RecordBatch table.
> This new field must be after the custom_metadata field in the Footer table
> to satisfy constraints of FlatBuffers schema evolution. An Arrow file whose
> footer lacks the recordBatchLengths field would be read with a default
> value of null, which indicates that the row counts are not present.
>
> What do people think?
>
> Thanks,
> Steve
>
