Lawrence Chan commented on ARROW-2296:
Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk
all the batches to sum them up.
Also, the lengths are indeed in the existing RecordBatch metadata, but the
current implementation is inside a .cc file, so I'd have to either copy+paste
it or modify my build to expose more of the existing code. Maybe we could
expose something like this on the RecordBatchFileReader?
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata)
Then it'd be possible to read the length fields without copying some of the
other stuff. Not sure if this is a good idea though, since it seems that we
don't usually expose the flatbuffers through the public API. Maybe just a
int64_t num_rows() const;
is all I really want, and that can read the new Footer field once it's in
there, and walk the batches in the current format?
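To make the fallback idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration: `FileFooter`, `BatchMetadata`, `kNoRowCount`, and the free function `NumRows` are hypothetical stand-ins I made up for this comment, not Arrow types or existing API, and it assumes the per-batch metadata has already been read without loading the batch bodies.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Arrow types.

// Assumed sentinel meaning "this file's footer has no row count".
const int64_t kNoRowCount = -1;

struct BatchMetadata {
  int64_t length;  // row count recorded in one record batch's metadata
};

struct FileFooter {
  int64_t num_rows;                    // proposed field; kNoRowCount if absent
  std::vector<BatchMetadata> batches;  // per-batch metadata, bodies not loaded
};

// Returns the total row count. Uses the footer field when it is present,
// otherwise sums the lengths from the per-batch metadata.
int64_t NumRows(const FileFooter& footer) {
  if (footer.num_rows != kNoRowCount) {
    return footer.num_rows;
  }
  int64_t total = 0;
  for (const BatchMetadata& batch : footer.batches) {
    assert(batch.length >= 0);  // a negative length would mean corrupt metadata
    total += batch.length;
  }
  return total;
}
```

The point of this shape is that callers get the same answer for old and new files; only the cost differs (one field read versus one metadata read per batch).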
> [C++] Add num_rows to file footer
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Format
> Reporter: Lawrence Chan
> Priority: Minor
> Maybe I'm overlooking something, but I don't see anything on the API surface
> to get the number of rows in an Arrow file without reading all the record
> batches. This is useful when we want to read into contiguous buffers, because
> it allows us to allocate the right sizes up front.
> I'd like to propose that we add `num_rows` as a field in the file footer so
> it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be
> nice to have a method that iterates over the record batch headers and sums up
> the lengths without reading the actual record batch body.