willtemperley commented on issue #47824: URL: https://github.com/apache/arrow/issues/47824#issuecomment-3420684950
@lidavidm Thanks for clarifying this. Yes, exactly: `metadata_size` in the encapsulated message format and `metadata_size` in the `Block` struct in the footer refer to _almost_ the same thing, except the one in the footer is the `metadata_size` plus the encapsulated message format prefix length. I think this is definitely confusing!

Reading [encapsulated-message-format](https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format):

> IPC File Format
>
> We define a “file format” supporting random access that is an extension of the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access to any record batch in the file. See [File.fbs](https://github.com/apache/arrow/blob/main/format/File.fbs) for the precise details of the file footer.

So looking at File.fbs we have:

```
struct Block {

  /// Index to the start of the RecordBlock (note this is past the Message header)
  offset: long;

  /// Length of the metadata
  metaDataLength: int;

  /// Length of the data (this is aligned so there can be a gap between this and
  /// the metadata).
  bodyLength: long;
}
```

We have `metaDataLength` aka `metadata_size` but no mention of the prefix. Perhaps this could be made explicit?
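To make the difference concrete, here is a minimal sketch of the relationship as I understand it from this thread. It assumes the current encapsulated message format, where the prefix is a 4-byte continuation marker (`0xFFFFFFFF`) followed by a 4-byte little-endian `metadata_size`. The helper names `read_prefix` and `footer_metadata_length` are made up for illustration and are not part of any Arrow library.

```python
import struct

# Assumption: current stream format, i.e. a 4-byte continuation marker
# followed by a 4-byte little-endian int32 metadata_size.
CONTINUATION = 0xFFFFFFFF
PREFIX_LEN = 8  # 4 bytes continuation + 4 bytes metadata_size


def read_prefix(buf: bytes, offset: int = 0) -> int:
    """Return metadata_size from the message prefix starting at `offset`."""
    continuation, metadata_size = struct.unpack_from("<Ii", buf, offset)
    if continuation != CONTINUATION:
        raise ValueError("expected continuation marker 0xFFFFFFFF")
    return metadata_size


def footer_metadata_length(metadata_size: int) -> int:
    """Value the footer Block would carry in metaDataLength (hypothetical helper):
    the stream-level metadata_size plus the length of the prefix itself."""
    return PREFIX_LEN + metadata_size


# Fake message: prefix announcing 16 bytes of (placeholder) metadata.
message = struct.pack("<Ii", CONTINUATION, 16) + b"\x00" * 16

metadata_size = read_prefix(message)
block_metadata_length = footer_metadata_length(metadata_size)
print(metadata_size, block_metadata_length)
```

So a reader that takes `metaDataLength` from the footer and treats it as the flatbuffer length would overshoot by the prefix length, which is the confusion described above.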
