Hey all, I have drafted a PR to update the format docs with clarifications around equivalence and deviations between IPC files and IPC streams: https://github.com/apache/arrow/pull/49947
Can people take a look and make comments/suggestions? Once there is enough consensus, I will launch a formal vote to get these changes approved.

Regards

Antoine.

On 2026/02/17 18:19:32 Antoine Pitrou wrote:
>
> Hello,
>
> The IPC file format is defined as the IPC stream format, preceded by a
> header (the Arrow magic bytes) and followed by a footer (a catalog of
> record batches, and the Arrow magic bytes). Thus, reading and writing
> IPC files can reuse the same basic building blocks as for IPC streams
> (this is almost trivial for writing, which is usually done sequentially).
>
> As a consequence, IPC files practically result in valid identical IPC
> streams (ignoring the 8 header bytes) that read as the same logical
> contents.
>
> However, there is no theoretical guarantee that this is always the case.
> Consider an IPC file writer that would write record batches in reverse
> order in the footer, compared to their sequential order in the
> underlying stream. Or, more generally, an IPC file footer that would
> repeat or skip some batches in the stream.
>
> So theoretically, we cannot assume that reading an IPC file as an IPC
> stream (after skipping the 8 header bytes) returns the intended contents.
>
> However, it seems that it could be useful to be able to make such an
> assumption. Hence these questions:
> 1. Do all current IPC file writers uphold this assumption?
> 2. Do we want to make it a more explicit requirement of the IPC file format?
>
>
> Context: I've submitted a PR
> (https://github.com/apache/arrow/pull/49312) to enable differential
> fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of
> the IPC file and stream readers on the fuzzing payload.
>
> Regards
>
> Antoine.
>
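
For readers less familiar with the layout discussed above, here is a minimal sketch in plain Python of how an IPC file's bytes can be split into the embedded stream and the footer. It assumes the layout given in the format docs: the 6-byte magic "ARROW1" padded to 8 bytes, the stream, the footer, a little-endian int32 footer size, then the magic again. The function name split_ipc_file is hypothetical, and the sketch only locates the two regions; it does not decode the Flatbuffers footer, so it cannot by itself tell whether the footer's batch catalog agrees with the stream order.

```python
import struct

MAGIC = b"ARROW1"   # Arrow magic bytes
HEADER_LEN = 8      # magic (6 bytes) plus 2 padding bytes
TRAILER_LEN = 10    # int32 footer size (4 bytes) plus trailing magic (6 bytes)


def split_ipc_file(data: bytes):
    """Return (stream_bytes, footer_bytes) for an IPC file payload.

    Hypothetical helper for illustration only: it validates the magic
    bytes and the footer size, but does not parse the footer itself.
    """
    if len(data) < HEADER_LEN + TRAILER_LEN:
        raise ValueError("payload too short to be an IPC file")
    if data[:len(MAGIC)] != MAGIC or data[-len(MAGIC):] != MAGIC:
        raise ValueError("missing Arrow magic bytes")

    # The footer size sits just before the trailing magic.
    (footer_size,) = struct.unpack("<i", data[-TRAILER_LEN:-len(MAGIC)])
    footer_start = len(data) - TRAILER_LEN - footer_size
    if footer_size < 0 or footer_start < HEADER_LEN:
        raise ValueError("invalid footer size")

    stream = data[HEADER_LEN:footer_start]
    footer = data[footer_start:len(data) - TRAILER_LEN]
    return stream, footer


if __name__ == "__main__":
    # Synthetic placeholder payloads, NOT real Arrow messages.
    fake_stream = b"<stream bytes>"
    fake_footer = b"<footer>"
    payload = (MAGIC + b"\x00\x00" + fake_stream + fake_footer
               + struct.pack("<i", len(fake_footer)) + MAGIC)
    stream, footer = split_ipc_file(payload)
    print(stream, footer)
```

A stream reader handed the file minus its first 8 bytes would see the stream region first, which is the "read the file as a stream" path; a file reader instead starts from the footer, which is where the two could in theory disagree.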
