Hi Antoine, thanks for raising this. There is a test [1] in the Go implementation that validates this behavior. It writes an IPC file, then ensures the same data is read back using a stream reader starting 8 bytes past the start of the buffer.
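
For anyone following along, here is a rough sketch of the byte layout that test relies on. It is not the Go test itself and does not decode any Arrow messages. It only assumes the file layout described in the columnar format spec: "ARROW1" padded to 8 bytes, the stream, the footer, an int32 footer size, then "ARROW1" again. The helper name `embedded_stream` is hypothetical.

```python
import struct

MAGIC = b"ARROW1"   # magic bytes found at both ends of an IPC file
HEADER_LEN = 8      # leading magic padded to an 8-byte boundary


def embedded_stream(file_bytes: bytes) -> bytes:
    """Return the bytes between the 8-byte header and the footer.

    Hypothetical helper: it only slices the buffer and does not check
    that the footer's batch catalog matches the order of the stream.
    """
    trailer_len = 4 + len(MAGIC)  # int32 footer size + trailing magic
    if len(file_bytes) < HEADER_LEN + trailer_len:
        raise ValueError("too short to be an Arrow IPC file")
    if not file_bytes.startswith(MAGIC) or not file_bytes.endswith(MAGIC):
        raise ValueError("missing ARROW1 magic bytes")

    # The footer size is a little-endian int32 just before the trailing magic.
    size_offset = len(file_bytes) - trailer_len
    (footer_len,) = struct.unpack_from("<i", file_bytes, size_offset)

    stream_end = size_offset - footer_len
    if footer_len < 0 or stream_end < HEADER_LEN:
        raise ValueError("invalid footer size")
    return file_bytes[HEADER_LEN:stream_end]
```

In practice a stream reader can simply start at offset 8 and stop at the end-of-stream marker, so computing the end of the region is only needed when the stream bytes have to be copied out.
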

I have personally seen and written code that assumes the embedded stream can be safely read to consume the contents of the IPC file. There is also an open PR [2] adding integration tests for related behavior, but it was never merged.

IMHO the flexibility to consume an IPC file as a stream improves its value compared to alternatives. Combined with existing usage relying on this assumption, my preference would be toward formalizing this as an explicit requirement.

[1] https://github.com/apache/arrow-go/blob/8d81fc39254b7a51daf1fe1a272c24169a059878/arrow/ipc/file_test.go#L83
[2] https://github.com/apache/arrow/pull/43834

Thanks,
Joel

On Tue, Feb 17, 2026 at 1:20 PM Antoine Pitrou <[email protected]> wrote:
>
> Hello,
>
> The IPC file format is defined as the IPC stream format, preceded by a
> header (the Arrow magic bytes) and followed by a footer (a catalog of
> record batches, and the Arrow magic bytes). Thus, reading and writing
> IPC files can reuse the same basic building blocks as for IPC streams
> (this is almost trivial for writing, which is usually done sequentially).
>
> As a consequence, IPC files practically result in valid identical IPC
> streams (ignoring the 8 header bytes) that read as the same logical
> contents.
>
> However, there is no theoretical guarantee that this is always the case.
> Consider a IPC file writer that would write record batches in reverse
> order in the footer, compared to their sequential order in the
> underlying stream. Or, more generally, an IPC file footer that would
> repeat or skip some batches in the stream.
>
> So theoretically, we cannot assume that reading an IPC file as an IPC
> stream (after skipping the 8 header bytes) returns the intended contents.
>
> However, it seems that it could be useful to be able to make such an
> assumption. Hence these questions:
> 1. Do all current IPC file writers uphold this assumption?
> 2. Do we want to make it a more explicit requirement of the IPC file
> format?
>
>
> Context: I've submitted a PR
> (https://github.com/apache/arrow/pull/49312) to enable differential
> fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of
> the IPC file and stream readers on the fuzzing payload.
>
> Regards
>
> Antoine.
>
>
