Hello,
The IPC file format is defined as the IPC stream format, preceded by a
header (the Arrow magic bytes, padded to 8 bytes) and followed by a
footer (a catalog of record batches, then the footer size and the Arrow
magic bytes). Thus, reading and writing
IPC files can reuse the same basic building blocks as for IPC streams
(this is almost trivial for writing, which is usually done sequentially).
As a consequence, IPC files in practice embed a valid IPC stream
(starting right after the 8 header bytes) that reads as the same logical
contents as the file itself.
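
To make the layout concrete, here is a minimal Python sketch that slices
the embedded stream out of the raw bytes of an IPC file. It assumes the
layout from the format specification (magic bytes padded to 8 bytes, the
stream, the footer, a little-endian int32 footer size, then the magic
bytes again). The name embedded_stream is hypothetical, not an Arrow API,
and the sketch does not parse the footer or the stream itself.

```python
import struct

MAGIC = b"ARROW1"
HEADER_LEN = 8  # magic bytes padded to an 8-byte boundary


def embedded_stream(file_bytes: bytes) -> bytes:
    """Return the bytes between the 8-byte header and the footer.

    Hypothetical helper for illustration only: it checks the magic bytes
    and the footer size, nothing else.
    """
    trailer_len = 4 + len(MAGIC)  # int32 footer size + trailing magic
    if len(file_bytes) < HEADER_LEN + trailer_len:
        raise ValueError("too short to be an IPC file")
    if not file_bytes.startswith(MAGIC) or not file_bytes.endswith(MAGIC):
        raise ValueError("missing Arrow magic bytes")
    # The footer size is stored right before the trailing magic bytes.
    (footer_len,) = struct.unpack_from(
        "<i", file_bytes, len(file_bytes) - trailer_len
    )
    stream_end = len(file_bytes) - trailer_len - footer_len
    if footer_len < 0 or stream_end < HEADER_LEN:
        raise ValueError("invalid footer length")
    return file_bytes[HEADER_LEN:stream_end]
```

In practice, skipping the 8 header bytes should be enough for a stream
reader that stops at the end-of-stream marker, so trimming the footer as
done here is mostly useful for inspection.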
However, there is no theoretical guarantee that this is always the case.
Consider an IPC file writer that would write record batches in reverse
order in the footer, compared to their sequential order in the
underlying stream. Or, more generally, an IPC file footer that would
repeat or skip some batches in the stream.
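
As a toy illustration of these cases, the sketch below models a file
only by the offsets of its record batch messages: once as they appear
sequentially in the stream, and once as listed in the footer. The
function name and the offsets are made up for the example, and real
footers also carry lengths and dictionary blocks that are ignored here.

```python
from typing import Sequence


def footer_matches_stream(
    stream_offsets: Sequence[int], footer_offsets: Sequence[int]
) -> bool:
    """True if the footer lists every stream batch exactly once, in
    sequential (increasing offset) order.

    Hypothetical check for illustration, not part of any Arrow API.
    """
    return list(footer_offsets) == sorted(stream_offsets)


# Hypothetical offsets of three record batch messages in the stream.
stream = [8, 264, 520]

in_order = footer_matches_stream(stream, [8, 264, 520])    # same order
reversed_ = footer_matches_stream(stream, [520, 264, 8])   # reverse order
repeated = footer_matches_stream(stream, [8, 8, 264, 520])  # repeats a batch
skipped = footer_matches_stream(stream, [8, 520])          # skips a batch
```

Only the first footer would make the file reader and the stream reader
return the same sequence of batches.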
So theoretically, we cannot assume that reading an IPC file as an IPC
stream (after skipping the 8 header bytes) returns the intended contents.
However, it would be useful to be able to rely on this assumption.
Hence these questions:
1. Do all current IPC file writers uphold this assumption?
2. Do we want to make it a more explicit requirement of the IPC file format?
Context: I've submitted a PR
(https://github.com/apache/arrow/pull/49312) to enable differential
fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of
the IPC file and stream readers on the fuzzing payload.
Regards
Antoine.