Hello,

The IPC file format is defined as the IPC stream format, preceded by a header (the Arrow magic bytes) and followed by a footer (a catalog of record batches, and the Arrow magic bytes). Thus, reading and writing IPC files can reuse the same basic building blocks as for IPC streams (this is almost trivial for writing, which is usually done sequentially).

As a consequence, in practice an IPC file is also a valid IPC stream once the 8 header bytes are skipped, and the two read as the same logical contents.

However, there is no theoretical guarantee that this is always the case. Consider an IPC file writer that lists record batches in the footer in reverse order, compared to their sequential order in the underlying stream. Or, more generally, an IPC file footer that repeats or skips some of the batches in the stream.

So theoretically, we cannot assume that reading an IPC file as an IPC stream (after skipping the 8 header bytes) returns the intended contents.
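
To make the concern concrete, here is a toy model in Python. It is only a sketch and does not use the real Arrow encoding: batches are plain lists, the footer is a list of indices into the stream, and the names `ToyIpcFile`, `read_as_stream`, `read_as_file` and `readers_agree` are made up for illustration.

```python
from dataclasses import dataclass


# Toy model only: batches are plain Python lists and footer entries are
# list indices, not real Arrow IPC messages or byte offsets.
@dataclass
class ToyIpcFile:
    stream_batches: list  # batches in the order they sit in the stream
    footer_index: list    # footer catalog: positions into stream_batches


def read_as_stream(f):
    """Sequential read: ignore the footer and walk the stream in order."""
    return list(f.stream_batches)


def read_as_file(f):
    """Random-access read: follow the footer catalog entry by entry."""
    return [f.stream_batches[i] for i in f.footer_index]


def readers_agree(f):
    """True when both readers return the same batches in the same order."""
    return read_as_stream(f) == read_as_file(f)


batches = [[1, 2], [3], [4, 5, 6]]

# Footer lists every batch once, in stream order (the usual case).
conventional = ToyIpcFile(batches, [0, 1, 2])
# Footer lists the batches in reverse order.
reversed_footer = ToyIpcFile(batches, [2, 1, 0])
# Footer skips the middle batch.
skipping_footer = ToyIpcFile(batches, [0, 2])

for name, f in [("conventional", conventional),
                ("reversed", reversed_footer),
                ("skipping", skipping_footer)]:
    print(name, "readers agree:", readers_agree(f))
```

In this model the two readers agree only when the footer lists each batch exactly once, in stream order, which is the assumption the questions below are about.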

However, it seems that it could be useful to be able to make such an assumption. Hence these questions:
1. Do all current IPC file writers uphold this assumption?
2. Do we want to make it a more explicit requirement of the IPC file format?


Context: I've submitted a PR (https://github.com/apache/arrow/pull/49312) to enable differential fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of the IPC file and stream readers on the fuzzing payload.

Regards

Antoine.
