Re: [Discuss] Equivalence of IPC file and stream formats

Dewey Dunnington Tue, 17 Feb 2026 15:18:03 -0800

Thanks for raising this!

I agree with Joel and I think it's quite useful. As a concrete example of
where this is used: nanoarrow doesn't officially support the file format
(only to the extent needed for integration testing), but this equivalence
has allowed Arrow files to be read by DuckDB's arrow extension (by skipping
the first 8 bytes and pretending the rest is a stream). This is great for
sources where random access is harder to support but where there is some
advantage to supplying the footer for clients that can use it (e.g.,
statically hosting a file via HTTP).
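
For illustration, here is a minimal Python sketch of the "skip the first 8
bytes" trick, using only the standard library. It assumes the layout from the
Arrow IPC file format description: an 8-byte header ("ARROW1" plus two padding
bytes) and a trailer made of a little-endian int32 footer size followed by
"ARROW1". The name embedded_stream and the synthetic payload are hypothetical
and only there to show the framing; a real consumer would hand the returned
bytes to an actual stream reader.

```python
import struct

MAGIC = b"ARROW1"
HEADER_LEN = 8                # magic plus two padding bytes
TRAILER_LEN = 4 + len(MAGIC)  # int32 footer size, then the trailing magic


def embedded_stream(file_bytes: bytes) -> bytes:
    """Return the bytes between the 8-byte header and the footer."""
    if len(file_bytes) < HEADER_LEN + TRAILER_LEN:
        raise ValueError("too short to be an Arrow IPC file")
    if not file_bytes.startswith(MAGIC) or not file_bytes.endswith(MAGIC):
        raise ValueError("missing ARROW1 magic bytes")
    # The footer size sits just before the trailing magic (little-endian int32).
    (footer_len,) = struct.unpack_from(
        "<i", file_bytes, len(file_bytes) - TRAILER_LEN
    )
    footer_start = len(file_bytes) - TRAILER_LEN - footer_len
    if footer_len < 0 or footer_start < HEADER_LEN:
        raise ValueError("invalid footer size")
    return file_bytes[HEADER_LEN:footer_start]


# Synthetic payload (assumption): placeholder bytes standing in for real
# messages, terminated by the end-of-stream marker. Not a real Arrow stream.
EOS = b"\xff\xff\xff\xff\x00\x00\x00\x00"
fake_stream = b"<messages>" + EOS
fake_footer = b"<footer>"
fake_file = (
    MAGIC + b"\x00\x00" + fake_stream + fake_footer
    + struct.pack("<i", len(fake_footer)) + MAGIC
)
assert embedded_stream(fake_file) == fake_stream
```

Trimming the footer is optional in this sketch: a stream reader is expected to
stop at the end-of-stream marker, so passing everything after the first 8
bytes should work as well.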


Cheers,

-dewey

On Tue, Feb 17, 2026 at 1:37 PM Joel Lubinitsky <[email protected]> wrote:

> Hi Antoine, thanks for raising this.
>
> There is a test [1] in the Go implementation that validates this behavior.
> It writes an IPC file, then ensures the same data is read back using a
> stream reader starting 8 bytes past the start of the buffer.
>
> I have personally seen and written code that assumes the embedded stream
> can be safely read to consume the contents of the IPC file. There is also
> an open PR [2] adding integration tests for related behavior, but it was
> never merged.
>
> IMHO the flexibility to consume an IPC file as a stream improves its value
> compared to alternatives. Combined with existing usage relying on this
> assumption, my preference would be toward formalizing this as an explicit
> requirement.
>
> [1] https://github.com/apache/arrow-go/blob/8d81fc39254b7a51daf1fe1a272c24169a059878/arrow/ipc/file_test.go#L83
> [2] https://github.com/apache/arrow/pull/43834
>
> Thanks,
> Joel
>
> On Tue, Feb 17, 2026 at 1:20 PM Antoine Pitrou <[email protected]> wrote:
>
> >
> > Hello,
> >
> > The IPC file format is defined as the IPC stream format, preceded by a
> > header (the Arrow magic bytes) and followed by a footer (a catalog of
> > record batches, and the Arrow magic bytes). Thus, reading and writing
> > IPC files can reuse the same basic building blocks as for IPC streams
> > (this is almost trivial for writing, which is usually done sequentially).
> >
> > As a consequence, in practice an IPC file (ignoring the 8 header bytes)
> > is also a valid IPC stream that reads as the same logical contents.
> >
> > However, there is no theoretical guarantee that this is always the case.
> > Consider an IPC file writer that would write record batches in reverse
> > order in the footer, compared to their sequential order in the
> > underlying stream. Or, more generally, an IPC file footer that would
> > repeat or skip some batches in the stream.
> >
> > So theoretically, we cannot assume that reading an IPC file as an IPC
> > stream (after skipping the 8 header bytes) returns the intended contents.
> >
> > However, it seems that it could be useful to be able to make such an
> > assumption. Hence these questions:
> > 1. Do all current IPC file writers uphold this assumption?
> > 2. Do we want to make it a more explicit requirement of the IPC file
> > format?
> >
> >
> > Context: I've submitted a PR
> > (https://github.com/apache/arrow/pull/49312) to enable differential
> > fuzzing in the C++ IPC file fuzzer, where I'm comparing the results of
> > the IPC file and stream readers on the fuzzing payload.
> >
> > Regards
> >
> > Antoine.
> >
> >
>
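
The footer-ordering concern in Antoine's message can be illustrated with a toy
model. This is a sketch under loud assumptions, not the Arrow wire format and
not the fuzzer from the PR: the "stream" is a plain list of batches in write
order, the "footer" is a list of indices into it, and every function name is
hypothetical. It shows how a reversed, skipping, or repeating footer makes the
file view and the stream view disagree, which is the kind of mismatch a
differential comparison would flag.

```python
from typing import List, Sequence, Tuple

# Toy model (assumption): a batch is just a tuple of ints.
Batch = Tuple[int, ...]


def read_as_stream(batches: Sequence[Batch]) -> List[Batch]:
    """Sequential read: batches come back in the order they were written."""
    return list(batches)


def read_as_file(batches: Sequence[Batch], footer: Sequence[int]) -> List[Batch]:
    """Footer-driven read: the footer decides which batches, in which order."""
    return [batches[i] for i in footer]


def footer_matches_stream(batches: Sequence[Batch], footer: Sequence[int]) -> bool:
    """True if the footer lists every batch exactly once, in stream order."""
    return list(footer) == list(range(len(batches)))


def differential_check(batches: Sequence[Batch], footer: Sequence[int]) -> bool:
    """True if both read paths yield the same logical contents."""
    return read_as_file(batches, footer) == read_as_stream(batches)


batches = [(1, 2), (3,), (4, 5)]

assert differential_check(batches, [0, 1, 2])      # in-order footer: views agree
assert not differential_check(batches, [2, 1, 0])  # reversed footer
assert not differential_check(batches, [0, 2])     # footer skips a batch
assert not differential_check(batches, [0, 0, 2])  # footer repeats a batch
```

In this model, making the equivalence an explicit requirement amounts to
requiring footer_matches_stream to hold for every file a writer produces.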
