Hi Rusty,

Ideally IPC stream multiplexing should be done at the transport level, for example using QUIC instead of TCP.

Regards

Antoine.



On 03/02/2026 at 21:07, Rusty Conover wrote:
Hi Dewey,

While I was thinking about this in a café in Amsterdam, another idea came to
mind: most ApplicationData does have structure.

It could be interesting to support multiplexing multiple IPC streams over the 
same socket. One way to do this would be to tag IPC messages with a destination 
stream ID, and have the IPC reader/writer emit messages annotated with that ID. 
On the receiving side, read_next() could yield messages from any active stream, 
leaving it to the user to interpret both the stream ID and the message itself.
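
A minimal sketch of the tagging idea above, using only the Python standard library. The frame layout (a uint32 stream ID followed by a uint32 payload length), the helper names `write_frame` and `read_frames`, and the placeholder payloads are all assumptions made for illustration; none of this is part of Arrow IPC or PyArrow. Each payload stands in for one serialized IPC message.

```python
import io
import struct

# Hypothetical frame header: uint32 stream ID + uint32 payload length,
# both little-endian. This layout is an assumption, not an Arrow format.
HEADER = struct.Struct("<II")


def write_frame(sink, stream_id, payload):
    """Write one payload tagged with its destination stream ID."""
    sink.write(HEADER.pack(stream_id, len(payload)))
    sink.write(payload)


def read_frames(source):
    """Yield (stream_id, payload) pairs in arrival order until EOF.

    Assumes a buffered file-like object (for example the result of
    socket.makefile("rb")) whose read(n) returns fewer than n bytes
    only at end of stream.
    """
    while True:
        header = source.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise EOFError("truncated frame header")
        stream_id, length = HEADER.unpack(header)
        payload = source.read(length)
        if len(payload) < length:
            raise EOFError("truncated frame payload")
        yield stream_id, payload


def demo():
    """Interleave two logical streams over one byte stream and read them back."""
    wire = io.BytesIO()
    write_frame(wire, 1, b"batch-A0")
    write_frame(wire, 2, b"batch-B0")
    write_frame(wire, 1, b"batch-A1")
    wire.seek(0)
    return list(read_frames(wire))
```

Here `demo()` returns the three frames in the order they were written, each paired with its stream ID, so the caller decides how to route or interpret them. That matches the "leave it to the user" behaviour described above.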

This is more of a thought experiment than a concrete proposal, since it would 
likely add significant complexity—but it felt related enough to mention.

Best,

Rusty

On Tue, Feb 3, 2026, at 8:58 PM, Dewey Dunnington wrote:
Just a note that I think a dedicated ApplicationData message is a great
idea. Many spatial file formats embed a spatial index, which is too large
for schema metadata. In many cases it already exists in data I'd like to
stream over IPC, but there's no good place to put it, so it just gets
dropped by the producer and recomputed by the consumer. I had considered at
one point prototyping an Arrow-based spatial format that placed this type of
data in an Arrow file, with the extra spatial information after the EOS and
before the file footer; however, ApplicationData would be a much cleaner
approach. There are many instances of custom file formats built on top of
SQLite, and I wonder if ApplicationData would open up something like that
for Arrow IPC (beyond just my spatial concept).

Cheers,

-dewey

On Tue, Feb 3, 2026 at 1:28 PM Rusty Conover <[email protected]> wrote:

Hi Antoine,

It is nice to hear from you!

> (I would perhaps also call it "application data" or something)

I’m happy with ApplicationData as the name.

> On the face of it, this looks like a reasonable idea, though I wonder if
> it should be a separate message type *or* an optional field carried
> together in RecordBatches.

The main issue with carrying this in RecordBatch metadata is ordering.
While IPC already supports `custom_metadata` via `write_batch` (which I’ve
been using), that approach assumes the application data can be attached to
a specific batch.

In some cases, the application data and record batches are produced
independently and cannot be cleanly associated. A concrete example is
interleaving stderr output (arbitrary log messages) with record batches
written to stdout, while preserving a single ordered IPC stream.

I experimented with using zero-row record batches as a workaround, but
this is inefficient: even with no rows, the serialized message size grows
with schema complexity. I’ve measured this across several schemas; details
and code are here:

https://gist.github.com/rustyconover/6ff8cbd93369735287d80ae60436379e

In short, zero-row batches can cost anywhere from ~120 bytes for simple
schemas to ~450+ bytes for more complex ones, which makes this approach
unattractive when trying to minimize bytes on the wire.

For these reasons, a distinct IPC message type for application data seems
like the cleanest solution. I’d be very interested in whether others have
run into the need for this as well.

Rusty


On Tue, Feb 3, 2026, at 5:58 PM, Antoine Pitrou wrote:
Hi Rusty,



Regards

Antoine.


On 03/02/2026 at 17:31, Rusty Conover wrote:
Hi Arrow Friends,

I’ve really appreciated Arrow Flight’s ability to carry custom metadata
messages alongside record batches. In some of my current work, however, I’m
dealing with Arrow IPC streams that are *not* sent via Flight, and I’d like
to have a comparable capability there as well.

To support this, I’d like to propose adding a new IPC message type,
tentatively named `OpaqueBytes`, that would allow arbitrary bytes to
be embedded directly within IPC streams. IPC readers that do not understand
this message type could safely ignore it, preserving compatibility.
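
To illustrate the "ignore what you do not understand" behaviour, here is a rough standard-library sketch. It assumes a simplified, hypothetical framing (a one-byte type tag plus a uint32 body length); real Arrow IPC messages carry a Flatbuffers-encoded header instead, so the tag values, `write_message`, `read_messages`, and `build_stream` are illustrative names only. The point is just that a length-prefixed body lets a reader skip an unknown type without losing its place in the stream.

```python
import io
import struct

# Hypothetical one-byte type tags for this sketch only; they are not
# Arrow's real message header values.
SCHEMA = 1
RECORD_BATCH = 2
OPAQUE_BYTES = 3

# Hypothetical header: uint8 type tag + uint32 body length, little-endian.
HEADER = struct.Struct("<BI")


def write_message(sink, msg_type, body):
    """Write one length-prefixed message with its type tag."""
    sink.write(HEADER.pack(msg_type, len(body)))
    sink.write(body)


def read_messages(source, known_types):
    """Yield (msg_type, body) for known types and silently skip the rest."""
    while True:
        header = source.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise EOFError("truncated message header")
        msg_type, length = HEADER.unpack(header)
        body = source.read(length)
        if len(body) < length:
            raise EOFError("truncated message body")
        if msg_type in known_types:
            yield msg_type, body
        # Unknown types fall through: the body has already been consumed,
        # so the reader stays aligned on the next header.


def build_stream():
    """Build a stream with a log message interleaved between two batches."""
    wire = io.BytesIO()
    write_message(wire, SCHEMA, b"schema")
    write_message(wire, RECORD_BATCH, b"batch-0")
    write_message(wire, OPAQUE_BYTES, b"log: halfway done")
    write_message(wire, RECORD_BATCH, b"batch-1")
    return wire.getvalue()
```

A reader called with `known_types={SCHEMA, RECORD_BATCH}` sees only the schema and the two batches, while one that also lists `OPAQUE_BYTES` receives the log message in its original position between them.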

My motivation is to enable multiplexing of auxiliary messages within a
stream that otherwise consists of schemas, dictionaries, and record
batches. A concrete example would be interleaving logging or signaling
messages with record batches. Today, I’m approximating this by emitting
zero-row record batches with binary metadata, but this approach is awkward
and incurs unnecessary overhead due to schema complexity.

An `OpaqueBytes` IPC message type could enable a range of use cases,
including (but not limited to) logging, flow control, signaling, and other
auxiliary communication needs that don’t naturally map to record batches.

I briefly discussed this idea a few weeks ago on the Apache Arrow call,
but wanted to share it here to reach a broader audience and gather more
feedback.

In addition to the message type itself, I’d also be interested in
hearing thoughts on how PyArrow’s interfaces might be extended to allow
users to read and write these arbitrary messages as part of existing IPC
stream readers and writers.

Looking forward to your thoughts and discussion.

Kind regards,
Rusty

