Re: [DISCUSS] IPC MessageType - OpaqueBytes

Rusty Conover Tue, 03 Feb 2026 11:28:07 -0800

Hi Antoine,

It is nice to hear from you!

> (I would perhaps also call it "application data" or something)

I’m happy with ApplicationData as the name.

> On the face of it, this looks like a reasonable idea, though I wonder if 
> it should be a separate message type *or* an optional field carried 
> together in RecordBatches.

The main issue with carrying this in RecordBatch metadata is ordering. While 
IPC already supports `custom_metadata` via `write_batch` (which I’ve been 
using), that approach assumes the application data can be attached to a 
specific batch.

In some cases, the application data and record batches are produced 
independently and cannot be cleanly associated. A concrete example is 
interleaving stderr output (arbitrary log messages) with record batches written 
to stdout, while preserving a single ordered IPC stream.

I experimented with using zero-row record batches as a workaround, but this is 
inefficient: even with no rows, the serialized message size grows with schema 
complexity. I’ve measured this across several schemas; details and code are 
here:

https://gist.github.com/rustyconover/6ff8cbd93369735287d80ae60436379e

In short, zero-row batches can cost anywhere from ~120 bytes for simple schemas 
to ~450+ bytes for more complex ones, which makes this approach unattractive 
when trying to minimize bytes on the wire.

For these reasons, a distinct IPC message type for application data seems like 
the cleanest solution. I’d be very interested in whether others have run into 
the need for this as well.

Rusty

On Tue, Feb 3, 2026, at 5:58 PM, Antoine Pitrou wrote:
> Hi Rusty,
>
>
>
> Regards
>
> Antoine.
>
>
> Le 03/02/2026 à 17:31, Rusty Conover a écrit :
>> Hi Arrow Friends,
>> 
>> I’ve really appreciated Arrow Flight’s ability to carry custom metadata 
>> messages alongside record batches. In some of my current work, however, I’m 
>> dealing with Arrow IPC streams that are *not* sent via Flight, and I’d like 
>> to have a comparable capability there as well.
>> 
>> To support this, I’d like to propose adding a new IPC message 
>> type, tentatively named `OpaqueBytes`, that would allow arbitrary bytes to 
>> be embedded directly within IPC streams. IPC readers that do not understand 
>> this message type could safely ignore it, preserving compatibility.
>> 
>> My motivation is to enable multiplexing of auxiliary messages within a 
>> stream that otherwise consists of schemas, dictionaries, and record batches. 
>> A concrete example would be interleaving logging or signaling messages with 
>> record batches. Today, I’m approximating this by emitting zero-row record 
>> batches with binary metadata, but this approach is awkward and incurs 
>> unnecessary overhead due to schema complexity.
>> 
>> An `OpaqueBytes` IPC message type could enable a range of use cases, 
>> including (but not limited to) logging, flow control, signaling, and other 
>> auxiliary communication needs that don’t naturally map to record batches.
>> 
>> I briefly discussed this idea a few weeks ago on the Apache Arrow call, but 
>> wanted to share it here to reach a broader audience and gather more feedback.
>> 
>> In addition to the message type itself, I’d also be interested in hearing 
>> thoughts on how PyArrow’s interfaces might be extended to allow users to 
>> read and write these arbitrary messages as part of existing IPC stream 
>> readers and writers.
>> 
>> Looking forward to your thoughts and discussion.
>> 
>> Kind regards,
>> Rusty
