[PR] Add BinaryFormatSupport to `arrow-avro` Writer [arrow-rs]

via GitHub Wed, 14 Jan 2026 07:51:44 -0800


jecsand838 opened a new pull request, #9171:
URL: https://github.com/apache/arrow-rs/pull/9171


   # Which issue does this PR close?
   
   - Closes #8701.
   
   # Rationale for this change
   
   `arrow-avro` already supports writing Avro Object Container Files (OCF) and 
framed streaming encodings (e.g. Single-Object Encoding / registry wire 
formats). However, many systems exchange **raw Avro binary datum payloads** 
(i.e. *only* the Avro record body bytes) while supplying the schema out-of-band 
(configuration, RPC contract, topic metadata, etc.).
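
To make the distinction concrete, here is a minimal, standard-library-only sketch of what "raw datum" versus "framed" means at the byte level. It does not use the `arrow-avro` API. The record shape (`id: long`, `name: string`) and the function names `write_long`, `write_string`, `encode_datum`, and `frame_single_object` are hypothetical and chosen only for illustration. The encoding rules follow the Avro specification: zigzag variable-length integers, length-prefixed strings, and the `C3 01` marker plus an 8-byte little-endian schema fingerprint for Single-Object Encoding.

```rust
/// Append an Avro `long`: zigzag-encode the value, then write it as a
/// variable-length integer, 7 bits per byte, least-significant group first.
fn write_long(value: i64, out: &mut Vec<u8>) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

/// Append an Avro `string`: byte length as a `long`, then the UTF-8 bytes.
fn write_string(value: &str, out: &mut Vec<u8>) {
    write_long(value.len() as i64, out);
    out.extend_from_slice(value.as_bytes());
}

/// Unframed datum for a hypothetical record {"id": long, "name": string}:
/// the field encodings concatenated in schema order, nothing else.
fn encode_datum(id: i64, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    write_long(id, &mut out);
    write_string(name, &mut out);
    out
}

/// The same datum wrapped in Single-Object Encoding framing:
/// 2-byte marker, 8-byte little-endian schema fingerprint, then the datum.
/// The fingerprint is passed in; computing CRC-64-AVRO is out of scope here.
fn frame_single_object(fingerprint: u64, datum: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + datum.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(datum);
    out
}
```

In this sketch the framed form is always 10 bytes longer than the raw datum. That prefix is the overhead a consumer that already knows the schema has no use for.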
   
   Without first-class support for unframed datum output, users must either:
   - accept framing overhead that downstream systems don’t expect, or
   - re-implement datum encoding themselves.
   
   This PR adds the missing unframed write path and exposes a row-by-row 
encoding API to make it easy to embed Avro datums into other transport 
protocols.
   
   # What changes are included in this PR?
   
   - Added `AvroBinaryFormat` (unframed) as an `AvroFormat` implementation to 
emit **raw Avro record body bytes** (no SOE prefix and no OCF header) and to 
explicitly reject container-level compression for this format.
   - Added `RecordEncoder::encode_rows` to encode a `RecordBatch` into a single 
contiguous buffer while tracking per-row boundaries via appended offsets.
   - Introduced a higher-level `Encoder` + `EncodedRows` API for row-by-row 
streaming use cases, providing zero-copy access to individual row slices (via 
`Bytes`).
   - Updated the writer API to provide `build_encoder` for stream formats (e.g. 
SOE) and added row-capacity configuration to better support 
incremental/streaming workflows.
   - Added the `bytes` crate as a dependency to support efficient buffering and 
slicing in the row encoder, and adjusted dev-dependencies to support the new 
tests/docs.
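
The "single contiguous buffer plus offsets" layout described above can be sketched as follows. This is an illustration using only the standard library, not the PR's implementation: `append_rows`, `row`, and `row_count` are hypothetical names, each "row" is assumed to be pre-encoded bytes, and the convention of N+1 offsets for N rows is an assumption made for this sketch. The PR itself exposes row slices via `Bytes`, which allows sharing the buffer without copying; plain borrowed slices stand in for that here.

```rust
/// Append already-encoded rows to `data`, recording boundaries in `offsets`.
/// Assumed convention: `offsets` holds N + 1 entries for N rows, so row `i`
/// occupies `data[offsets[i]..offsets[i + 1]]`.
fn append_rows(rows: &[&[u8]], data: &mut Vec<u8>, offsets: &mut Vec<usize>) {
    // On first use, record where the first row starts. This keeps the
    // layout valid even when `data` already contains unrelated bytes.
    if offsets.is_empty() {
        offsets.push(data.len());
    }
    for row in rows {
        data.extend_from_slice(row);
        offsets.push(data.len());
    }
}

/// Number of rows described by an offsets list.
fn row_count(offsets: &[usize]) -> usize {
    offsets.len().saturating_sub(1)
}

/// Borrow the bytes of row `i` without copying.
fn row<'a>(data: &'a [u8], offsets: &[usize], i: usize) -> &'a [u8] {
    &data[offsets[i]..offsets[i + 1]]
}
```

Under this convention the offsets are non-decreasing, an empty row shows up as two equal adjacent offsets, and the last offset always equals the buffer length. These are the kinds of invariants an append-to-existing-buffer test would check.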
   
   # Are these changes tested?
   
   Yes.
   
   This PR adds unit tests that cover:
   - single- and multi-column row encoding
   - nullable columns
   - prefix-based vs. unprefixed row encoding behavior
   - empty batch encoding
   - appending to existing output buffers and validating offset invariants
   
   # Are there any user-facing changes?
   
   Yes. The changes are additive; no breaking public API changes are expected.
   
   - New writer format support for **unframed Avro binary datum** output 
(`AvroBinaryFormat`).
   - New row-by-row encoding APIs (`RecordEncoder::encode_rows`, `Encoder`, 
`EncodedRows`) to support zero-copy access to per-row encoded bytes.
   - New `WriterBuilder` functionality (`build_encoder` + row-capacity 
configuration) to enable encoder construction without committing to a specific 
`Write` sink.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
