jecsand838 opened a new issue, #8701:
URL: https://github.com/apache/arrow-rs/issues/8701
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
Currently `arrow-avro` can write **OCF** container files and **SOE**
(Single‑Object Encoding) streams, and it can read **OCF** and framed streams
(SOE / Confluent / Apicurio). It **cannot** write or read *unframed* Avro
**binary "datum"** payloads (i.e., raw Avro record bodies without an
SOE/registry prefix or OCF header). This makes it difficult to interoperate
with systems that exchange naked Avro bodies while providing the schema
out‑of‑band (configuration, RPC contract, topic metadata, etc.).
Concretely:
* **Writer**: there is no `Writer` format that emits *only* the Avro body
bytes per record. SOE always adds a 2‑byte magic + fingerprint (or ID) prefix,
and OCF writes a file header/blocks.
* **Reader**: `ReaderBuilder::build_decoder` **requires** a `SchemaStore`
and expects a frame prefix; when the prefix is missing it errors with "Missing
magic bytes and fingerprint." This prevents decoding raw Avro bodies when the
schema is known upfront.
**Describe the solution you'd like**
Add first‑class **Binary (unframed) format** support to both the writer and
the reader:
1. **Writer**: new unframed stream format
* In `arrow-avro/src/writer/format.rs`:
* Implement a const‑generic `AvroStreamFormat<const PREFIXED: bool>`
templated from the current `AvroSoeFormat` implementation
* Alias `type AvroSoeFormat = AvroStreamFormat<true>` and `type
AvroBinaryFormat = AvroStreamFormat<false>`. The second alias provides the
new unframed binary format without duplicating code.
* In `arrow-avro/src/writer/mod.rs`, add a public alias called
`AvroRawStreamWriter` as convenience mirroring `AvroStreamWriter`.
> Rationale: the existing `AvroFormat` abstraction already distinguishes
framed vs unframed by `NEEDS_PREFIX` and `sync_marker()`; the new format simply
sets `NEEDS_PREFIX = false` and writes nothing at stream start, yielding only
Avro bodies from `Writer::write_stream`.
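The writer-side idea above can be sketched as follows. This is a minimal, self-contained model, not the actual `arrow-avro` `AvroFormat` trait: the `Format` trait, `prefix` method, and fingerprint handling here are illustrative assumptions, and the real trait has more surface (e.g., `sync_marker()`).

```rust
/// SOE magic bytes per the Avro spec.
const SOE_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Simplified stand-in for arrow-avro's `AvroFormat` (hypothetical shape).
trait Format {
    const NEEDS_PREFIX: bool;
    /// Bytes emitted before each Avro body (empty when unframed).
    fn prefix(&self, fingerprint: u64) -> Vec<u8>;
}

/// One const-generic implementation covers both framings.
struct AvroStreamFormat<const PREFIXED: bool>;

impl<const PREFIXED: bool> Format for AvroStreamFormat<PREFIXED> {
    const NEEDS_PREFIX: bool = PREFIXED;
    fn prefix(&self, fingerprint: u64) -> Vec<u8> {
        if PREFIXED {
            // SOE: 2-byte magic + 8-byte little-endian fingerprint
            let mut p = SOE_MAGIC.to_vec();
            p.extend_from_slice(&fingerprint.to_le_bytes());
            p
        } else {
            Vec::new() // Binary mode: emit the Avro body alone
        }
    }
}

// Aliases mirroring the proposal (type-level names, shown for illustration).
#[allow(dead_code)]
type AvroSoeFormat = AvroStreamFormat<true>;
#[allow(dead_code)]
type AvroBinaryFormat = AvroStreamFormat<false>;

fn main() {
    let body: &[u8] = b"\x02\x06foo"; // placeholder Avro-encoded record body
    let fp = 0x0123_4567_89AB_CDEFu64;

    let soe = [AvroStreamFormat::<true>.prefix(fp).as_slice(), body].concat();
    let raw = [AvroStreamFormat::<false>.prefix(fp).as_slice(), body].concat();

    assert_eq!(&soe[..2], &SOE_MAGIC);
    assert_eq!(soe.len(), 2 + 8 + body.len());
    assert_eq!(raw, body); // unframed output is just the body
}
```

The point of the const-generic shape is that the body-encoding path is written once; only the prefix emission branches on `PREFIXED`, so `AvroBinaryFormat` falls out of the existing SOE implementation for free.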
2. **Reader**: opt‑in unframed decoding via
`ReaderBuilder::with_reader_schema`
* Enable `ReaderBuilder::build_decoder` to construct a `Decoder` for
**unframed raw binary** when a reader schema is provided **without** a
`SchemaStore`:
* In `arrow-avro/src/reader/mod.rs`:
* **Builder rule**: If `writer_schema_store` is `None` **and**
`reader_schema` is `Some`, `build_decoder()` creates a decoder pre‑configured
for **unframed** inputs. The `reader_schema` is assumed to be **identical** to
the writer schema and *no schema resolution* is supported.
* **Decoder state**: Add a small toggle (e.g., `unframed: bool` or
`enum PrefixMode { Framed, Unframed }`). When `unframed == true`, `decode()`
must **skip** `handle_prefix` and immediately try to decode exactly 1 row body
via `active_decoder.decode(&data[..], 1)`, respecting `batch_size`, and return
consumed bytes accordingly. The current hard error path "Missing magic bytes
and fingerprint" should not trigger in this mode.
* **Safety / behavior**:
- If the byte stream *does* start with a known framing prefix
(SOE/Confluent/Apicurio), return a clear `ArrowError::AvroError("Unexpected
framed prefix in unframed (Binary) mode")` to avoid ambiguous behavior.
- If neither `SchemaStore` **nor** `reader_schema` is provided, keep
returning `InvalidArgumentError` (existing documented behavior) to guide users.
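The decoder-side behavior proposed in item 2 can be sketched as a standalone function. Names here (`PrefixMode`, `handle_prefix`) follow the proposal but are not the actual `arrow-avro` internals; only the SOE framing is modeled, and body decoding is elided.

```rust
/// SOE magic bytes per the Avro spec.
const SOE_MAGIC: [u8; 2] = [0xC3, 0x01];

#[derive(Debug, PartialEq)]
enum PrefixMode {
    Framed,
    Unframed,
}

/// Returns how many prefix bytes to consume before body decode, or an
/// error mirroring the messages discussed in this issue (hypothetical
/// stand-in for the real `handle_prefix` path).
fn handle_prefix(mode: PrefixMode, data: &[u8]) -> Result<usize, String> {
    match mode {
        PrefixMode::Framed => {
            // Existing path: require magic + fingerprint before the body.
            if data.len() < 10 || data[..2] != SOE_MAGIC {
                Err("Missing magic bytes and fingerprint".into())
            } else {
                Ok(10) // 2-byte magic + 8-byte fingerprint consumed
            }
        }
        PrefixMode::Unframed => {
            // Proposed safety check: reject a recognizable frame outright.
            if data.len() >= 2 && data[..2] == SOE_MAGIC {
                Err("Unexpected framed prefix in unframed (Binary) mode".into())
            } else {
                Ok(0) // no prefix: decode the Avro body directly
            }
        }
    }
}

fn main() {
    let body = b"\x36"; // placeholder Avro body
    let mut framed = vec![0xC3, 0x01];
    framed.extend_from_slice(&42u64.to_le_bytes());
    framed.extend_from_slice(body);

    assert_eq!(handle_prefix(PrefixMode::Framed, &framed), Ok(10));
    assert_eq!(handle_prefix(PrefixMode::Unframed, body), Ok(0));
    assert!(handle_prefix(PrefixMode::Unframed, &framed).is_err());
    assert!(handle_prefix(PrefixMode::Framed, body).is_err());
}
```

Keeping the check in one prefix-handling step, rather than a separate decoder type, is what lets the existing row-by-row body decode be reused unchanged in both modes.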
**Describe alternatives you've considered**
* **Keep requiring a prefix and force users to add SOE/Confluent wrappers.**
This breaks compatibility with ecosystems that exchange *only* Avro bodies
(no registry, no framing) and would force users to hand‑craft prefixes that the
other side doesn’t expect. It also goes against the desire (tracked in recent
issues) to reserve `AvroBinaryFormat` for exactly this unframed scenario.
* **Introduce a separate low‑level "datum decoder" type.**
Functionally similar, but adds a duplicate API surface and extra
complexity. The existing `Decoder` already handles row‑by‑row streaming with a
clear separation between "prefix handling" and "body decode"; a small mode
toggle keeps the API cohesive.
**Additional context**
* **Spec references**
* **SOE** is defined by
[Avro](https://avro.apache.org/docs/1.11.1/specification/) as 2‑byte magic
`0xC3 0x01` + fingerprint + Avro body; this is the framing Arrow supports today
for streams. **Binary** in this issue refers to the Avro body alone (no prefix,
no header). OCF remains unchanged.
* `arrow-avro` docs list SOE and OCF as the two writer formats today and
describe framed decoding (SOE/Confluent/Apicurio) for the streaming reader.
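For reference, the recognized framings differ only in their prefix; the "Binary" mode proposed here is the body with no prefix at all. A sketch of the byte layouts (SOE per the Avro spec, Confluent per the Schema Registry wire format; the body bytes are placeholders, and Apicurio's header, not modeled here, is a further variant):

```rust
/// SOE framing: 2-byte magic 0xC3 0x01 + 8-byte little-endian
/// CRC-64-AVRO fingerprint, then the Avro body.
fn soe(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = vec![0xC3, 0x01];
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Confluent wire format: 1-byte magic 0x00 + 4-byte big-endian
/// schema registry ID, then the Avro body.
fn confluent(schema_id: u32, body: &[u8]) -> Vec<u8> {
    let mut out = vec![0x00];
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(body);
    out
}

fn main() {
    let body = b"\x36"; // e.g., Avro long 27 (zig-zag varint encoding)
    assert_eq!(soe(0xABCD, body).len(), 2 + 8 + body.len());
    assert_eq!(confluent(42, body).len(), 1 + 4 + body.len());
    // "Binary" mode in this issue is just `body`: no magic, no ID.
}
```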
* **Why this matters in practice**
Popular systems (e.g., Databricks `from_avro`/`to_avro`) work with *binary
Avro* columns and [allow supplying schemas
manually](https://docs.databricks.com/aws/en/structured-streaming/avro-dataframe#manually-specified-schema-example)
(no frame needed). Adding Binary mode in `arrow-avro` eliminates glue code and
improves interop for stream processors and RPC frameworks that exchange
frameless Avro datums with out‑of‑band schema agreements.
* **Backward compatibility**
* The change is additive. Existing OCF/SOE read/write codepaths are
unaffected.
* `build_decoder()` continues to error if neither a store nor a reader
schema is provided, preserving the documented contract for framed decoding.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]