jecsand838 opened a new issue, #8701:
URL: https://github.com/apache/arrow-rs/issues/8701

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently `arrow-avro` can write **OCF** container files and **SOE** 
(Single‑Object Encoding) streams, and it can read **OCF** and framed streams 
(SOE / Confluent / Apicurio). It **cannot** write or read *unframed* Avro 
**binary "datum"** payloads (i.e., raw Avro record bodies without an 
SOE/registry prefix or OCF header). This makes it difficult to interoperate 
with systems that exchange naked Avro bodies while providing the schema 
out‑of‑band (configuration, RPC contract, topic metadata, etc.).
   
   Concretely:
   * **Writer**: there is no `Writer` format that emits *only* the Avro body 
bytes per record. SOE always adds a 2‑byte magic + fingerprint (or ID) prefix, 
and OCF writes a file header/blocks.
   * **Reader**: `ReaderBuilder::build_decoder` **requires** a `SchemaStore` 
and expects a frame prefix; when the prefix is missing it errors with "Missing 
magic bytes and fingerprint." This prevents decoding raw Avro bodies when the 
schema is known upfront.
   
   **Describe the solution you'd like**
   
   Add first‑class **Binary (unframed) format** support to both the writer and 
the reader:
   
   1. **Writer**: new unframed stream format
       * In `arrow-avro/src/writer/format.rs`:
            * Implement a const‑generic `AvroStreamFormat<const PREFIXED: bool>` modeled on the current `AvroSoeFormat` implementation
            * Alias `type AvroSoeFormat = AvroStreamFormat<true>` and `type AvroBinaryFormat = AvroStreamFormat<false>`. The second alias provides the new unframed `AvroBinaryFormat` without duplicating code.
        * In `arrow-avro/src/writer/mod.rs`, add a public `AvroRawStreamWriter` alias as a convenience mirroring `AvroStreamWriter`.
   
       > Rationale: the existing `AvroFormat` abstraction already distinguishes 
framed vs unframed by `NEEDS_PREFIX` and `sync_marker()`; the new format simply 
sets `NEEDS_PREFIX = false` and writes nothing at stream start, yielding only 
Avro bodies from `Writer::write_stream`.
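
   The const‑generic pattern in (1) could be sketched as below. This is a standalone illustration, not the actual `arrow-avro` code: the `Format` trait, the `start_bytes` method, and the struct bodies here are hypothetical stand‑ins for the real `AvroFormat` machinery.

   ```rust
   /// Hypothetical stand-in for arrow-avro's `AvroFormat` trait.
   trait Format {
       /// Whether each record body must be preceded by a framing prefix.
       const NEEDS_PREFIX: bool;
       /// Bytes written once at stream start (empty for prefix-less streams).
       fn start_bytes(&self) -> Vec<u8>;
   }

   /// One implementation serves both framed (SOE) and unframed (Binary) output;
   /// the const parameter is the only difference between the two aliases.
   struct AvroStreamFormat<const PREFIXED: bool>;

   impl<const PREFIXED: bool> Format for AvroStreamFormat<PREFIXED> {
       const NEEDS_PREFIX: bool = PREFIXED;
       fn start_bytes(&self) -> Vec<u8> {
           Vec::new() // neither SOE nor Binary writes a stream-level header
       }
   }

   type AvroSoeFormat = AvroStreamFormat<true>;
   type AvroBinaryFormat = AvroStreamFormat<false>;

   fn main() {
       assert!(AvroSoeFormat::NEEDS_PREFIX);
       assert!(!AvroBinaryFormat::NEEDS_PREFIX);
       assert!(AvroStreamFormat::<false>.start_bytes().is_empty());
   }
   ```

   Keying a single `impl` on `PREFIXED` means any future fix to the stream format automatically applies to both aliases.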
   
   2. **Reader**: opt‑in unframed decoding via 
`ReaderBuilder::with_reader_schema`
       * Enable `ReaderBuilder::build_decoder` to construct a `Decoder` for 
**unframed raw binary** when a reader schema is provided **without** a 
`SchemaStore`:
       * In `arrow-avro/src/reader/mod.rs`:
         * **Builder rule**: If `writer_schema_store` is `None` **and** 
`reader_schema` is `Some`, `build_decoder()` creates a decoder pre‑configured 
for **unframed** inputs. The `reader_schema` is assumed to be **identical** to 
the writer schema and *no schema resolution* is supported.
          * **Decoder state**: Add a small toggle (e.g., `unframed: bool` or an `enum PrefixMode { Framed, Unframed }`). When `unframed == true`, `decode()` must **skip** `handle_prefix` and immediately try to decode one row body at a time via `active_decoder.decode(&data[..], 1)`, respecting `batch_size`, and return the number of bytes consumed. The current hard error path "Missing magic bytes and fingerprint" must not trigger in this mode.
       * **Safety / behavior**:
           - If the byte stream *does* start with a known framing prefix 
(SOE/Confluent/Apicurio), return a clear `ArrowError::AvroError("Unexpected 
framed prefix in unframed (Binary) mode")` to avoid ambiguous behavior.
           - If neither `SchemaStore` **nor** `reader_schema` is provided, keep 
returning `InvalidArgumentError` (existing documented behavior) to guide users.
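
   The ambiguity check from the safety bullet could look like the following standalone sketch. `PrefixMode` and `check_unframed_input` are hypothetical names, not existing `Decoder` internals; note that the sketch only rejects the SOE magic `0xC3 0x01`, since Confluent's single `0x00` magic byte can also begin a perfectly valid raw Avro body and cannot be detected unambiguously.

   ```rust
   /// Hypothetical decode-path toggle; the real Decoder internals differ.
   #[derive(PartialEq)]
   enum PrefixMode {
       Framed,
       Unframed,
   }

   /// 2-byte Single-Object Encoding magic per the Avro specification.
   const SOE_MAGIC: [u8; 2] = [0xC3, 0x01];

   /// Reject framed input in unframed mode rather than silently mis-decoding it.
   fn check_unframed_input(mode: &PrefixMode, data: &[u8]) -> Result<(), String> {
       if *mode == PrefixMode::Unframed && data.starts_with(&SOE_MAGIC) {
           return Err("Unexpected framed prefix in unframed (Binary) mode".into());
       }
       Ok(())
   }

   fn main() {
       // A raw Avro body (here: zig-zag encoded long `1` = 0x02) passes the check...
       assert!(check_unframed_input(&PrefixMode::Unframed, &[0x02]).is_ok());
       // ...while an SOE-framed payload is rejected in unframed mode.
       assert!(check_unframed_input(&PrefixMode::Unframed, &[0xC3, 0x01, 0, 0]).is_err());
       // Framed mode is unaffected by this check.
       assert!(check_unframed_input(&PrefixMode::Framed, &[0xC3, 0x01]).is_ok());
   }
   ```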
   
   **Describe alternatives you've considered**
   
   * **Keep requiring a prefix and force users to add SOE/Confluent wrappers.**
     This breaks compatibility with ecosystems that exchange *only* Avro bodies 
(no registry, no framing) and would force users to hand‑craft prefixes that the 
other side doesn’t expect. It also goes against the desire (tracked in recent 
issues) to reserve `AvroBinaryFormat` for exactly this unframed scenario.
   * **Introduce a separate low‑level "datum decoder" type.**
     Functionally similar, but it adds a duplicate API surface and extra complexity. The existing `Decoder` already handles row‑by‑row streaming with a clear separation between "prefix handling" and "body decode"; a small mode toggle keeps the API cohesive.
   
   **Additional context**
   
   * **Spec references**
     * **SOE** is defined by 
[Avro](https://avro.apache.org/docs/1.11.1/specification/) as 2‑byte magic 
`0xC3 0x01` + fingerprint + Avro body; this is the framing Arrow supports today 
for streams. **Binary** in this issue refers to the Avro body alone (no prefix, 
no header). OCF remains unchanged.
     * `arrow-avro` docs list SOE and OCF as the two writer formats today and 
describe framed decoding (SOE/Confluent/Apicurio) for the streaming reader.
   * **Why this matters in practice**
     Popular systems (e.g., Databricks `from_avro`/`to_avro`) work with *binary
Avro* columns and [allow supplying schemas 
manually](https://docs.databricks.com/aws/en/structured-streaming/avro-dataframe#manually-specified-schema-example)
 (no frame needed). Adding Binary mode in `arrow-avro` eliminates glue code and 
improves interop for stream processors and RPC frameworks that exchange 
frameless Avro datums with out‑of‑band schema agreements.
   * **Backward compatibility**
     * The change is additive. Existing OCF/SOE read/write codepaths are 
unaffected.
     * `build_decoder()` continues to error if neither a store nor a reader 
schema is provided, preserving the documented contract for framed decoding.
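      * For reference, the 10‑byte prefix that SOE adds, and that the proposed Binary mode omits, is the 2‑byte magic followed by the 8‑byte CRC‑64‑AVRO schema fingerprint in little‑endian byte order, per the Avro spec. A small sketch (the `soe_prefix` helper is illustrative, not an `arrow-avro` API):

        ```rust
        /// Build the 10-byte Single-Object Encoding prefix:
        /// magic 0xC3 0x01, then the CRC-64-AVRO fingerprint, little-endian.
        /// Binary (unframed) mode would emit the Avro body with no such prefix.
        fn soe_prefix(fingerprint: u64) -> [u8; 10] {
            let mut p = [0u8; 10];
            p[0] = 0xC3;
            p[1] = 0x01;
            p[2..].copy_from_slice(&fingerprint.to_le_bytes());
            p
        }

        fn main() {
            let p = soe_prefix(0x0102030405060708);
            assert_eq!(p[0], 0xC3); // magic byte 1
            assert_eq!(p[1], 0x01); // magic byte 2
            assert_eq!(p[2], 0x08); // least-significant fingerprint byte first
            assert_eq!(p[9], 0x01); // most-significant fingerprint byte last
        }
        ```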
   

