jecsand838 opened a new pull request, #9171: URL: https://github.com/apache/arrow-rs/pull/9171
# Which issue does this PR close?

- Closes #8701.

# Rationale for this change

`arrow-avro` already supports writing Avro Object Container Files (OCF) and framed streaming encodings (e.g. Single-Object Encoding / registry wire formats). However, many systems exchange **raw Avro binary datum payloads** (i.e. *only* the Avro record body bytes) while supplying the schema out-of-band (configuration, RPC contract, topic metadata, etc.).

Without first-class support for unframed datum output, users must either:

- accept framing overhead that downstream systems don’t expect, or
- re-implement datum encoding themselves.

This PR adds the missing unframed write path and exposes a row-by-row encoding API to make it easy to embed Avro datums into other transport protocols.

# What changes are included in this PR?

- Added `AvroBinaryFormat` (unframed) as an `AvroFormat` implementation to emit **raw Avro record body bytes** (no SOE prefix and no OCF header) and to explicitly reject container-level compression for this format.
- Added `RecordEncoder::encode_rows` to encode a `RecordBatch` into a single contiguous buffer while tracking per-row boundaries via appended offsets.
- Introduced a higher-level `Encoder` + `EncodedRows` API for row-by-row streaming use cases, providing zero-copy access to individual row slices (via `Bytes`).
- Updated the writer API to provide `build_encoder` for stream formats (e.g. SOE) and added row-capacity configuration to better support incremental/streaming workflows.
- Added the `bytes` crate as a dependency to support efficient buffering and slicing in the row encoder, and adjusted dev-dependencies to support the new tests/docs.

# Are these changes tested?

Yes. This PR adds unit tests that cover:

- single- and multi-column row encoding
- nullable columns
- prefix-based vs. unprefixed row encoding behavior
- empty batch encoding
- appending to existing output buffers and validating offset invariants

# Are there any user-facing changes?

Yes, these changes are additive (no breaking public API changes expected).

- New writer format support for **unframed Avro binary datum** output (`AvroBinaryFormat`).
- New row-by-row encoding APIs (`RecordEncoder::encode_rows`, `Encoder`, `EncodedRows`) to support zero-copy access to per-row encoded bytes.
- New `WriterBuilder` functionality (`build_encoder` + row-capacity configuration) to enable encoder construction without committing to a specific `Write` sink.
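
For readers unfamiliar with the layout described above (unframed record bodies packed into one contiguous buffer, with row boundaries tracked as offsets), the following is a minimal, std-only sketch of the idea. It is not the `arrow-avro` implementation or API: `Row`, `EncodedRowsSketch`, `write_long`, `write_string`, and this `encode_rows` function are hypothetical names invented for illustration, and the two-field schema (`id: long`, `name: string`) is an assumption. The only parts taken from the Avro specification are the encodings themselves: a `long` is a zigzag-mapped base-128 varint, a `string` is its byte length as a `long` followed by the UTF-8 bytes, and a record body is the concatenation of its fields in schema order.

```rust
// Illustrative sketch only. None of these names are part of arrow-avro.

/// Hypothetical row for an assumed schema: { id: long, name: string }.
struct Row<'a> {
    id: i64,
    name: &'a str,
}

/// Avro `long`: zigzag-map the signed value, then emit a base-128 varint.
fn write_long(value: i64, out: &mut Vec<u8>) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    while n >= 0x80 {
        out.push(((n as u8) & 0x7F) | 0x80); // low 7 bits, continuation bit set
        n >>= 7;
    }
    out.push(n as u8);
}

/// Avro `string`: byte length as a `long`, followed by the UTF-8 bytes.
fn write_string(value: &str, out: &mut Vec<u8>) {
    write_long(value.len() as i64, out);
    out.extend_from_slice(value.as_bytes());
}

/// One contiguous buffer plus row boundaries.
/// Row `i` occupies `data[offsets[i]..offsets[i + 1]]`.
struct EncodedRowsSketch {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl EncodedRowsSketch {
    /// Number of encoded rows (one fewer than the number of offsets).
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow the bytes of row `i` without copying.
    fn row(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Encode each row as an unframed record body (no SOE prefix, no OCF header)
/// and record the end offset of every row.
fn encode_rows(rows: &[Row<'_>]) -> EncodedRowsSketch {
    let mut data = Vec::new();
    let mut offsets = vec![0];
    for row in rows {
        write_long(row.id, &mut data);
        write_string(row.name, &mut data);
        offsets.push(data.len());
    }
    EncodedRowsSketch { data, offsets }
}

fn main() {
    let rows = [Row { id: 1, name: "a" }, Row { id: -2, name: "bc" }];
    let encoded = encode_rows(&rows);
    for i in 0..encoded.len() {
        println!("row {i}: {:02x?}", encoded.row(i));
    }
}
```

The sketch borrows row slices from a `Vec<u8>`; the PR describes `EncodedRows` as exposing row slices via `Bytes` instead, which allows a slice to be handed to another owner without copying. An empty input yields a single offset (`0`) and zero rows, which mirrors the empty-batch case listed in the tests above.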
