jecsand838 opened a new issue, #8703:
URL: https://github.com/apache/arrow-rs/issues/8703
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Many Avro producers and schema registries are rolling out **Apache Avro 1.12.0** features, and we want `arrow-avro` to interoperate cleanly with these datasets. [Avro 1.12.0](https://avro.apache.org/docs/1.12.0/specification/) formally introduces and/or extends several logical types and container-file behaviors that matter for Arrow:

* [**Nanosecond timestamps**](https://avro.apache.org/docs/1.12.0/specification/#timestamps) in both global and local variants: `timestamp-nanos` and `local-timestamp-nanos`.
* [**UUID**](https://avro.apache.org/docs/1.12.0/specification/#uuid) can annotate **either** `string` **or** `fixed(16)`.
* [**Big Decimal**](https://avro.apache.org/docs/1.12.0/specification/#decimal) (`big-decimal`) as an alternative decimal logical type on **bytes**, where the **scale is stored per value** (not in the schema).
* **Object Container File optional codecs** include **zstandard**, **xz**, and **bzip2** (`null`/`deflate` remain required; `snappy` remains optional, with its CRC behavior).

Today, `arrow-avro` already exposes codecs & APIs for many of these, but gaps remain for complete 1.12.0 coverage:

* The `Codec` enum already includes **`TimestampNanos(bool)`** and **`Uuid`** variants, so decoding pathways exist; however, Avro 1.12's *dual representation* of UUID (string or fixed(16)) needs careful mapping to Arrow (and to the canonical Arrow UUID extension) without breaking users.
* The crate documents feature-gated support for **zstd/xz/bzip2/snappy/deflate** in OCF, but we lack 1.12.0-specific round-trip tests that validate OCF **per-block** codec handling against the 1.12 spec.
* **`big-decimal`** is new in 1.12 and currently **not modeled** in `arrow-avro`. It requires a safe Arrow mapping strategy, because Arrow decimals fix `(precision, scale)` in the schema while Avro `big-decimal` stores the **scale per value**.
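To make the dual-representation concern concrete, here is a minimal, illustrative sketch (not code from the crate; the function name is hypothetical) of normalizing the RFC 4122 string form into the 16 raw bytes that a `fixed(16)` UUID, or an Arrow `FixedSizeBinary(16)` column, would hold:

```rust
/// Parse an RFC 4122 textual UUID (hex groups "8-4-4-4-12") into 16 raw
/// bytes. Returns `None` on malformed input. Illustrative sketch only;
/// a real reader would also validate hyphen positions.
fn uuid_str_to_fixed16(s: &str) -> Option<[u8; 16]> {
    // Collect the 32 hex digits, rejecting any non-hex character.
    let digits: Vec<u32> = s
        .chars()
        .filter(|c| *c != '-')
        .map(|c| c.to_digit(16))
        .collect::<Option<Vec<u32>>>()?;
    if digits.len() != 32 {
        return None;
    }
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = ((digits[2 * i] << 4) | digits[2 * i + 1]) as u8;
    }
    Some(out)
}
```

Once both Avro encodings are normalized to these 16 bytes, downstream consumers see a single Arrow representation regardless of which spelling the producer chose.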
**Describe the solution you'd like**

Implement **full Avro 1.12.0** support in `arrow-avro` and gate any **breaking observable changes** behind a new Cargo feature flag: `avro_1_12` (default **off**).

1. **New Cargo feature flag**
   * **Feature:** `avro_1_12` (default off).
   * **Purpose:** put *breaking-observable* behavior behind an opt-in flag so existing users see no change by default.

2. **Logical types**

   2.1 **UUID (reader & writer)**
   * **Spec:** Avro 1.12 allows `uuid` on `string` **or** `fixed(16)`; both must conform to RFC 4122.
   * **Current:** `arrow-avro` has a `Codec::Uuid`, and Arrow provides a **canonical UUID extension** (`arrow.uuid`) with storage `FixedSizeBinary(16)`.
   * **Reader behavior (flag-gated):**
     * **With `avro_1_12` ON:** map **both** Avro UUID encodings (`string+uuid` and `fixed(16)+uuid`) to Arrow **`FixedSizeBinary(16)`** and, when `arrow_schema` extension types are enabled, attach the **`arrow.uuid`** extension. This unifies the UUID representation for consumers.
     * **With `avro_1_12` OFF (default):** preserve existing behavior (i.e., if current readers produce `Utf8` for `string+uuid`, keep doing so), so schemas do not change for users who aren't ready to switch.
   * **Writer behavior (non-breaking API addition):**
     * Add a configuration option to choose how to **emit** Avro UUID, for example:

       ```rust
       // behind feature "avro_1_12"
       #[derive(Debug, Clone, Copy)]
       pub enum UuidEncoding {
           String,
           Fixed16,
       }

       impl WriterBuilder {
           #[cfg(feature = "avro_1_12")]
           pub fn with_uuid_encoding(mut self, e: UuidEncoding) -> Self {
               // store the chosen encoding on the builder
               self
           }
       }
       ```

     * The default remains the current behavior (emit `{"type":"string","logicalType":"uuid"}` when the Arrow field carries the `arrow.uuid` extension).
     * If `UuidEncoding::Fixed16` is chosen, emit:

       ```json
       { "type": "fixed", "size": 16, "logicalType": "uuid" }
       ```

   2.2 **`big-decimal`**
   * **Spec:** `big-decimal` is an alternative *decimal* logical type restricted to `bytes`, where the **scale is stored in each value**, not in the schema.
   * **Reader default mapping (non-breaking):**
     * Add `Codec::BigDecimal` and **map to Arrow `Binary`** by default, because Arrow `Decimal{32,64,128,256}` requires a **fixed** `(precision, scale)` in the schema, which `big-decimal` does not provide. The spec also states that implementations **must ignore unknown/invalid logical types** and fall back to the underlying Avro type; representing `big-decimal` as bytes is the safe, lossless default.
   * **Writer support:**
     * If the Arrow field is a `Decimal{32,64,128,256}`, add an **opt-in** builder switch to **write** Avro `big-decimal` by taking the Arrow value's unscaled integer and **embedding the (constant) Arrow scale** into each row's Avro `bytes` payload per the Avro 1.12 encoding (the scale is stored in the value itself). The default remains not to emit `big-decimal` unless the user opts in or provides an Avro schema hint.

3. **Public API additions**
   * **Feature-gated:** `WriterBuilder::with_uuid_encoding(UuidEncoding)` (default `String`).
   * **Non-breaking additions:** `Codec::BigDecimal` and schema detection for `{"type": "bytes", "logicalType": "big-decimal"}`.

**Describe alternatives you've considered**

* **Map `big-decimal` directly to Arrow `Decimal*`:** unsafe without a **fixed** scale in the schema. Avro 1.12 explicitly puts the **scale in each value**, which conflicts with Arrow's schema-fixed decimal semantics. Defaulting to `Binary` is correct and spec-compliant.
* **No feature flag:** unifying the UUID mapping changes observable Arrow schemas for some users. Gating behind `avro_1_12` avoids unintended breakage and aligns with our stability goals.
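To ground the reader/writer discussion above, here is a hedged round-trip sketch of the `big-decimal` byte layout, assuming the encoding used by the Avro Java reference implementation (a length-prefixed two's-complement unscaled value followed by the scale as a zigzag-encoded `int`); all function names are illustrative, not existing `arrow-avro` APIs:

```rust
/// Zigzag/varint-encode a signed value (Avro `int`/`long` wire format).
fn write_zigzag(v: i64, out: &mut Vec<u8>) {
    let mut n = ((v << 1) ^ (v >> 63)) as u64;
    loop {
        let mut b = (n & 0x7f) as u8;
        n >>= 7;
        if n != 0 {
            b |= 0x80;
        }
        out.push(b);
        if n == 0 {
            break;
        }
    }
}

/// Zigzag/varint-decode a signed value, advancing `pos`.
fn read_zigzag(buf: &[u8], pos: &mut usize) -> i64 {
    let mut value: u64 = 0;
    let mut shift = 0;
    loop {
        let b = buf[*pos];
        *pos += 1;
        value |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}

/// Big-endian two's-complement bytes of an i128, trimmed of redundant
/// sign-extension bytes (at least one byte is always kept).
fn unscaled_bytes(v: i128) -> Vec<u8> {
    let full = v.to_be_bytes();
    let mut start = 0;
    while start < 15
        && ((full[start] == 0x00 && full[start + 1] & 0x80 == 0)
            || (full[start] == 0xff && full[start + 1] & 0x80 != 0))
    {
        start += 1;
    }
    full[start..].to_vec()
}

/// Writer side: one Decimal128 row -> `big-decimal` payload
/// (length-prefixed unscaled bytes, then the constant Arrow scale).
fn encode_big_decimal(unscaled: i128, scale: i32) -> Vec<u8> {
    let mag = unscaled_bytes(unscaled);
    let mut out = Vec::new();
    write_zigzag(mag.len() as i64, &mut out);
    out.extend_from_slice(&mag);
    write_zigzag(scale as i64, &mut out);
    out
}

/// Reader side: split a payload back into (unscaled bytes, scale).
fn decode_big_decimal(payload: &[u8]) -> (Vec<u8>, i32) {
    let mut pos = 0;
    let len = read_zigzag(payload, &mut pos) as usize;
    let unscaled = payload[pos..pos + len].to_vec();
    pos += len;
    let scale = read_zigzag(payload, &mut pos) as i32;
    (unscaled, scale)
}
```

Because the scale travels inside every payload, the reader cannot promise a single Arrow `(precision, scale)` up front, which is why the default reader mapping proposed above is `Binary` rather than a `Decimal*` type.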
**Additional context**

* **Backwards compatibility**
  * **Feature OFF (default):** no observable behavior change for existing users (UUID strings stay as currently produced; `big-decimal` is read as bytes; timestamps are unchanged).
  * **Feature ON:** the reader unifies the UUID mapping to Arrow `FixedSizeBinary(16)` (plus the `arrow.uuid` extension when enabled), which can change schemas for some users (hence the opt-in flag).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
