jecsand838 opened a new issue, #8703:
URL: https://github.com/apache/arrow-rs/issues/8703
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Many Avro producers and schema registries are rolling out **Apache Avro 1.12.0** features, and we want `arrow-avro` to interoperate cleanly with these datasets. [Avro 1.12.0](https://avro.apache.org/docs/1.12.0/specification/) formally introduces and/or extends several logical types and container-file behaviors that matter for Arrow:

* [**Nanosecond timestamps**](https://avro.apache.org/docs/1.12.0/specification/#timestamps) in both global and local variants: `timestamp-nanos` and `local-timestamp-nanos`.
* [**UUID**](https://avro.apache.org/docs/1.12.0/specification/#uuid) can annotate **either** `string` **or** `fixed(16)`.
* [**Big Decimal**](https://avro.apache.org/docs/1.12.0/specification/#decimal) (`big-decimal`) as an alternative decimal logical type on **bytes**, where the **scale is stored per value** (not in the schema).
* **Object Container File optional codecs** include **zstandard**, **xz**, and **bzip2** (`null`/`deflate` remain required; `snappy` remains optional, with its CRC behavior).

Today, `arrow-avro` already exposes codecs & APIs for many of these, but gaps remain for complete 1.12.0 coverage:

* The `Codec` enum already includes **`TimestampNanos(bool)`** and **`Uuid`** variants, so decoding pathways exist; however, Avro 1.12's *dual representation* of UUID (string or fixed(16)) needs careful mapping to Arrow (and to the canonical Arrow UUID extension) without breaking users.
* The crate documents feature-gated support for **zstd/xz/bzip2/snappy/deflate** in OCF, but we lack 1.12.0-specific round-trip tests that validate OCF **per-block** codec handling against the 1.12 spec.
* **`big-decimal`** is new in 1.12 and currently **not modeled** in `arrow-avro`. It requires a safe Arrow mapping strategy, because Arrow decimals fix `(precision, scale)` in the schema while Avro `big-decimal` stores the **scale per value**.
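To make the dual-representation concern concrete, here is a minimal, illustrative sketch (not code from the crate; the function name is hypothetical) of normalizing the RFC 4122 string form into the 16 raw bytes that a `fixed(16)` UUID, or an Arrow `FixedSizeBinary(16)` column, would hold:

```rust
/// Parse an RFC 4122 textual UUID (hex groups "8-4-4-4-12") into 16 raw
/// bytes. Returns `None` on malformed input. Illustrative sketch only;
/// a real reader would also validate hyphen positions.
fn uuid_str_to_fixed16(s: &str) -> Option<[u8; 16]> {
    // Collect the 32 hex digits, rejecting any non-hex character.
    let digits: Vec<u32> = s
        .chars()
        .filter(|c| *c != '-')
        .map(|c| c.to_digit(16))
        .collect::<Option<Vec<u32>>>()?;
    if digits.len() != 32 {
        return None;
    }
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = ((digits[2 * i] << 4) | digits[2 * i + 1]) as u8;
    }
    Some(out)
}
```

Once both Avro encodings are normalized to these 16 bytes, downstream consumers see a single Arrow representation regardless of which spelling the producer chose.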
**Describe the solution you'd like**

Implement **full Avro 1.12.0** support in `arrow-avro` and gate any **breaking observable changes** behind a new Cargo feature flag: `avro_1_12` (default **off**).

1. **New Cargo feature flag**
   * **Feature:** `avro_1_12` (default off).
   * **Purpose:** put *breaking-observable* behavior behind an opt-in flag so existing users see no change by default.

2. **Logical types**

   2.1 **UUID (reader & writer)**
   * **Spec:** Avro 1.12 allows `uuid` on `string` **or** `fixed(16)`; both must conform to RFC 4122.
   * **Current:** `arrow-avro` has a `Codec::Uuid`, and Arrow provides a **canonical UUID extension** (`arrow.uuid`) with storage `FixedSizeBinary(16)`.
   * **Reader behavior (flag-gated):**
     * **With `avro_1_12` ON:** map **both** Avro UUID encodings (`string+uuid` and `fixed(16)+uuid`) to Arrow **`FixedSizeBinary(16)`** and, when `arrow_schema` extension types are enabled, attach the **`arrow.uuid`** extension. This unifies the UUID representation for consumers.
     * **With `avro_1_12` OFF (default):** preserve existing behavior (i.e., if current readers produce `Utf8` for `string+uuid`, keep doing so), so schemas do not change for users who aren't ready to switch.
   * **Writer behavior (non-breaking API addition):**
     * Add a configuration option to choose how to **emit** Avro UUID, for example:

       ```rust
       // behind feature "avro_1_12"
       #[derive(Debug, Clone, Copy)]
       pub enum UuidEncoding {
           String,
           Fixed16,
       }

       impl WriterBuilder {
           #[cfg(feature = "avro_1_12")]
           pub fn with_uuid_encoding(mut self, e: UuidEncoding) -> Self {
               // store the chosen encoding on the builder
               self
           }
       }
       ```

     * The default remains the current behavior (emit `{"type":"string","logicalType":"uuid"}` when the Arrow field carries the `arrow.uuid` extension).
     * If `UuidEncoding::Fixed16` is chosen, emit:

       ```json
       { "type": "fixed", "size": 16, "logicalType": "uuid" }
       ```

   2.2 **`big-decimal`**
   * **Spec:** `big-decimal` is an alternative *decimal* logical type restricted to `bytes`, where the **scale is stored in each value**, not in the schema.
   * **Reader default mapping (non-breaking):**
     * Add `Codec::BigDecimal` and **map to Arrow `Binary`** by default, because Arrow `Decimal{32,64,128,256}` requires a **fixed** `(precision, scale)` in the schema, which `big-decimal` does not provide. The spec also states that implementations **must ignore unknown/invalid logical types** and fall back to the underlying Avro type; representing `big-decimal` as bytes is the safe, lossless default.
   * **Writer support:**
     * If the Arrow field is a `Decimal{32,64,128,256}`, add an **opt-in** builder switch to **write** Avro `big-decimal` by taking the Arrow value's unscaled integer and **embedding the (constant) Arrow scale** into each row's Avro `bytes` payload per the Avro 1.12 encoding (the scale is stored in the value itself). The default remains not to emit `big-decimal` unless the user opts in or provides an Avro schema hint.

3. **Public API additions**
   * **Feature-gated:** `WriterBuilder::with_uuid_encoding(UuidEncoding)` (default `String`).
   * **Non-breaking additions:** `Codec::BigDecimal` and schema detection for `{"type": "bytes", "logicalType": "big-decimal"}`.

**Describe alternatives you've considered**

* **Map `big-decimal` directly to Arrow `Decimal*`:** unsafe without a **fixed** scale in the schema. Avro 1.12 explicitly puts the **scale in each value**, which conflicts with Arrow's schema-fixed decimal semantics. Defaulting to `Binary` is correct and spec-compliant.
* **No feature flag:** unifying the UUID mapping changes observable Arrow schemas for some users. Gating behind `avro_1_12` avoids unintended breakage and aligns with our stability goals.
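To ground the reader/writer discussion above, here is a hedged round-trip sketch of the `big-decimal` byte layout, assuming the encoding used by the Avro Java reference implementation (a length-prefixed two's-complement unscaled value followed by the scale as a zigzag-encoded `int`); all function names are illustrative, not existing `arrow-avro` APIs:

```rust
/// Zigzag/varint-encode a signed value (Avro `int`/`long` wire format).
fn write_zigzag(v: i64, out: &mut Vec<u8>) {
    let mut n = ((v << 1) ^ (v >> 63)) as u64;
    loop {
        let mut b = (n & 0x7f) as u8;
        n >>= 7;
        if n != 0 {
            b |= 0x80;
        }
        out.push(b);
        if n == 0 {
            break;
        }
    }
}

/// Zigzag/varint-decode a signed value, advancing `pos`.
fn read_zigzag(buf: &[u8], pos: &mut usize) -> i64 {
    let mut value: u64 = 0;
    let mut shift = 0;
    loop {
        let b = buf[*pos];
        *pos += 1;
        value |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}

/// Big-endian two's-complement bytes of an i128, trimmed of redundant
/// sign-extension bytes (at least one byte is always kept).
fn unscaled_bytes(v: i128) -> Vec<u8> {
    let full = v.to_be_bytes();
    let mut start = 0;
    while start < 15
        && ((full[start] == 0x00 && full[start + 1] & 0x80 == 0)
            || (full[start] == 0xff && full[start + 1] & 0x80 != 0))
    {
        start += 1;
    }
    full[start..].to_vec()
}

/// Writer side: one Decimal128 row -> `big-decimal` payload
/// (length-prefixed unscaled bytes, then the constant Arrow scale).
fn encode_big_decimal(unscaled: i128, scale: i32) -> Vec<u8> {
    let mag = unscaled_bytes(unscaled);
    let mut out = Vec::new();
    write_zigzag(mag.len() as i64, &mut out);
    out.extend_from_slice(&mag);
    write_zigzag(scale as i64, &mut out);
    out
}

/// Reader side: split a payload back into (unscaled bytes, scale).
fn decode_big_decimal(payload: &[u8]) -> (Vec<u8>, i32) {
    let mut pos = 0;
    let len = read_zigzag(payload, &mut pos) as usize;
    let unscaled = payload[pos..pos + len].to_vec();
    pos += len;
    let scale = read_zigzag(payload, &mut pos) as i32;
    (unscaled, scale)
}
```

Because the scale travels inside every payload, the reader cannot promise a single Arrow `(precision, scale)` up front, which is why the default reader mapping proposed above is `Binary` rather than a `Decimal*` type.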
**Additional context**

* **Backwards compatibility**
  * **Feature OFF (default):** no observable behavior change for existing users (UUID strings stay as currently produced; `big-decimal` is read as bytes; timestamps are unchanged).
  * **Feature ON:** the reader unifies the UUID mapping to Arrow `FixedSizeBinary(16)` (plus the `arrow.uuid` extension when enabled), which can change schemas for some users (hence the opt-in flag).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
