jecsand838 opened a new pull request, #8292:
URL: https://github.com/apache/arrow-rs/pull/8292
# Which issue does this PR close?
This work continues arrow-avro schema resolution support and aligns behavior
with the Avro spec.
- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the
reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8124 (schema resolution & type promotion for the
decoder), #8223 (enum mapping for schema resolution). These previous efforts
established the foundations that this PR extends to default values and
additional resolvable types.
# Rationale for this change
Avro’s **schema resolution** requires readers to reconcile differences
between the writer and reader schemas, including:
- Using record-field **default values** when the writer lacks a field
present in the reader; defaults must be type-correct (e.g., a union default
must match the first union member; bytes/fixed defaults are JSON strings).
- Recursively resolving **arrays** (by item schema) and **maps** (by value
schema).
- Resolving **fixed** types (size and unqualified name must match) and
erroring when they do not.
Prior to this change, arrow-avro’s resolution handled some cases but lacked
full Codec support for **default values** and for resolving **array/map/fixed**
shapes between writer and reader. This led to gaps when reading evolved data or
datasets produced by heterogeneous systems. This PR implements these missing
pieces so the Arrow reader behaves per the spec in common evolution scenarios.
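
To make the default-value rule concrete, here is a minimal sketch of how a reader record could be planned against a writer record. It is not the arrow-avro implementation: `ReaderField`, `FieldPlan`, and `resolve_record` are hypothetical names invented for this illustration, and defaults are kept as raw JSON text rather than parsed literals.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a reader-side record field.
#[derive(Debug, Clone, PartialEq)]
pub struct ReaderField {
    pub name: String,
    /// Default as written in the reader schema (raw JSON text), if any.
    pub default: Option<String>,
}

/// How one reader field gets its value while decoding.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldPlan {
    /// Decode from the writer's data at this writer field index.
    FromWriter(usize),
    /// The writer lacks the field, so fill in the reader's default.
    UseDefault(String),
}

/// Returns one plan per reader field plus the indexes of writer-only
/// fields, which still have to be skipped in the encoded data.
pub fn resolve_record(
    writer_fields: &[&str],
    reader_fields: &[ReaderField],
) -> Result<(Vec<FieldPlan>, Vec<usize>), String> {
    let writer_index: HashMap<&str, usize> = writer_fields
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, i))
        .collect();

    let mut plans = Vec::with_capacity(reader_fields.len());
    for field in reader_fields {
        match (writer_index.get(field.name.as_str()), &field.default) {
            (Some(&idx), _) => plans.push(FieldPlan::FromWriter(idx)),
            (None, Some(d)) => plans.push(FieldPlan::UseDefault(d.clone())),
            (None, None) => {
                return Err(format!(
                    "reader field '{}' is absent from the writer and has no default",
                    field.name
                ))
            }
        }
    }

    let skipped: Vec<usize> = writer_fields
        .iter()
        .enumerate()
        .filter(|(_, w)| !reader_fields.iter().any(|r| r.name == **w))
        .map(|(i, _)| i)
        .collect();

    Ok((plans, skipped))
}
```

With a writer that has fields `id` and `legacy` and a reader that has `id` plus `tag` (with a default), this sketch reads `id` from the writer, fills `tag` from its default, and marks `legacy` as a field to skip.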
# What changes are included in this PR?
This PR modifies **`arrow-avro/src/codec.rs`** to extend the
schema-resolution path with:
- **Default value handling** for record fields
  - Reads and applies default values when the reader expects a field absent
    from the writer, including **nested defaults**.
  - Validates defaults per the Avro spec (e.g., union defaults match the
    first schema; bytes/fixed defaults are JSON strings).
- **Array / Map / Fixed schema resolution**
  - **Array**: recursively resolves item schemas (writer↔reader).
  - **Map**: recursively resolves value schemas.
  - **Fixed**: enforces matching size and (unqualified) name; otherwise
    signals an error, consistent with the spec.
- **Codec updates**
  - Refactors internal codec logic to support the above during decoding,
    including resolution for **record fields** and **nested defaults**. (See
    commit message for the high-level summary.)
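
The array, map, and fixed rules above can be sketched as a small recursive function. This is an illustration under simplifying assumptions, not the code in `codec.rs`: the `Schema` enum, `unqualified`, and `resolve` are hypothetical, only `int` to `long` promotion is modeled, and aliases are ignored.

```rust
/// Hypothetical, heavily reduced schema model for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Schema {
    Int,
    Long,
    Array(Box<Schema>),
    Map(Box<Schema>),
    Fixed { name: String, size: usize },
}

/// Strips any namespace prefix, e.g. "com.example.Md5" becomes "Md5".
fn unqualified(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

/// Resolves a writer schema against a reader schema and returns the
/// schema the decoded data should have, or an error if they do not match.
pub fn resolve(writer: &Schema, reader: &Schema) -> Result<Schema, String> {
    match (writer, reader) {
        (Schema::Int, Schema::Int) | (Schema::Long, Schema::Long) => Ok(reader.clone()),
        // Promotion: a writer `int` may be read as a reader `long`.
        (Schema::Int, Schema::Long) => Ok(Schema::Long),
        // Arrays resolve recursively by item schema.
        (Schema::Array(w), Schema::Array(r)) => Ok(Schema::Array(Box::new(resolve(w, r)?))),
        // Maps resolve recursively by value schema.
        (Schema::Map(w), Schema::Map(r)) => Ok(Schema::Map(Box::new(resolve(w, r)?))),
        // Fixed requires equal size and equal unqualified name.
        (Schema::Fixed { name: wn, size: ws }, Schema::Fixed { name: rn, size: rs }) => {
            if ws == rs && unqualified(wn) == unqualified(rn) {
                Ok(reader.clone())
            } else {
                Err(format!(
                    "fixed mismatch: writer {}({}) vs reader {}({})",
                    wn, ws, rn, rs
                ))
            }
        }
        _ => Err(format!("cannot resolve writer {:?} against reader {:?}", writer, reader)),
    }
}
```

In this sketch an `array<int>` written and read as `array<long>` resolves to `array<long>`, while two `fixed` schemas with the same name but different sizes produce an error.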
# Are these changes tested?
**Yes.** This PR includes new unit tests in `arrow-avro/src/codec.rs`
covering:
1) **Default validation & persistence**
   - `Null`/union-nullability rules; metadata persistence of defaults
     (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
2) **`AvroLiteral` parsing**
   - Range checks for `i32`/`f32`; correct literals for `i64`/`f64`;
     `Utf8`/`Utf8View`; `uuid` strings (RFC 4122).
   - Byte-range mapping for `bytes`/`fixed` defaults; `Fixed(n)` length
     enforcement; `decimal` on `fixed` vs `bytes`; `duration`/interval fixed
     **12**-byte enforcement.
3) **Collections & records**
   - Array/map defaults shape; enum symbol validity; record defaults for
     missing fields, required-field errors, and honoring field-level defaults;
     skip-fields retained for writer-only fields.
4) **Resolution mechanics**
   - Element **promotion** (`int` to `long`) for arrays; **reader metadata
     precedence** for colliding attributes; `fixed` name/size match including
     **alias**.
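
As a rough illustration of the literal checks these tests exercise, the sketch below maps a JSON-string default to bytes (each code point must fall in 0..=255, per the Avro spec), enforces a `fixed` length, and range-checks an `int` default. The function names are hypothetical and do not correspond to the `AvroLiteral` parsing code in this PR.

```rust
use std::convert::TryFrom;

/// Maps a JSON-string default to raw bytes. Each character must be a
/// code point in U+0000..=U+00FF, which becomes the byte of equal value.
pub fn bytes_default(s: &str) -> Result<Vec<u8>, String> {
    s.chars()
        .map(|c| {
            u8::try_from(u32::from(c))
                .map_err(|_| format!("code point U+{:04X} is outside 0..=255", u32::from(c)))
        })
        .collect()
}

/// Same mapping as `bytes_default`, plus the exact-length check that a
/// `fixed` default needs (for example, 12 bytes for a `duration`).
pub fn fixed_default(s: &str, size: usize) -> Result<Vec<u8>, String> {
    let bytes = bytes_default(s)?;
    if bytes.len() != size {
        return Err(format!(
            "fixed default has {} bytes, expected {}",
            bytes.len(),
            size
        ));
    }
    Ok(bytes)
}

/// Range check for an Avro `int` default parsed from a JSON number.
pub fn int_default(v: i64) -> Result<i32, String> {
    i32::try_from(v).map_err(|_| format!("{} does not fit in an Avro int", v))
}
```

Under these assumptions, `fixed_default("abc", 2)` is rejected for its length, and a default containing a character above U+00FF is rejected before any length check runs.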
# Are there any user-facing changes?
N/A