jecsand838 opened a new pull request, #8492:
URL: https://github.com/apache/arrow-rs/pull/8492
# Which issue does this PR close?
- **Related to**: #4886 (“Add Avro Support”)
# Rationale for this change
**NOTE:** This PR contains over **2300 lines of test code**. The actual
production code diff is **less than 800 LOC**.
Before we publish `arrow-avro`, we want to minimize its public API surface
and ship a well‑tested, spec‑compliant implementation. In the process of adding
extensive regression tests and canonical‑form checks, we found several
correctness gaps around alias handling, union resolution, Unicode/name
validation, list child nullability, “null” string handling, and a mis-wired
`Writer` capacity setting. This PR tightens the API and fixes those issues to
align with the Avro spec (aliases and defaults, union resolution, names and
Unicode, etc.).
# What changes are included in this PR?
**Public API tightening**
- Restrict visibility of numerous schema/codec types and functions within
`arrow-avro` so that only the intended entry points are exposed before the
crate is published.
**Bug fixes discovered via regression testing (All fixed)**
1. **Alias bugs (aliases without defaults / restrictive namespaces)**
- Enforce spec‑compliant alias resolution: aliases may be fully‑qualified
or relative to the reader’s namespace, and alias‑based rewrites still require
reader defaults when the writer field is absent. This follows Avro’s alias
rules and record‑field default behavior.
2. **Special‑case union resolution (writer not a union, reader is)**
- When the writer schema is **not** a union but the reader is, we no
longer attempt to decode a union `type_id`; per spec, the reader must pick the
first union branch that matches the writer’s schema.
3. **Valid Avro Unicode characters & name rules in Schema**
- Distinguish between *Unicode strings* (which may contain any valid
UTF‑8) and *identifiers* (names/enum symbols) which must match
`[A-Za-z_][A-Za-z0-9_]*`. Tests were added to accept valid Unicode string
content while enforcing the ASCII identifier regex.
4. **Nullable `ListArray` child item bug**
- Correct mapping of Avro array item nullability to Arrow `ListArray`’s
inner `"item"` field. (By convention the inner field is named `"item"` and
nullability is explicit.) This aligns with Arrow’s builder/typing docs.
5. **“null” string vs. hard `null`**
- Fix default/value handling to differentiate JSON `null` from the string
literal `"null"` per the Avro defaults table.
6. **`Writer` capacity knob wired up**
- Plumb the provided capacity through the writer implementation so the
requested preallocation is actually applied. (See changes under
`arrow-avro/src/writer/mod.rs`.)
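
To illustrate the name-matching half of item 1, here is a minimal, std-only sketch of resolving aliases that are either fully qualified or relative to the reader's namespace. The function names `qualify` and `reader_accepts_writer_name` are hypothetical and are not taken from `arrow-avro`; the sketch does not cover the separate requirement that absent writer fields need reader defaults.

```rust
/// Qualifies `name` against `namespace` unless it is already a full name
/// (contains a dot) or there is no namespace to apply.
/// Hypothetical helper for illustration only.
fn qualify(name: &str, namespace: Option<&str>) -> String {
    match namespace {
        Some(ns) if !ns.is_empty() && !name.contains('.') => format!("{ns}.{name}"),
        _ => name.to_string(),
    }
}

/// Returns true if the writer's full name equals the reader's full name or
/// any reader alias. Relative aliases inherit the reader's namespace;
/// fully-qualified aliases are compared as written.
fn reader_accepts_writer_name(
    reader_name: &str,
    reader_namespace: Option<&str>,
    reader_aliases: &[&str],
    writer_fullname: &str,
) -> bool {
    std::iter::once(reader_name)
        .chain(reader_aliases.iter().copied())
        .any(|candidate| qualify(candidate, reader_namespace) == writer_fullname)
}
```

With a reader named `User` in namespace `com.example`, the relative alias `Person` would match a writer type `com.example.Person`, while matching `org.legacy.Person` would require the fully-qualified alias.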
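
The rule in item 2 can be sketched over a simplified set of primitive types. This is an illustration under stated assumptions, not the `arrow-avro` implementation: the `Prim` enum and both function names are hypothetical, only numeric promotions are modeled, and the sketch follows the spec wording of "first matching branch" (some implementations prefer an exact match when one exists).

```rust
/// Simplified stand-in for Avro primitive schemas (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Prim {
    Null,
    Int,
    Long,
    Float,
    Double,
    Str,
}

/// Returns true if data written as `writer` can be read as `reader`,
/// either because the types are identical or via a numeric promotion
/// (int -> long/float/double, long -> float/double, float -> double).
fn schema_matches(writer: Prim, reader: Prim) -> bool {
    use Prim::*;
    writer == reader
        || matches!(
            (writer, reader),
            (Int, Long) | (Int, Float) | (Int, Double)
                | (Long, Float) | (Long, Double)
                | (Float, Double)
        )
}

/// Writer is not a union, reader is: return the index of the first reader
/// branch that matches the writer schema, or `None` if no branch matches.
/// No union branch index is read from the encoded data in this case.
fn resolve_into_reader_union(writer: Prim, reader_branches: &[Prim]) -> Option<usize> {
    reader_branches
        .iter()
        .position(|&branch| schema_matches(writer, branch))
}
```

Note that the branch index comes entirely from the schemas. A non-union writer never encoded one, which is why attempting to decode a `type_id` in this path was a bug.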
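
For item 3, the identifier rule `[A-Za-z_][A-Za-z0-9_]*` can be checked without a regex dependency. The sketch below is illustrative only; `is_valid_avro_identifier` is a hypothetical name, not a function exported by `arrow-avro`.

```rust
/// Returns true if `name` matches `[A-Za-z_][A-Za-z0-9_]*`.
/// Applies to names and enum symbols, not to string values, which may
/// contain any valid UTF-8.
fn is_valid_avro_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    // The first character must be an ASCII letter or underscore;
    // an empty name is rejected here as well.
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    // Every remaining character must be an ASCII letter, digit, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

Under this check `user_id` and `_tmp1` are accepted, while `9lives`, `naïve`, and the empty string are rejected. A string field's value such as `"naïve"` remains valid because it is content, not an identifier.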
# Are these changes tested?
Yes. This PR adds substantial regression coverage:
- Canonical‑form checks for schemas.
- Alias/namespace + default‑value resolution cases.
- Reader‑union vs. writer‑non‑union decoding paths.
- Unicode content vs. identifier name rules.
- `ListArray` inner field nullability behavior.
- Round‑trips exercising the `Writer` with the capacity knob set.
A new, comprehensive Avro fixture (`test/data/comprehensive_e2e.avro`) is
included to drive end‑to‑end scenarios and edge cases.
# Are there any user-facing changes?
N/A