[PR] Regression Testing, Bug Fixes, and Public API Tightening for arrow-avro [arrow-rs]

via GitHub Sat, 18 Oct 2025 04:46:35 -0700


jecsand838 opened a new pull request, #8492:
URL: https://github.com/apache/arrow-rs/pull/8492


   # Which issue does this PR close?
   
   - **Related to**: #4886 (“Add Avro Support”)
   
   # Rationale for this change
   
   **NOTE:** This PR contains over **2300 lines of test code**. The actual 
production code diff is **less than 800 LOC**.
   
   Before we publish `arrow-avro`, we want to minimize its public API surface 
and ship a well‑tested, spec‑compliant implementation. While adding 
extensive regression tests and canonical‑form checks, we found several 
correctness gaps around alias handling, union resolution, Unicode/name 
validation, list child nullability, “null” string handling, and a mis-wired 
`Writer` capacity setting. This PR tightens the API and fixes those issues to 
align with the Avro spec (aliases and defaults, union resolution, names and 
Unicode, etc.).
   
   # What changes are included in this PR?
   
   **Public API tightening**
   - Restrict visibility of numerous schema/codec types and functions within 
`arrow-avro` so that only the intended entry points are exposed before the 
crate is published.
   
   **Bug fixes discovered via regression testing (all fixed)**
   1. **Alias bugs (aliases without defaults / restrictive namespaces)**  
      - Enforce spec‑compliant alias resolution: aliases may be fully‑qualified 
or relative to the reader’s namespace, and alias‑based rewrites still require 
reader defaults when the writer field is absent. This follows Avro’s alias 
rules and record‑field default behavior.
   2. **Special‑case union resolution (writer not a union, reader is)**  
      - When the writer schema is **not** a union but the reader is, we no 
longer attempt to decode a union `type_id`; per spec, the reader must pick the 
first union branch that matches the writer’s schema.
   3. **Valid Avro Unicode characters & name rules in Schema**  
      - Distinguish between *Unicode strings* (which may contain any valid 
UTF‑8) and *identifiers* (names/enum symbols) which must match 
`[A-Za-z_][A-Za-z0-9_]*`. Tests were added to accept valid Unicode string 
content while enforcing the ASCII identifier regex.
   4. **Nullable `ListArray` child item bug**  
      - Correct mapping of Avro array item nullability to Arrow `ListArray`’s 
inner `"item"` field. (By convention the inner field is named `"item"` and 
nullability is explicit.) This aligns with Arrow’s builder/typing docs.
   5. **“null” string vs. hard `null`**  
      - Fix default/value handling to differentiate JSON `null` from the string 
literal `"null"` per the Avro defaults table. 
   6. **`Writer` capacity knob wired up**  
      - Plumb the user‑provided capacity through the writer implementation so 
the configured preallocation size is actually used. (See changes under 
`arrow-avro/src/writer/mod.rs`.)
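
   To illustrate fix 2 above, here is a minimal sketch of the "writer is not a union, reader is" rule. The `Schema` enum and `resolve_non_union_writer` are hypothetical names invented for this example and are not the `arrow-avro` types. The sketch also assumes "matches" means exact equality; it does not model the spec's type promotions (for example `int` to `long`) or named-type matching.

```rust
/// Hypothetical, simplified schema model used only for illustration.
/// It is NOT the schema type used inside `arrow-avro`.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Int,
    Long,
    String,
}

/// Given a non-union writer schema and the branches of the reader's union,
/// return the index of the first reader branch that matches the writer.
///
/// No union `type_id` is read from the encoded data here: the writer did not
/// write one, so the branch is chosen purely from the two schemas.
/// Assumption: "matches" is plain equality (no promotions).
fn resolve_non_union_writer(writer: &Schema, reader_branches: &[Schema]) -> Option<usize> {
    reader_branches.iter().position(|branch| branch == writer)
}
```

   With reader branches `[Null, Long, String]`, a `String` writer resolves to index 2, while an `Int` writer resolves to `None` in this simplified model because promotion is not implemented.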
   
   # Are these changes tested?
   
   Yes. This PR adds substantial regression coverage:
   - Canonical‑form checks for schemas.
   - Alias/namespace + default‑value resolution cases.
   - Reader‑union vs. writer‑non‑union decoding paths.
   - Unicode content vs. identifier name rules.
   - `ListArray` inner field nullability behavior.
   - Round‑trips exercising the `Writer` with the capacity knob set.
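
   As a rough illustration of the identifier rule exercised by the "Unicode content vs. identifier name rules" tests, the check below hand-rolls the `[A-Za-z_][A-Za-z0-9_]*` pattern using only the standard library. The function name `is_valid_avro_name` is hypothetical and is not taken from the PR; it validates a single name component, not a dotted fullname.

```rust
/// Returns true if `name` matches `[A-Za-z_][A-Za-z0-9_]*`.
/// Hypothetical helper for illustration; not the `arrow-avro` implementation.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first character must be an ASCII letter or underscore.
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false, // empty string or invalid first character
    }
    // Every remaining character must be an ASCII letter, digit, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

   Under this check `user_id` and `_x1` are accepted, while `1abc`, the empty string, and `naïve` are rejected. A string *value* such as `"naïve"` remains valid content; only names and enum symbols are restricted.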
     
   A new, comprehensive Avro fixture (`test/data/comprehensive_e2e.avro`) is 
included to drive end‑to‑end scenarios and edge cases.
   
   # Are there any user-facing changes?
   
   N/A


