jecsand838 opened a new pull request, #9237:
URL: https://github.com/apache/arrow-rs/pull/9237
# Which issue does this PR close?
- Closes #9231.
# Rationale for this change
Avro schema resolution allows a reader schema to represent “nullable” values
using a two-branch union (`["null", T]` or `[T, "null"]`) while still reading
data written with the non-union schema `T` (i.e. without union discriminants in
the encoded data).
In `arrow-avro`, resolving a non-union writer type against a reader union
(notably for array/list item schemas like `items: ["null", "int"]`) could
incorrectly treat the encoded stream as a union and attempt to decode a union
discriminant. This would misalign decoding and could surface as
`ParseError("bad varint")` for certain files (see #9231).
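The misalignment described above can be reproduced with a small standalone sketch. It does not use `arrow-avro`: `Cursor`, `DecodeError`, and `decode_int_array` are hypothetical names introduced only for illustration, and the decoder handles only positive block counts. It assumes the standard Avro binary encoding, where `int`/`long` are zigzag varints, an array is a series of counted blocks ending with a zero count, and a union value is prefixed by a branch index.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    /// Input ended (or overflowed) in the middle of a varint.
    BadVarint,
    /// A union branch index that does not exist in `["null", "int"]`.
    BadBranch(i64),
}

/// Hypothetical byte cursor; not an arrow-avro type.
struct Cursor<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Read one zigzag-encoded varint (Avro `int`/`long`).
    fn read_long(&mut self) -> Result<i64, DecodeError> {
        let mut value: u64 = 0;
        let mut shift = 0u32;
        loop {
            let byte = *self.data.get(self.pos).ok_or(DecodeError::BadVarint)?;
            self.pos += 1;
            if shift > 63 {
                return Err(DecodeError::BadVarint);
            }
            value |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        // Undo zigzag: (n >> 1) ^ -(n & 1).
        Ok(((value >> 1) as i64) ^ -((value & 1) as i64))
    }
}

/// Decode an Avro array of ints. When `expect_union_tag` is true, the decoder
/// (wrongly, for data written with `items: "int"`) reads a union branch index
/// before every item, as if the writer had used `["null", "int"]`.
fn decode_int_array(
    bytes: &[u8],
    expect_union_tag: bool,
) -> Result<Vec<Option<i64>>, DecodeError> {
    let mut cur = Cursor { data: bytes, pos: 0 };
    let mut out = Vec::new();
    loop {
        let count = cur.read_long()?;
        if count == 0 {
            return Ok(out); // a zero count terminates the array
        }
        for _ in 0..count {
            if expect_union_tag {
                match cur.read_long()? {
                    0 => {
                        out.push(None);
                        continue;
                    }
                    1 => {}
                    other => return Err(DecodeError::BadBranch(other)),
                }
            }
            out.push(Some(cur.read_long()?));
        }
    }
}
```

Under these assumptions, the array `[1, 2, 3]` written with `items: "int"` is the five bytes `06 02 04 06 00`. Calling `decode_int_array(bytes, false)` yields the three values. Calling it with `true` reads `02` as branch index 1, takes `04` as the value, then reads `06` as branch index 3 and fails. On other inputs the same mistake runs past the end of the buffer mid-varint, which is the shape of the error reported in the issue.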
# What changes are included in this PR?
- Fix schema resolution when the *writer* schema is non-union and the
  *reader* schema is a union:
  - Special-case two-branch unions containing `null` to be treated as
    “nullable” (capturing whether `null` is first or second), and resolve
    against the non-null branch.
  - Improve matching for general reader unions by attempting to resolve
    against each union variant, preferring a direct match, and constructing
    the appropriate union resolution mapping for the selected branch.
  - Ensure promotions are represented at the union-resolution level
    (avoiding nested promotion resolution on the selected union child).
- Add regression coverage for the bug and the fixed behavior:
  - `test_resolve_array_writer_nonunion_items_reader_nullable_items` (schema
    resolution / codec)
  - `test_array_decoding_writer_nonunion_items_reader_nullable_items`
    (record decoding; ensures correct byte consumption and decoded values)
  - `test_bad_varint_bug_nullable_array_items` (end-to-end reader regression
    using a small Avro fixture)
- Add a small compressed Avro fixture under
  `arrow-avro/test/data/bad-varint-bug.avro.gz` used by the regression test.
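The resolution rules listed above can be sketched with a toy model. This is not the `arrow-avro` implementation: `Schema`, `NullOrder`, `Resolution`, `promotes`, and `resolve_writer_nonunion` are hypothetical names, only the `int` to `long` promotion is modelled, and the `promote` flag merely stands in for "promotion recorded at the union-resolution level".

```rust
/// Toy schema model; hypothetical, far smaller than a real Avro schema type.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Int,
    Long,
    String,
}

/// Position of `null` in a two-branch nullable reader union.
#[derive(Debug, PartialEq)]
enum NullOrder {
    NullFirst,
    NullSecond,
}

#[derive(Debug, PartialEq)]
enum Resolution {
    /// Reader is `["null", T]` or `[T, "null"]`: decode the writer value
    /// directly as `T`, with no union discriminant in the data.
    Nullable { order: NullOrder, promote: bool },
    /// General reader union: place the writer value in branch `index`.
    ToUnionBranch { index: usize, promote: bool },
}

/// Only the `int` -> `long` promotion is modelled in this sketch.
fn promotes(writer: &Schema, reader: &Schema) -> bool {
    matches!((writer, reader), (Schema::Int, Schema::Long))
}

/// Resolve a non-union writer schema against the branches of a reader union.
fn resolve_writer_nonunion(writer: &Schema, reader_branches: &[Schema]) -> Option<Resolution> {
    // Special case: a two-branch union with exactly one `null` is "nullable".
    if let [a, b] = reader_branches {
        let nullable = match (a, b) {
            (Schema::Null, other) if *other != Schema::Null => {
                Some((NullOrder::NullFirst, other))
            }
            (other, Schema::Null) if *other != Schema::Null => {
                Some((NullOrder::NullSecond, other))
            }
            _ => None,
        };
        if let Some((order, non_null)) = nullable {
            if writer == non_null {
                return Some(Resolution::Nullable { order, promote: false });
            }
            if promotes(writer, non_null) {
                return Some(Resolution::Nullable { order, promote: true });
            }
        }
    }
    // General case: prefer a direct match over a promotion.
    if let Some(index) = reader_branches.iter().position(|b| b == writer) {
        return Some(Resolution::ToUnionBranch { index, promote: false });
    }
    reader_branches
        .iter()
        .position(|b| promotes(writer, b))
        .map(|index| Resolution::ToUnionBranch { index, promote: true })
}
```

In this model, a writer `int` against a reader `["null", "int"]` resolves to `Nullable` with `NullFirst`, so a decoder would read the item bytes without looking for a branch index. Against a reader `["string", "long", "int"]`, the direct match at index 2 wins over the promotion to `long` at index 1.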
# Are these changes tested?
Yes. This PR adds targeted unit/integration tests that reproduce the prior
failure mode and validate correct schema resolution and decoding for
nullable-union array items.
# Are there any user-facing changes?
Yes (bug fix): reading Avro files with arrays whose element type is
represented as a nullable union in the reader schema
(e.g. `items: ["null", "int"]`) now succeeds instead of failing with
`ParseError("bad varint")`. No public API changes are intended.