jecsand838 opened a new issue, #8698:
URL: https://github.com/apache/arrow-rs/issues/8698
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
`arrow-avro` currently implements **Dense** Union end‑to‑end, but **Sparse**
Union is not supported on the writer and is effectively forced to **Dense** on
the reader:
* On the **writer** side, `FieldPlan::build` explicitly rejects sparse
unions with `NotYetImplemented("Sparse Arrow unions are not yet supported")`.
This is reachable when the computed Avro site codec is `Codec::Union(_, _,
UnionMode::Sparse)`.
The writer also has dense‑union unit tests (e.g.
`union_encoder_string_int`, `union_encoder_null_string_int`) that verify bytes
emitted for dense unions, but there is no analogous coverage for sparse unions.
* On the **reader** side, union decoding ignores the Arrow union mode: the
union `flush` path always passes `Some(offsets)` to `UnionArray::try_new`, so
the resulting `UnionArray` is unconditionally built **with offsets** (i.e.
**Dense**).
* In the schema/codec layer, `Codec::Union` already carries a `UnionMode`,
but construction paths currently hard‑code **Dense** both when parsing Avro
unions and when resolving reader/writer union schemas:
`Codec::Union(children, union_fields, UnionMode::Dense)`.
As a result, applications with Arrow schemas using **Sparse** unions (which
Arrow defines as unions without offsets and with **equal‑length child arrays**)
cannot round‑trip through `arrow-avro`. Sparse unions are a first‑class Arrow
layout that can be preferable for some vectorized operations.
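
To make the layout difference concrete, here is a minimal sketch of the two modes. It is a simplified stand‑in written for this issue, not the arrow‑rs `UnionArray` type: `UnionColumn`, `Value`, `dense_example`, and `sparse_example` are hypothetical names, and the union is assumed to have only an int child (type id 0) and a string child (type id 1).

```rust
/// A logical value read out of the union.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
}

/// Simplified model of a union column (hypothetical, not the arrow-rs type).
struct UnionColumn {
    /// One type id per logical row (0 = ints, 1 = strs).
    type_ids: Vec<i8>,
    /// `Some` for Dense (per-row index into the selected child), `None` for Sparse.
    offsets: Option<Vec<i32>>,
    ints: Vec<Option<i32>>,
    strs: Vec<Option<String>>,
}

impl UnionColumn {
    /// Index into the selected child for `row`.
    fn child_index(&self, row: usize) -> usize {
        match &self.offsets {
            Some(offsets) => offsets[row] as usize, // Dense: follow the offset
            None => row,                            // Sparse: the row index itself
        }
    }

    /// Logical value at `row`, or `None` if the selected child slot is null.
    fn value(&self, row: usize) -> Option<Value> {
        let i = self.child_index(row);
        match self.type_ids[row] {
            0 => self.ints[i].map(Value::Int),
            1 => self.strs[i].clone().map(Value::Str),
            other => panic!("unknown type id {other}"),
        }
    }
}

/// Rows [1, "a", 2] in Dense layout: children hold only the selected values.
fn dense_example() -> UnionColumn {
    UnionColumn {
        type_ids: vec![0, 1, 0],
        offsets: Some(vec![0, 0, 1]),
        ints: vec![Some(1), Some(2)],
        strs: vec![Some("a".to_string())],
    }
}

/// The same rows in Sparse layout: no offsets, every child has length 3.
fn sparse_example() -> UnionColumn {
    UnionColumn {
        type_ids: vec![0, 1, 0],
        offsets: None,
        ints: vec![Some(1), None, Some(2)],
        strs: vec![None, Some("a".to_string()), None],
    }
}
```

Both examples describe the same three logical rows; only the child lengths and the presence of offsets differ.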
**Describe the solution you'd like**
Add **Sparse Union** support to both the `reader` and `writer`, leveraging
the existing `UnionMode` carried by `Codec::Union`.
The high‑level goals:
1. **Writer**: Encode Avro values from an Arrow **Sparse** `UnionArray` just
like dense (Avro encoding of unions is branch‑tag + branch payload either way),
but obtain child values by **row index** instead of dense offsets. Enable
`FieldPlan::build` and `UnionEncoder` to handle `UnionMode::Sparse`.
2. **Reader**: Extend `UnionDecoder` so that when the target Arrow site is a
**Sparse** union, it builds a `UnionArray` **without offsets**, giving each
child one slot per logical row by appending nulls to non‑selected children.
Continue current behavior for Dense.
3. **Codec/Schema**:
   * **Preserve** union layout hints in Avro field metadata using existing
     keys:
     * `arrowUnionMode` ∈ {`"Dense"`, `"Sparse"`}
     * `arrowUnionTypeIds` = JSON array of type IDs

     This ensures a round‑trip preserves the Arrow union flavor and stable
     type IDs.
   * **Honor** these keys during schema parsing / resolution so that
     `Codec::Union(..., UnionMode::Sparse)` can be produced when present
     (default remains Dense if absent). Today, codec construction hard‑codes
     `Dense` for unions.
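
As a rough illustration of the "honor" step, the sketch below resolves the two keys from a string‑to‑string metadata map. Several things here are assumptions rather than existing `arrow-avro` code: the helper names, the local `UnionMode` enum (a stand‑in for the Arrow one), the metadata being available as `HashMap<String, String>`, and `arrowUnionTypeIds` being stored as a JSON array of small integers in a string. A real implementation would presumably use a proper JSON parser.

```rust
use std::collections::HashMap;

/// Local stand-in for the Arrow union mode (hypothetical for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UnionMode {
    Dense,
    Sparse,
}

/// Resolve the union mode, defaulting to Dense when the key is absent.
fn union_mode_from_metadata(metadata: &HashMap<String, String>) -> Result<UnionMode, String> {
    match metadata.get("arrowUnionMode").map(String::as_str) {
        None | Some("Dense") => Ok(UnionMode::Dense),
        Some("Sparse") => Ok(UnionMode::Sparse),
        Some(other) => Err(format!("unsupported arrowUnionMode: {other}")),
    }
}

/// Parse `arrowUnionTypeIds` (assumed shape: "[0, 2, 5]") into type ids.
/// Returns `Ok(None)` when the key is absent so callers can keep the
/// current deterministic-id behavior.
fn union_type_ids_from_metadata(
    metadata: &HashMap<String, String>,
) -> Result<Option<Vec<i8>>, String> {
    let raw = match metadata.get("arrowUnionTypeIds") {
        Some(raw) => raw,
        None => return Ok(None),
    };
    let inner = raw
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .ok_or_else(|| format!("arrowUnionTypeIds is not a JSON array: {raw}"))?;
    if inner.trim().is_empty() {
        return Ok(Some(Vec::new()));
    }
    inner
        .split(',')
        .map(|t| {
            t.trim()
                .parse::<i8>()
                .map_err(|e| format!("bad type id {t:?}: {e}"))
        })
        .collect::<Result<Vec<i8>, String>>()
        .map(Some)
}
```

With no metadata at all, both helpers fall back to today's behavior (Dense mode, no explicit type IDs), which is what keeps existing files working.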
This keeps Avro bytes and schema semantics unchanged for Dense unions and
adds a parallel Sparse path consistent with the Arrow spec, in which:
* Sparse unions omit offsets.
* Every child has the full length of the union.
* Nullness comes from the children, since unions have no top‑level validity.
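
The reader‑side difference can be sketched as follows. This is a simplified model under stated assumptions, not the actual `UnionDecoder`: `Decoded`, `UnionBuffers`, and `build_union` are hypothetical names, the union has just an int branch (type id 0) and a string branch (type id 1), and plain `Vec`s stand in for Arrow builders.

```rust
/// One already-decoded Avro union value (hypothetical two-branch union).
#[derive(Debug, Clone, PartialEq)]
enum Decoded {
    Int(i32),
    Str(String),
}

/// Buffers that would later be handed to a union array constructor.
#[derive(Debug, Default, PartialEq)]
struct UnionBuffers {
    type_ids: Vec<i8>,
    /// `Some` in Dense mode, `None` in Sparse mode.
    offsets: Option<Vec<i32>>,
    ints: Vec<Option<i32>>,
    strs: Vec<Option<String>>,
}

/// Accumulate rows into Dense or Sparse buffers.
fn build_union(rows: &[Decoded], sparse: bool) -> UnionBuffers {
    let mut out = UnionBuffers {
        offsets: if sparse { None } else { Some(Vec::new()) },
        ..Default::default()
    };
    for row in rows {
        match row {
            Decoded::Int(v) => {
                out.type_ids.push(0);
                // Dense: record where this value lands in the selected child.
                if let Some(offsets) = out.offsets.as_mut() {
                    offsets.push(out.ints.len() as i32);
                }
                out.ints.push(Some(*v));
                // Sparse: pad the non-selected child so lengths stay equal.
                if sparse {
                    out.strs.push(None);
                }
            }
            Decoded::Str(s) => {
                out.type_ids.push(1);
                if let Some(offsets) = out.offsets.as_mut() {
                    offsets.push(out.strs.len() as i32);
                }
                out.strs.push(Some(s.clone()));
                if sparse {
                    out.ints.push(None);
                }
            }
        }
    }
    out
}
```

In Sparse mode every child ends up with exactly one slot per logical row, which is the invariant the final array construction would rely on when it is given no offsets.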
**Describe alternatives you've considered**
* **Always normalize to Dense**: We could keep writing/reading only Dense
and convert Sparse to Dense at the boundaries. This would "work" but defeats
the purpose of carrying Sparse union semantics across systems and can have
performance implications, given Sparse can be preferable for some vectorized
operations.
* **Map Sparse to Struct + validity**: Another option is mapping to a
`Struct` with a type‑id column. This diverges from Arrow’s union layout, breaks
consumer expectations, and is incompatible with existing Dense union round‑trip
behavior.
**Additional context**
* **Backwards compatibility**: If the incoming Avro schema lacks
`arrowUnionMode`, the codec defaults to **Dense** exactly as today. Existing
files and tests continue to pass.
* **Nulls**: Unions have no top‑level validity. In **Sparse** mode,
non‑selected children **must** append a null per row so that all children have
equal length, as the Arrow spec requires.
* **Type IDs**: Use `arrowUnionTypeIds` when present to construct
`UnionFields::new(type_ids, fields)`; otherwise keep current behavior
(deterministic IDs from `build_union_fields`).
* **Avro bytes identical across modes**: Dense vs Sparse is an Arrow‑side
layout choice; Avro encoding is the same (branch index + payload). Our per‑row
encoder only changes how it **fetches** the child element (offset vs row
index); emitted bytes remain unchanged for the same logical values.
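
A small sketch of the "same bytes" claim, assuming a two‑branch Avro union `["int", "string"]` whose Avro branch index equals the Arrow type id. The `UnionColumn` struct and all function names are hypothetical stand‑ins, not the `UnionEncoder` in `arrow-avro`; only the wire format (zig‑zag varint branch index, then the branch payload) follows the Avro specification.

```rust
/// Simplified union column (hypothetical, not the arrow-rs type).
struct UnionColumn {
    type_ids: Vec<i8>,
    /// `Some` for Dense, `None` for Sparse.
    offsets: Option<Vec<i32>>,
    ints: Vec<Option<i32>>,
    strs: Vec<Option<String>>,
}

/// Avro int/long encoding: zig-zag, then base-128 varint (low bits first).
fn write_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push(((z & 0x7f) as u8) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Encode every row as branch index + payload.
fn encode(col: &UnionColumn) -> Vec<u8> {
    let mut out = Vec::new();
    for row in 0..col.type_ids.len() {
        // The only mode-dependent step: how the child element is located.
        let i = match &col.offsets {
            Some(offsets) => offsets[row] as usize,
            None => row,
        };
        match col.type_ids[row] {
            0 => {
                let v = col.ints[i].expect("selected int slot must be non-null");
                write_long(0, &mut out);
                write_long(i64::from(v), &mut out);
            }
            1 => {
                let s = col.strs[i]
                    .as_deref()
                    .expect("selected string slot must be non-null");
                write_long(1, &mut out);
                write_long(s.len() as i64, &mut out);
                out.extend_from_slice(s.as_bytes());
            }
            other => panic!("unknown type id {other}"),
        }
    }
    out
}

/// Rows [1, "a", 2] in Dense layout.
fn dense_rows() -> UnionColumn {
    UnionColumn {
        type_ids: vec![0, 1, 0],
        offsets: Some(vec![0, 0, 1]),
        ints: vec![Some(1), Some(2)],
        strs: vec![Some("a".to_string())],
    }
}

/// The same rows in Sparse layout.
fn sparse_rows() -> UnionColumn {
    UnionColumn {
        type_ids: vec![0, 1, 0],
        offsets: None,
        ints: vec![Some(1), None, Some(2)],
        strs: vec![None, Some("a".to_string()), None],
    }
}
```

For these rows both layouts produce the bytes `00 02 02 02 61 00 04`, so a sparse writer test could reuse the expected bytes of the corresponding dense test.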