[I] [arrow-avro] Add Sparse Union support [arrow-rs]

via GitHub Thu, 23 Oct 2025 13:33:02 -0700


jecsand838 opened a new issue, #8698:
URL: https://github.com/apache/arrow-rs/issues/8698


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   `arrow-avro` currently implements **Dense** Union end‑to‑end, but **Sparse** 
Union is not supported on the writer and is effectively forced to **Dense** on 
the reader:
   
   * On the **writer** side, `FieldPlan::build` explicitly rejects sparse 
unions with `NotYetImplemented("Sparse Arrow unions are not yet supported")`. 
This is reachable when the computed Avro site codec is `Codec::Union(_, _, 
UnionMode::Sparse)`.
     The writer also has dense‑union unit tests (e.g. 
`union_encoder_string_int`, `union_encoder_null_string_int`) that verify bytes 
emitted for dense unions, but there is no analogous coverage for sparse unions.
   * On the **reader** side, union decoding unconditionally builds Arrow 
`UnionArray` **with offsets** (i.e. Dense) in `flush`, rather than honoring the 
Arrow union mode. In the union flush path the code always passes 
`Some(offsets)` to `UnionArray::try_new`, which produces a **Dense** union.
   * In the schema/codec layer, `Codec::Union` already carries a `UnionMode`, 
but construction paths currently hard‑code **Dense** both when parsing Avro 
unions and when resolving reader/writer union schemas:
     `Codec::Union(children, union_fields, UnionMode::Dense)`.
   
   As a result, applications with Arrow schemas using **Sparse** unions (which 
Arrow defines as unions without offsets and with **equal‑length child arrays**) 
cannot round‑trip through `arrow-avro`. Sparse unions are a first‑class Arrow 
layout and can be preferable to Dense for some vectorized operations.
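
   To make the layout difference concrete, here is a minimal std‑only sketch 
of the two union flavors and the invariants each must satisfy. `UnionModel` and 
`validate` are hypothetical names used only for illustration; they are not the 
`arrow` crate's `UnionArray` API, and the sketch assumes every child holds 
`Option<i64>` and that a type ID equals the child's index.

```rust
/// Simplified stand-in for an Arrow union column (illustration only).
pub struct UnionModel {
    /// One type id per logical row.
    pub type_ids: Vec<i8>,
    /// `Some(..)` models a Dense union, `None` models a Sparse union.
    pub offsets: Option<Vec<i32>>,
    /// Assumption: the position in this Vec is the type id.
    pub children: Vec<Vec<Option<i64>>>,
}

/// Checks the layout invariants described above.
pub fn validate(u: &UnionModel) -> Result<(), String> {
    let rows = u.type_ids.len();
    for &t in &u.type_ids {
        if t < 0 || (t as usize) >= u.children.len() {
            return Err(format!("unknown type id {}", t));
        }
    }
    match &u.offsets {
        // Sparse: no offsets, every child is as long as the union itself.
        None => {
            for (i, child) in u.children.iter().enumerate() {
                if child.len() != rows {
                    return Err(format!(
                        "sparse child {} has length {}, expected {}",
                        i,
                        child.len(),
                        rows
                    ));
                }
            }
        }
        // Dense: one offset per row, pointing into the selected child.
        Some(offsets) => {
            if offsets.len() != rows {
                return Err(format!(
                    "expected {} offsets, found {}",
                    rows,
                    offsets.len()
                ));
            }
            for (row, (&t, &o)) in u.type_ids.iter().zip(offsets).enumerate() {
                let child = &u.children[t as usize];
                if o < 0 || (o as usize) >= child.len() {
                    return Err(format!("row {}: offset {} out of range", row, o));
                }
            }
        }
    }
    Ok(())
}
```

   In this model the same three logical rows can be stored either as compact 
children plus offsets (Dense) or as full‑length children without offsets 
(Sparse); only the second shape is what the reader currently never produces.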
   
   **Describe the solution you'd like**
   
   Add **Sparse Union** support to both the `reader` and `writer`, leveraging 
the existing `UnionMode` carried by `Codec::Union`. 
   
   The high‑level goals:
   1. **Writer**: Encode Avro values from an Arrow **Sparse** `UnionArray` just 
like dense (Avro encoding of unions is branch‑tag + branch payload either way), 
but obtain child values by **row index** instead of dense offsets. Enable 
`FieldPlan::build` and `UnionEncoder` to handle `UnionMode::Sparse`.
   2. **Reader**: Extend `UnionDecoder` so that when the target Arrow site is a 
**Sparse** union, it builds a `UnionArray` **without offsets**. For every 
logical row, the selected child receives the decoded value and each 
non‑selected child receives a null, so all children stay the same length. 
Continue current behavior for Dense.
   3. **Codec/Schema**:
      * **Preserve** union layout hints in Avro field metadata using existing 
keys:
        * `arrowUnionMode` ∈ {`"Dense"`, `"Sparse"`}
        * `arrowUnionTypeIds` = JSON array of type IDs

        This ensures a round‑trip preserves the Arrow union flavor and stable 
type IDs.
      * **Honor** these keys during schema parsing / resolution so that 
`Codec::Union(..., UnionMode::Sparse)` can be produced when present (default 
remains Dense if absent). Today, codec construction hard‑codes `Dense` for 
unions.
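
   As a rough illustration of the "honor these keys" step, the sketch below 
reads `arrowUnionMode` and `arrowUnionTypeIds` from a string‑to‑string metadata 
map and falls back to Dense when the mode key is absent. The function names, 
the local `UnionMode` enum, the metadata map type, and the error on an 
unrecognized mode are all assumptions made for this example; the array parsing 
is a deliberately naive std‑only stand‑in for a real JSON parser.

```rust
use std::collections::HashMap;

/// Local stand-in for Arrow's union mode (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnionMode {
    Dense,
    Sparse,
}

/// Returns the union mode hinted by the metadata, defaulting to Dense.
pub fn union_mode_from_metadata(
    md: &HashMap<String, String>,
) -> Result<UnionMode, String> {
    match md.get("arrowUnionMode").map(|s| s.as_str()) {
        // Absent key keeps today's behavior.
        None | Some("Dense") => Ok(UnionMode::Dense),
        Some("Sparse") => Ok(UnionMode::Sparse),
        // Assumption: unknown values are rejected rather than ignored.
        Some(other) => Err(format!("unrecognized arrowUnionMode: {}", other)),
    }
}

/// Parses `arrowUnionTypeIds` (e.g. "[0, 2, 5]") if present.
/// Naive parsing: handles only a flat array of small integers.
pub fn type_ids_from_metadata(
    md: &HashMap<String, String>,
) -> Result<Option<Vec<i8>>, String> {
    let raw = match md.get("arrowUnionTypeIds") {
        Some(r) => r.trim(),
        None => return Ok(None),
    };
    let inner = raw
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .ok_or_else(|| format!("expected a JSON array, got {}", raw))?;
    if inner.trim().is_empty() {
        return Ok(Some(Vec::new()));
    }
    inner
        .split(',')
        .map(|tok| {
            tok.trim()
                .parse::<i8>()
                .map_err(|e| format!("bad type id {:?}: {}", tok, e))
        })
        .collect::<Result<Vec<i8>, String>>()
        .map(Some)
}
```

   Returning `Ok(None)` for missing type IDs leaves room for the caller to 
keep whatever deterministic ID assignment it already uses.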
   
   This keeps Avro bytes and schema semantics unchanged for Dense unions and 
adds a parallel Sparse path that follows the Arrow spec: 
   * Sparse unions omit offsets.
   * Every child has the full length of the union.
   * Validity comes from the children, since unions have no top‑level validity.
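
   The reader‑side difference can be sketched with two toy builders for a 
union of `long` and `string`. Both type names, the `Decoded` enum, and the 
fixed type IDs 0 and 1 are hypothetical and exist only for this example; this 
is not the `UnionDecoder` implementation.

```rust
/// A decoded Avro union value for a ["long", "string"] union (illustration).
#[derive(Debug, Clone, PartialEq)]
pub enum Decoded {
    Long(i64),
    Str(String),
}

/// Sparse: no offsets; every child gets one slot per row.
#[derive(Debug, Default)]
pub struct SparseUnionBuilder {
    pub type_ids: Vec<i8>,
    pub longs: Vec<Option<i64>>,
    pub strs: Vec<Option<String>>,
}

impl SparseUnionBuilder {
    pub fn append(&mut self, v: Decoded) {
        match v {
            Decoded::Long(x) => {
                self.type_ids.push(0);
                self.longs.push(Some(x));
                self.strs.push(None); // null in the non-selected child
            }
            Decoded::Str(s) => {
                self.type_ids.push(1);
                self.longs.push(None); // null in the non-selected child
                self.strs.push(Some(s));
            }
        }
    }
}

/// Dense: offsets point into compact children; only the selected child grows.
#[derive(Debug, Default)]
pub struct DenseUnionBuilder {
    pub type_ids: Vec<i8>,
    pub offsets: Vec<i32>,
    pub longs: Vec<i64>,
    pub strs: Vec<String>,
}

impl DenseUnionBuilder {
    pub fn append(&mut self, v: Decoded) {
        match v {
            Decoded::Long(x) => {
                self.type_ids.push(0);
                self.offsets.push(self.longs.len() as i32);
                self.longs.push(x);
            }
            Decoded::Str(s) => {
                self.type_ids.push(1);
                self.offsets.push(self.strs.len() as i32);
                self.strs.push(s);
            }
        }
    }
}
```

   After appending `Long(7)`, `Str("a")`, `Long(9)`, the sparse builder holds 
two children of length three, while the dense builder holds children of length 
two and one plus three offsets.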
   
   **Describe alternatives you've considered**
   
   * **Always normalize to Dense**: We could keep writing/reading only Dense 
and convert Sparse to Dense at the boundaries. This would "work" but defeats 
the purpose of carrying Sparse union semantics across systems and can have 
performance implications, given Sparse can be preferable for some vectorized 
operations.
   * **Map Sparse to Struct + validity**: Another option is mapping to a 
`Struct` with a type‑id column. This diverges from Arrow’s union layout, breaks 
consumer expectations, and is incompatible with existing Dense union round‑trip 
behavior.
   
   **Additional context**
   
   * **Backwards compatibility**: If the incoming Avro schema lacks 
`arrowUnionMode`, the codec defaults to **Dense** exactly as today. Existing 
files and tests continue to pass.
   * **Nulls**: Unions have no top‑level validity. In **Sparse** mode, every 
non‑selected child **must** still append one slot per row (a null here) so that 
all children have equal length, which the Arrow spec requires.
   * **Type IDs**: Use `arrowUnionTypeIds` when present to construct 
`UnionFields::new(type_ids, fields)`; otherwise keep current behavior 
(deterministic IDs from `build_union_fields`).
   * **Avro bytes identical across modes**: Dense vs Sparse is an Arrow‑side 
layout choice; Avro encoding is the same (branch index + payload). Our per‑row 
encoder only changes how it **fetches** the child element (offset vs row 
index); emitted bytes remain unchanged for the same logical values.
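
   The last point can be checked with a small std‑only sketch of per‑row union 
encoding for a `["long", "string"]` union. `UnionColumn`, `encode`, and 
`write_long` are hypothetical names for this example, and it assumes the Arrow 
type ID equals the Avro branch index. The wire format used is the standard Avro 
binary encoding: a zigzag varint branch index, then the branch value (a zigzag 
varint for `long`, a length‑prefixed UTF‑8 byte sequence for `string`).

```rust
/// Toy union column with a `long` child (type id 0) and a `string` child
/// (type id 1). `offsets: None` models Sparse, `Some(..)` models Dense.
pub struct UnionColumn {
    pub type_ids: Vec<i8>,
    pub offsets: Option<Vec<i32>>,
    pub longs: Vec<i64>,
    pub strs: Vec<String>,
}

/// Avro `long` encoding: zigzag, then base-128 varint.
pub fn write_long(out: &mut Vec<u8>, v: i64) {
    let mut z = ((v << 1) ^ (v >> 63)) as u64;
    while z >= 0x80 {
        out.push(((z as u8) & 0x7F) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Encodes every row as branch index + payload.
pub fn encode(col: &UnionColumn) -> Vec<u8> {
    let mut out = Vec::new();
    for (row, &tid) in col.type_ids.iter().enumerate() {
        // The only mode-dependent step: where to find the child element.
        let idx = match &col.offsets {
            Some(offsets) => offsets[row] as usize, // Dense: use the offset
            None => row,                            // Sparse: use the row index
        };
        // Assumption: type id == Avro branch index.
        write_long(&mut out, tid as i64);
        match tid {
            0 => write_long(&mut out, col.longs[idx]),
            1 => {
                let bytes = col.strs[idx].as_bytes();
                write_long(&mut out, bytes.len() as i64);
                out.extend_from_slice(bytes);
            }
            other => panic!("unknown type id {}", other),
        }
    }
    out
}
```

   Encoding the rows `7`, `"a"`, `9` from a sparse column (children padded to 
length three) and from a dense column (children of length two and one with 
offsets `[0, 0, 1]`) yields the same byte sequence in this model.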


