[I] Support `Union` data types for row format [arrow-rs]

via GitHub Wed, 12 Nov 2025 13:39:15 -0800


friendlymatthew opened a new issue, #8828:
URL: https://github.com/apache/arrow-rs/issues/8828


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Since `arrow-row` currently lacks support for Union data types, we cannot 
use `RowConverter` for sorting operations on `Union` columns. This forces us to 
fall back to converting Union types to strings, which is inefficient and loses 
proper type ordering semantics.
   
   
   **Describe the solution you'd like**
   Add support for encoding and decoding Union data types in the row format. 
Union types represent a tagged union, where each row contains a type ID 
indicating which variant is active, along with the corresponding value.
   
   ## Proposed encoding format
   Each union row would be encoded as: 
   ```
   [null_sentinel: 1 byte][type_id: 1 byte][child_row: variable length]
   ```
   
   where: 
   - `null_sentinel`: 0x00 or 0x01 following existing conventions (inverted for 
desc sort)
   - `type_id`: the `i8` value from the union's `type_ids` buffer
   - `child_row`: the encoded bytes from the appropriate child field's 
converter
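
As a rough illustration of the layout above, here is a minimal sketch in plain Rust with no `arrow-row` dependency. The helper name `encode_union_row` is hypothetical and not part of any existing API. It covers only the ascending case and assumes type ids are non-negative, so that casting the `i8` to `u8` preserves byte-wise order.

```rust
/// Hypothetical helper (not an arrow-row API): encodes one union row as
/// [null_sentinel: 1 byte][type_id: 1 byte][child_row: variable length].
///
/// Assumptions: ascending order only, and `type_id >= 0` so that the
/// `i8 -> u8` cast keeps byte-wise ordering consistent with type id ordering.
fn encode_union_row(is_valid: bool, type_id: i8, child_row: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + child_row.len());
    // Null sentinel: 0x00 for null, 0x01 for a valid row.
    out.push(if is_valid { 0x01 } else { 0x00 });
    // Type id taken from the union's `type_ids` buffer.
    out.push(type_id as u8);
    // Bytes already produced by the selected child field's converter.
    out.extend_from_slice(child_row);
    out
}
```

For instance, a valid row with type id `1` and child bytes `[0xAA, 0xBB]` would come out as `[0x01, 0x01, 0xAA, 0xBB]`.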
   
   I guess the main design decision involves the ordering semantics across 
different union variants. Rows would sort by type id first, then by value. This 
maintains the row format's memcmp-based sorting invariant while providing a 
predictable cross-type ordering.
   
   For example, in a `Union<0: Int32, 1: Utf8>`:
   - all `i32` values sort before all string values
   - within integers: standard numeric ordering
   - within strings: lexicographic ordering
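
The ordering in this example can be checked with a small self-contained sketch. The child encodings here are simplified stand-ins: `encode_i32` flips the sign bit of the big-endian bytes so that byte order matches numeric order, and strings are written as raw UTF-8 bytes rather than with the actual `arrow-row` variable-length encoding. All function names are hypothetical.

```rust
/// Stand-in for an Int32 child converter: big-endian bytes with the sign
/// bit flipped, so unsigned byte-wise comparison matches numeric order.
fn encode_i32(v: i32) -> [u8; 4] {
    let mut bytes = v.to_be_bytes();
    bytes[0] ^= 0x80;
    bytes
}

/// Hypothetical helper: a valid union row, [0x01][type_id][child_row].
/// Assumes a non-negative type id.
fn union_row(type_id: i8, child_row: &[u8]) -> Vec<u8> {
    let mut row = vec![0x01, type_id as u8];
    row.extend_from_slice(child_row);
    row
}

/// Builds a few rows of a `Union<0: Int32, 1: Utf8>` and sorts them
/// by plain byte-wise comparison, as a memcmp-based sort would.
fn sorted_example_rows() -> Vec<Vec<u8>> {
    let mut rows = vec![
        union_row(1, b"apple"),
        union_row(0, &encode_i32(7)),
        union_row(1, b"Zebra"),
        union_row(0, &encode_i32(-3)),
    ];
    rows.sort();
    rows
}
```

After sorting, both integer rows come first (`-3` before `7`), followed by the string rows in byte-wise order (`"Zebra"` before `"apple"`, since uppercase ASCII sorts before lowercase).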
   
   Both sparse and dense union modes use the same row encoding format, 
differing only in how they index child arrays during encoding.
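
The indexing difference between the two modes can be sketched as follows. This follows the Arrow columnar layout, where sparse union children have the same length as the union itself and dense unions carry a separate `i32` offsets buffer. The `UnionLayout` enum and `child_index` function are hypothetical names used only for illustration.

```rust
/// Hypothetical description of how a union stores its children.
enum UnionLayout<'a> {
    /// Sparse: every child array has the same length as the union,
    /// so the child is indexed by the row number itself.
    Sparse,
    /// Dense: an offsets buffer maps each row to a slot in its child array.
    Dense { offsets: &'a [i32] },
}

/// Returns the index into the selected child array for a given union row.
/// The child itself is chosen separately, via the row's type id.
fn child_index(layout: &UnionLayout<'_>, row: usize) -> usize {
    match layout {
        UnionLayout::Sparse => row,
        UnionLayout::Dense { offsets } => offsets[row] as usize,
    }
}
```

Once the child index is resolved, the remaining encoding steps would be identical for both modes.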


