friendlymatthew opened a new issue, #8828: URL: https://github.com/apache/arrow-rs/issues/8828
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Since `arrow-row` currently lacks support for Union data types, we cannot use `RowConverter` for sorting operations on `Union` columns. This forces us to fall back to converting Union types to strings, which is inefficient and loses proper type ordering semantics.

**Describe the solution you'd like**

Add support for encoding and decoding Union data types in the row format. A Union type represents a tagged union: each row carries a type ID indicating which variant is active, along with the corresponding value.

## Proposed encoding format

Each union row would be encoded as:

```
[null_sentinel: 1 byte][type_id: 1 byte][child_row: variable length]
```

where:
- `null_sentinel`: 0x00 or 0x01 following existing conventions (inverted for desc sort)
- `type_id`: the `i8` value from the union's `type_ids` buffer
- `child_row`: the encoded bytes from the appropriate child field's converter

I think the main design decision is the ordering semantics across different union variants. Rows would sort by type ID first and by value second. This maintains the row format's memcmp-based sorting invariant while providing a predictable cross-type ordering.

For example, in a `Union<0: Int32, 1: Utf8>`:
- all `i32` values sort before all string values
- within integers: standard numeric ordering
- within strings: lexicographic ordering

Both sparse and dense union modes use the same row encoding format, differing only in how they index child arrays during encoding.
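To make the proposed layout concrete, here is a minimal standalone sketch of the encoding for a `Union<0: Int32, 1: Utf8>`. It does not use the `arrow-row` API. `UnionValue`, `encode_i32`, `encode_union_row`, and `child_index` are hypothetical names introduced only for illustration. The sketch assumes ascending order with nulls first (0x00 for null, 0x01 for valid) and non-negative type IDs, and it writes string bytes directly, which is a simplification of how the real row format handles variable-length data.

```rust
/// Hypothetical stand-in for one slot of a `Union<0: Int32, 1: Utf8>` column.
enum UnionValue<'a> {
    Null,
    Int32(i32),
    Utf8(&'a str),
}

// Assumed type IDs, matching the example union in this issue.
const TYPE_ID_INT32: i8 = 0;
const TYPE_ID_UTF8: i8 = 1;

/// Order-preserving i32 encoding: flip the sign bit, then write big-endian,
/// so that byte-wise comparison agrees with numeric comparison.
fn encode_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Encodes one value as `[null_sentinel][type_id][child_row]`.
/// Assumption: ascending order, nulls first; sort options are not modeled.
fn encode_union_row(value: &UnionValue) -> Vec<u8> {
    let mut row = Vec::new();
    match value {
        // A null row carries only the sentinel in this sketch.
        UnionValue::Null => row.push(0x00),
        UnionValue::Int32(v) => {
            row.push(0x01);
            // Assumption: type IDs are non-negative, so the raw byte sorts correctly.
            row.push(TYPE_ID_INT32 as u8);
            row.extend_from_slice(&encode_i32(*v));
        }
        UnionValue::Utf8(s) => {
            row.push(0x01);
            row.push(TYPE_ID_UTF8 as u8);
            // Simplification: raw UTF-8 bytes, not the real variable-length encoding.
            row.extend_from_slice(s.as_bytes());
        }
    }
    row
}

/// Resolves which child slot to read for a given union row.
/// Dense unions go through the offsets buffer; sparse unions reuse the row
/// index, since every child array has the same length as the union.
fn child_index(row: usize, dense_offsets: Option<&[i32]>) -> usize {
    match dense_offsets {
        Some(offsets) => offsets[row] as usize,
        None => row,
    }
}
```

Comparing the resulting byte vectors lexicographically orders nulls first, then every `Int32` value in numeric order, then every `Utf8` value in byte order, which is the type-ID-first, value-second ordering described above. Only `child_index` differs between sparse and dense unions; the encoded bytes are the same.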
