zhuqi-lucas opened a new issue, #22715:
URL: https://github.com/apache/datafusion/issues/22715

   ## Background
   
   `GroupValuesColumn` (the column-wise multi-column GROUP BY storage) has 
type-specific specializations under `multi_group_by/` for primitive numerics, 
byte / byte-view, boolean, decimal128, and date / time / timestamp variants. Any 
type outside that set drags the entire grouping onto the byte-encoded 
`GroupValuesRows` fallback (see #22682 for the structural cost of that lock-in 
on wide GROUP BYs).
   
   #22682 + PR #22706 add `GroupColumn` support for the nested cases 
(`FixedSizeList<primitive>`, `List<T>`, `LargeList<T>`, `Struct<...>`) plus a 
recursive `make_group_column` factory. Once that factory is in place, 
additional primitive specializations become small, mostly local additions: a 
new builder file, one dispatch arm, an entry in `supported_type`, and tests.
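
As a rough illustration of those touch points, here is a minimal, self-contained sketch. Everything in it is a hypothetical stand-in rather than the real DataFusion code: the `DataType` enum, the three-method `GroupColumn` trait, and `FixedWidthBytesBuilder` are simplified for illustration, and null handling is left out entirely.

```rust
// Hypothetical stand-ins: the real Arrow `DataType` and DataFusion
// `GroupColumn` trait are much richer than what is shown here.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    FixedSizeBinary(usize), // byte width is only known at runtime
    Map,                    // stands in for a type with no specialization
}

/// Simplified column-wise group storage interface (assumed shape).
trait GroupColumn {
    fn append(&mut self, value: &[u8]);
    fn equal_to(&self, group: usize, value: &[u8]) -> bool;
    fn len(&self) -> usize;
}

/// 1. The new builder: one flat buffer holding `width` bytes per group.
struct FixedWidthBytesBuilder {
    width: usize,
    values: Vec<u8>,
    count: usize,
}

impl GroupColumn for FixedWidthBytesBuilder {
    fn append(&mut self, value: &[u8]) {
        assert_eq!(value.len(), self.width, "value width mismatch");
        self.values.extend_from_slice(value);
        self.count += 1;
    }

    fn equal_to(&self, group: usize, value: &[u8]) -> bool {
        let start = group * self.width;
        &self.values[start..start + self.width] == value
    }

    fn len(&self) -> usize {
        self.count
    }
}

/// 2. The allow-list entry.
fn supported_type(data_type: &DataType) -> bool {
    matches!(data_type, DataType::FixedSizeBinary(_))
}

/// 3. The dispatch arm in the factory.
fn make_group_column(data_type: &DataType) -> Option<Box<dyn GroupColumn>> {
    match data_type {
        DataType::FixedSizeBinary(width) => Some(Box::new(FixedWidthBytesBuilder {
            width: *width,
            values: Vec::new(),
            count: 0,
        })),
        _ => None,
    }
}
```

The invariant worth testing is that `supported_type` returns `true` exactly when `make_group_column` returns a builder; if the two drift apart, a type can be routed to the column-wise path without a builder to back it.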
   
   This EPIC tracks the remaining common primitive types that are not yet 
supported.
   
   ## Out of scope
   
   - Nested type support (List / Struct / Map / FixedSizeList): tracked in 
#22682, in flight via #22706.
   - Generic fallback so any Arrow type goes through `GroupValuesColumn`: 
tracked in #22701 (orthogonal direction).
   
   ## Types to add, easiest first
   
   - [ ] **`FixedSizeBinary`**. Fixed-width bytes per row. Closest in shape to 
`PrimitiveGroupValueBuilder` but with a runtime-known fixed byte width. Likely 
the smallest new builder.
   - [ ] **`Float16`**. Arrow already has the primitive type. Need explicit NaN 
handling in `is_eq`: Float16's `is_eq` returns false for NaN-vs-NaN, but under 
GROUP BY semantics two NaN keys should typically be considered equal. Verify 
this against the existing Float32 / Float64 behavior in 
`PrimitiveGroupValueBuilder`.
   - [ ] **`Duration(TimeUnit)`**. Same shape as `Timestamp` (four `TimeUnit` 
arms in the dispatcher), four `DurationXxxType` slot-ins.
   - [ ] **`Interval(IntervalUnit)`**. Three variants (YearMonth = 4 bytes, 
DayTime = 8 bytes, MonthDayNano = 16 bytes), so three separate dispatcher arms 
and three native widths.
   - [ ] **`Decimal256`**. `arrow::array::types::Decimal256Type` has `Native = 
arrow_buffer::i256`, which is a 32-byte struct rather than a `Copy`-cheap 
native scalar. Either relax the `T: Copy` requirement in 
`PrimitiveGroupValueBuilder` or add a sibling builder specialized to wide 
native types.
   - [ ] **`Dictionary<K, V>`**. Most involved. Need to decide on semantics for 
the group key:
     - **Option A**: hash / compare on the dictionary's decoded logical value. 
Conceptually clean, behaves like `Utf8` / `Binary`. Costlier in memory because 
each unique decoded value is materialized at intern time.
     - **Option B**: hash / compare on the encoded key under a fixed-dictionary 
contract (i.e. the same `K -> V` mapping is asserted across batches). Cheaper 
but only safe if the dictionary is shared / known-stable, which is not 
guaranteed by Arrow at the schema level.
     - Worth its own discussion before any implementation lands.
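
To make the Option A / Option B trade-off concrete, the sketch below groups two toy dictionary-encoded batches both ways. `DictBatch`, `group_by_value`, and `group_by_key` are hypothetical names invented for this illustration, not Arrow or DataFusion APIs, and nulls are ignored. The two batches deliberately use different dictionaries, which is the case Option B's fixed-dictionary contract would have to rule out.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded batch (hypothetical): `keys[i]` indexes `values`.
struct DictBatch {
    values: Vec<&'static str>,
    keys: Vec<usize>,
}

/// Option A: group on the decoded logical value.
/// Returns one group id per input row, in row order.
fn group_by_value(batches: &[DictBatch]) -> Vec<usize> {
    let mut groups: HashMap<&str, usize> = HashMap::new();
    let mut ids = Vec::new();
    for batch in batches {
        for &key in &batch.keys {
            let next_id = groups.len();
            ids.push(*groups.entry(batch.values[key]).or_insert(next_id));
        }
    }
    ids
}

/// Option B: group on the raw encoded key, trusting that every batch
/// shares the same key-to-value mapping.
fn group_by_key(batches: &[DictBatch]) -> Vec<usize> {
    let mut groups: HashMap<usize, usize> = HashMap::new();
    let mut ids = Vec::new();
    for batch in batches {
        for &key in &batch.keys {
            let next_id = groups.len();
            ids.push(*groups.entry(key).or_insert(next_id));
        }
    }
    ids
}

/// Two batches whose dictionaries list the same values in a different order.
/// Decoded rows are: "a", "b", "b", "a".
fn example_batches() -> Vec<DictBatch> {
    vec![
        DictBatch { values: vec!["a", "b"], keys: vec![0, 1] },
        DictBatch { values: vec!["b", "a"], keys: vec![0, 1] },
    ]
}
```

On `example_batches()`, grouping by value assigns the rows to groups `[0, 1, 1, 0]`, while grouping by key yields `[0, 1, 0, 1]`, silently merging `"a"` with `"b"`. Option B is therefore only correct when something outside the schema guarantees a stable dictionary.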
   
   ## Cross-cutting
   
   - All new builders should follow the same testing structure as the existing 
primitive ones: append / build round trip, equal_to (identical + different + 
null edges), `take_n` boundary cases (`n=0`, `n=len`, with-null prefix), sliced 
input, `vectorized_*` matches per-row, `size()` grows, `build` on empty, and 
an entry in the `supported_type` ↔ `make_group_column` consistency fuzz.
   - Each new type should slot into the dhat-based memory regression harness 
once that is in place (see #22706 review thread).
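
The `take_n` boundary cases in the checklist above can be sketched against a toy column. `ToyGroupColumn` is a hypothetical stand-in that stores `Option<i64>` values in a `Vec`, and it assumes `take_n` removes and returns the first `n` groups while keeping the rest in order. The real builders keep separate value and null buffers, so the actual tests would assert on Arrow arrays instead.

```rust
/// Toy column of nullable group values (hypothetical, not a DataFusion type).
#[derive(Default)]
struct ToyGroupColumn {
    values: Vec<Option<i64>>,
}

impl ToyGroupColumn {
    fn append(&mut self, value: Option<i64>) {
        self.values.push(value);
    }

    fn len(&self) -> usize {
        self.values.len()
    }

    /// Remove and return the first `n` groups, keeping the rest in order.
    fn take_n(&mut self, n: usize) -> Vec<Option<i64>> {
        assert!(n <= self.values.len(), "n exceeds number of groups");
        self.values.drain(..n).collect()
    }

    /// Consume the column and return all remaining groups.
    fn build(self) -> Vec<Option<i64>> {
        self.values
    }
}

/// Column whose first group is null: [NULL, 1, 2].
fn filled() -> ToyGroupColumn {
    let mut col = ToyGroupColumn::default();
    for value in [None, Some(1), Some(2)] {
        col.append(value);
    }
    col
}

fn check_take_n_boundaries() {
    // n = 0: nothing is taken, nothing is lost.
    let mut col = filled();
    assert!(col.take_n(0).is_empty());
    assert_eq!(col.len(), 3);

    // n = len: everything is taken and the column is left empty.
    let mut col = filled();
    assert_eq!(col.take_n(3), vec![None, Some(1), Some(2)]);
    assert_eq!(col.len(), 0);

    // With-null prefix: the null is taken, the non-null tail survives.
    let mut col = filled();
    assert_eq!(col.take_n(1), vec![None::<i64>]);
    assert_eq!(col.build(), vec![Some(1), Some(2)]);

    // `build` on an empty column yields no groups.
    assert!(ToyGroupColumn::default().build().is_empty());
}
```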
   
   ## Dependency
   
   Blocked on the `make_group_column` factory landing (PR #22706 sequence). Once 
that is in, each item here is independently mergeable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

