zhuqi-lucas opened a new pull request, #22706:
URL: https://github.com/apache/datafusion/pull/22706

   ## Which issue does this PR close?
   
   Closes #22682.
   
   ## Rationale for this change
   
   Today `multi_group_by::supported_schema` is all-or-nothing: if any single 
GROUP BY column is not in the `supported_type` allow-list, the whole grouping 
falls back to the byte-encoded `GroupValuesRows` path, even when every other 
column would have qualified for the fast column-wise + cross-column 
short-circuit path. The most common offender in real workloads is a single 
nested column (e.g. `LargeList<Struct<Utf8, LargeUtf8>>`) that drags an 
otherwise primitive-heavy 30+ column GROUP BY onto the slow path.
   
   This PR adds column-native `GroupColumn` implementations for 
`FixedSizeList<primitive>`, `Struct<...>`, `List<T>`, and `LargeList<T>`, so a 
GROUP BY containing any of these no longer falls back from `GroupValuesColumn` 
to `GroupValuesRows`.
   
   ## What changes are included in this PR?
   
   - `FixedSizeListGroupValueBuilder<T>` for `FixedSizeList<primitive>`. Uses 
an inner `PrimitiveGroupValueBuilder` for the flat child storage.
   - `StructGroupValueBuilder` for `Struct<...>` with recursive child support.
   - `ListGroupValueBuilder<O: OffsetSizeTrait>` for `List<T>` (O=i32) and 
`LargeList<T>` (O=i64) with recursive child support. Null outer rows occupy a 
zero-length range so the offset buffer stays monotonic without consuming child 
slots.
   - A recursive `make_group_column(field: &Field) -> Result<Box<dyn 
GroupColumn>>` factory in `multi_group_by/mod.rs` that replaces the inline 
match in `GroupValuesColumn::intern`. Walks nested fields and recursively 
builds children for `Struct` / `List` / `LargeList`.
   - `supported_type` recursively checks children for `List` / `LargeList` / 
`Struct`. `FixedSizeList` is restricted to primitive children in this PR; 
composing FSL with nested children can follow in a separate PR.
   
   ## Are these changes tested?
   
   Yes. 38 new tests, 68/68 group_values tests pass, 142/142 aggregates 
regression pass.
   
   Per-builder:
   - append / build round trip
   - `equal_to` for outer nulls, inner nulls, identical, different (and for 
List: different lengths, empty)
   - **sliced input array** (offset != 0) for each builder
   - `take_n` boundary cases: `n=0`, `n=len`, prefix containing null rows
   - `vectorized_equal_to` / `vectorized_append` match per-row reference
   - `size()` grows with appends
   - `build` on empty builder returns empty array
   
   Composition:
   - `Struct<Struct<Utf8, Int32>>` (two-level nested struct)
   - `List<List<Int32>>` (two-level nested list)
   - `LargeList<Struct<Utf8, Int32>>` end-to-end through `new_group_values`
   
   Invariants:
   - `supported_type` accepts every primitive + nested combo and rejects 
`Float16`, `Decimal256`, `Time64(Second)`, `List<Float16>`, `Struct{Float16}`, 
`FixedSizeList<Utf8>`, etc.
   - A `supported_type` <-> `make_group_column` consistency fuzz over 21 
supported + 9 unsupported types asserts the biconditional. Pins the invariant 
against future drift between the allow-list and the dispatcher.
   
   ## Are there any user-facing changes?
   
   No public API changes. The new builders are internal to `physical-plan`. 
`supported_type` becomes more permissive (returns `true` for the new types), so 
existing GROUP BYs that were previously falling back to `GroupValuesRows` will 
now use `GroupValuesColumn` automatically; this is a performance improvement, 
not a behavior change.
   
   ## Known limitations
   
   1. `FixedSizeList` is restricted to primitive child types. Composing FSL 
with `Struct` / `List` children is straightforward via the factory but not in 
this PR.
   2. `vectorized_equal_to` / `vectorized_append` for the new builders fall 
back to per-row loops. They are correctness-equivalent to the specialized 
vectorized paths used by the primitive builders, but a follow-up can add 
type-specialized batched comparators where beneficial.
   3. `Map<K, V>` not included; ordering semantics need a separate discussion.
   
   Happy to split this into 4-5 smaller PRs (factory refactor, then one PR per 
type) if maintainers prefer reviewing the staging incrementally. The single 
commit is structured so the factory and the per-type builders are mutually 
dependent and easier to read together.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to