zhuqi-lucas opened a new pull request, #22706:
URL: https://github.com/apache/datafusion/pull/22706
## Which issue does this PR close?
Closes #22682.
## Rationale for this change
Today `multi_group_by::supported_schema` is all-or-nothing: if any single
GROUP BY column is not in the `supported_type` allow-list, the whole grouping
falls back to the byte-encoded `GroupValuesRows` path, even when every other
column would have qualified for the fast column-wise + cross-column
short-circuit path. The most common offender in real workloads is a single
nested column (e.g. `LargeList<Struct<Utf8, LargeUtf8>>`) that drags an
otherwise primitive-heavy 30+ column GROUP BY onto the slow path.
This PR adds column-native `GroupColumn` implementations for
`FixedSizeList<primitive>`, `Struct<...>`, `List<T>`, and `LargeList<T>`, so a
GROUP BY containing any of these no longer falls back from `GroupValuesColumn`
to `GroupValuesRows`.
## What changes are included in this PR?
- `FixedSizeListGroupValueBuilder<T>` for `FixedSizeList<primitive>`. Uses
an inner `PrimitiveGroupValueBuilder` for the flat child storage.
- `StructGroupValueBuilder` for `Struct<...>` with recursive child support.
- `ListGroupValueBuilder<O: OffsetSizeTrait>` for `List<T>` (O=i32) and
`LargeList<T>` (O=i64) with recursive child support. Null outer rows occupy a
zero-length range so the offset buffer stays monotonic without consuming child
slots.
- A recursive `make_group_column(field: &Field) -> Result<Box<dyn
GroupColumn>>` factory in `multi_group_by/mod.rs` that replaces the inline
match in `GroupValuesColumn::intern`. Walks nested fields and recursively
builds children for `Struct` / `List` / `LargeList`.
- `supported_type` recursively checks children for `List` / `LargeList` /
`Struct`. `FixedSizeList` is restricted to primitive children in this PR;
composing FSL with nested children can follow in a separate PR.
## Are these changes tested?
Yes. 38 new tests, 68/68 group_values tests pass, 142/142 aggregates
regression pass.
Per-builder:
- append / build round trip
- `equal_to` for outer nulls, inner nulls, identical, different (and for
List: different lengths, empty)
- **sliced input array** (offset != 0) for each builder
- `take_n` boundary cases: `n=0`, `n=len`, prefix containing null rows
- `vectorized_equal_to` / `vectorized_append` match per-row reference
- `size()` grows with appends
- `build` on empty builder returns empty array
Composition:
- `Struct<Struct<Utf8, Int32>>` (two-level nested struct)
- `List<List<Int32>>` (two-level nested list)
- `LargeList<Struct<Utf8, Int32>>` end-to-end through `new_group_values`
Invariants:
- `supported_type` accepts every primitive + nested combo and rejects
`Float16`, `Decimal256`, `Time64(Second)`, `List<Float16>`, `Struct{Float16}`,
`FixedSizeList<Utf8>`, etc.
- A `supported_type` <-> `make_group_column` consistency fuzz over 21
supported + 9 unsupported types asserts the biconditional. Pins the invariant
against future drift between the allow-list and the dispatcher.
## Are there any user-facing changes?
No public API changes. The new builders are internal to `physical-plan`.
`supported_type` becomes more permissive (returns `true` for the new types), so
existing GROUP BYs that were previously falling back to `GroupValuesRows` will
now use `GroupValuesColumn` automatically; this is a performance improvement,
not a behavior change.
## Known limitations
1. `FixedSizeList` is restricted to primitive child types. Composing FSL
with `Struct` / `List` children is straightforward via the factory but not in
this PR.
2. `vectorized_equal_to` / `vectorized_append` for the new builders fall
back to per-row loops. They are correctness-equivalent to the specialized
vectorized paths used by the primitive builders, but a follow-up can add
type-specialized batched comparators where beneficial.
3. `Map<K, V>` not included; ordering semantics need a separate discussion.
Happy to split this into 4-5 smaller PRs (factory refactor, then one PR per
type) if maintainers prefer reviewing the staging incrementally. The single
commit is structured so the factory and the per-type builders are mutually
dependent and easier to read together.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]