zhuqi-lucas commented on issue #22682:
URL: https://github.com/apache/datafusion/issues/22682#issuecomment-4599071711

   POC branch with all five nested types implemented and 38 new tests is up at 
https://github.com/zhuqi-lucas/arrow-datafusion/tree/qizhu/df-nested-group-column
 (single commit, +2468 / -185 across 4 files).
   
   ## What it adds
   
   - `FixedSizeListGroupValueBuilder<T>` for `FixedSizeList<primitive>`.
   - `StructGroupValueBuilder` for `Struct<...>` with recursive child support.
   - `ListGroupValueBuilder<O>` (generic over `i32` / `i64` offsets) for 
`List<T>` and `LargeList<T>`, also with recursive child support.
   - A recursive `make_group_column(field: &Field) -> Result<Box<dyn 
GroupColumn>>` factory that replaces the inline match in 
`GroupValuesColumn::intern`. This is the structural change that lets nested 
types compose (`LargeList<Struct<Utf8, LargeUtf8>>`, `List<List<Int32>>`, etc.) 
without enumerating every combination.
   - Recursive `supported_type` that walks into List / LargeList / Struct 
children.
   
   ## Test coverage (68/68 group_values tests pass, 142/142 aggregates pass)
   
   Per-builder:
   
   - append / build round trip
   - equal_to for outer nulls, inner nulls, identical, different lists
   - **sliced input array** (offset != 0) for each builder
   - take_n boundary cases: n=0, n=len, and prefix containing null rows
   - vectorized_equal_to / vectorized_append match per-row reference
   - size() grows with appends
   - build on empty builder returns empty array
   
   Composition:
   
   - `Struct<Struct<Utf8, Int32>>` (two-level nested struct)
   - `List<List<Int32>>` (two-level nested list)
   - `LargeList<Struct<Utf8, Int32>>` end-to-end through `new_group_values`
   
   Invariants:
   
   - `supported_type` recursively accepts every primitive + nested combo and 
rejects Float16, Decimal256, Time64(Second), List<Float16>, Struct{Float16}, 
FixedSizeList<Utf8>, etc.
   - A `supported_type` <-> `make_group_column` consistency fuzz over 21 
supported + 9 unsupported types asserts the biconditional. Pins the invariant 
against future drift between the allow-list and the dispatcher.
   
   ## Production validation context
   
   The motivating workload (production materialized view with a 
`LargeList<Struct<Utf8, LargeUtf8>>` column in a 40-column GROUP BY) was 
independently verified to drop peak memory ~50% (25 GiB to 15 GiB) when the 
offending nested column is removed from the schema entirely, confirming that 
the `GroupValuesRows` fallback driven by a single unsupported column is the 
real bottleneck. This PR lets that workload keep the column **and** stay on the 
column-wise + short-circuit fast path.
   
   ## Suggested PR split
   
   The commit is one piece because the factory refactor and the per-type 
builders are mutually dependent. Happy to split into 4-5 PRs along the staging 
in the issue if maintainers prefer:
   
   1. Factory + recursive `supported_type` (no new types yet)
   2. `FixedSizeList<primitive>`
   3. `List<T>` and `LargeList<T>`
   4. `Struct<...>`
   5. End-to-end composition + consistency fuzz tests
   
   ## Caveats called out
   
   - `vectorized_*` methods use per-row loops in this POC. 
Correctness-equivalent to specialized vectorized paths; performance 
optimization can follow.
   - `FixedSizeList` is restricted to primitive child types here. Composing FSL 
with Struct/List children is straightforward via the factory but not in this 
POC.
   - `Map<K, V>` not included; ordering semantics need a separate discussion.
   
   Also relevant: I opened #22701 proposing a generic `FallbackGroupColumn` so 
any Arrow type can go through `GroupValuesColumn`, with the specializations in 
this PR layered on top as opt-in fast paths. The two issues are complementary, 
not exclusive.
   
   Happy to turn this into a real PR (or sequence of PRs) whenever maintainers 
want.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to