zhuqi-lucas commented on issue #22682: URL: https://github.com/apache/datafusion/issues/22682#issuecomment-4599071711
POC branch with all five nested types implemented and 38 new tests is up at https://github.com/zhuqi-lucas/arrow-datafusion/tree/qizhu/df-nested-group-column (single commit, +2468 / -185 across 4 files). ## What it adds - `FixedSizeListGroupValueBuilder<T>` for `FixedSizeList<primitive>`. - `StructGroupValueBuilder` for `Struct<...>` with recursive child support. - `ListGroupValueBuilder<O>` (generic over `i32` / `i64` offsets) for `List<T>` and `LargeList<T>`, also with recursive child support. - A recursive `make_group_column(field: &Field) -> Result<Box<dyn GroupColumn>>` factory that replaces the inline match in `GroupValuesColumn::intern`. This is the structural change that lets nested types compose (`LargeList<Struct<Utf8, LargeUtf8>>`, `List<List<Int32>>`, etc.) without enumerating every combination. - Recursive `supported_type` that walks into List / LargeList / Struct children. ## Test coverage (68/68 group_values tests pass, 142/142 aggregates pass) Per-builder: - append / build round trip - equal_to for outer nulls, inner nulls, identical, different lists - **sliced input array** (offset != 0) for each builder - take_n boundary cases: n=0, n=len, and prefix containing null rows - vectorized_equal_to / vectorized_append match per-row reference - size() grows with appends - build on empty builder returns empty array Composition: - `Struct<Struct<Utf8, Int32>>` (two-level nested struct) - `List<List<Int32>>` (two-level nested list) - `LargeList<Struct<Utf8, Int32>>` end-to-end through `new_group_values` Invariants: - `supported_type` recursively accepts every primitive + nested combo and rejects Float16, Decimal256, Time64(Second), List<Float16>, Struct{Float16}, FixedSizeList<Utf8>, etc. - A `supported_type` <-> `make_group_column` consistency fuzz over 21 supported + 9 unsupported types asserts the biconditional. Pins the invariant against future drift between the allow-list and the dispatcher. ## Production validation context The motivating workload (production materialized view with a `LargeList<Struct<Utf8, LargeUtf8>>` column in a 40-column GROUP BY) was independently verified to drop peak memory ~50% (25 GiB to 15 GiB) when the offending nested column is removed from the schema entirely, confirming that the `GroupValuesRows` fallback driven by a single unsupported column is the real bottleneck. This PR lets that workload keep the column **and** stay on the column-wise + short-circuit fast path. ## Suggested PR split The commit is one piece because the factory refactor and the per-type builders are mutually dependent. Happy to split into 4-5 PRs along the staging in the issue if maintainers prefer: 1. Factory + recursive `supported_type` (no new types yet) 2. `FixedSizeList<primitive>` 3. `List<T>` and `LargeList<T>` 4. `Struct<...>` 5. End-to-end composition + consistency fuzz tests ## Caveats called out - `vectorized_*` methods use per-row loops in this POC. Correctness-equivalent to specialized vectorized paths; performance optimization can follow. - `FixedSizeList` is restricted to primitive child types here. Composing FSL with Struct/List children is straightforward via the factory but not in this POC. - `Map<K, V>` not included; ordering semantics need a separate discussion. Also relevant: I opened #22701 proposing a generic `FallbackGroupColumn` so any Arrow type can go through `GroupValuesColumn`, with the specializations in this PR layered on top as opt-in fast paths. The two issues are complementary, not exclusive. Happy to turn this into a real PR (or sequence of PRs) whenever maintainers want. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
