zhuqi-lucas opened a new issue, #22682: URL: https://github.com/apache/datafusion/issues/22682
### Is your feature request related to a problem or challenge? When a `GROUP BY` includes any column with a nested type (`List`, `LargeList`, `FixedSizeList`, `Struct`, `Map`), `multi_group_by::supported_type` returns false for that column and the whole grouping falls back to the byte-encoded `GroupValuesRows` path via `arrow_row::Rows`, even if every other column would be supported by `GroupValuesColumn`. Today's `supported_type` ([source](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs#L1213-L1239)): ```rust fn supported_type(data_type: &DataType) -> bool { matches!(*data_type, DataType::Int8 | Int16 | ... | Float64 | Decimal128(_, _) | Utf8 | LargeUtf8 | Utf8View | Binary | LargeBinary | BinaryView | Date32 | Date64 | Time32(_) | Timestamp(_, _) | Boolean ) } ``` ### Impact A wide multi-column `GROUP BY` where most columns are individually supported by `GroupValuesColumn` (Utf8 / Date / Boolean / Float) but one column is a nested type (e.g. `LargeList<Struct<...>>`) is forced onto the slow `arrow_row::Rows` path. Group key memory grows materially compared to the equivalent column-format storage, and per-group comparisons become byte-by-byte rather than vectorized. ### Describe the solution you'd like Add `GroupColumn` implementations for the missing nested types and extend `supported_type` accordingly. Equality is well-defined by Arrow's `make_equal`-style comparators; hashing of nested values has existing arrow utilities. Suggested incremental order, easiest first: 1. `FixedSizeList<T>` where `T` is a `GroupColumn`-supported primitive (fixed-width per row, simplest) 2. `List<T>` and `LargeList<T>` with primitive child (variable-length, single offset buffer) 3. `Struct<...>` with all fields being `GroupColumn`-supported (composition of existing columns under a single null mask) 4. `Map<K, V>` (most complex due to entry ordering semantics) Each can be its own PR. The existing `arrow_row::Rows` path remains the fallback for any nested type not yet supported, so this is purely an optimization. ### Describe alternatives you've considered - **arrow_row fallback (current state)**: correct, but significantly more memory and CPU than `GroupValuesColumn` when the grouping mostly consists of cheap columns and the nested column is one of many. - **Avoid grouping by the nested column at the SQL level**: works case-by-case (e.g. wrap in `ANY_VALUE`) but does not help when the nested column is the legitimate group key. ### Additional context Issue #22526 is a separate, related pathology on the emit side. This issue is about the input/intern side (which `GroupValues` impl is selected). The two are independent and can land in either order. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
