zhuqi-lucas opened a new issue, #22682:
URL: https://github.com/apache/datafusion/issues/22682

   ### Is your feature request related to a problem or challenge?
   
   When a `GROUP BY` includes any column with a nested type (`List`, 
`LargeList`, `FixedSizeList`, `Struct`, `Map`), 
`multi_group_by::supported_type` returns false for that column and the whole 
grouping falls back to the byte-encoded `GroupValuesRows` path via 
`arrow_row::Rows`, even if every other column would be supported by 
`GroupValuesColumn`. Today's `supported_type` 
([source](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs#L1213-L1239)):
   
   ```rust
   fn supported_type(data_type: &DataType) -> bool {
       matches!(*data_type,
           DataType::Int8 | Int16 | ... | Float64
           | Decimal128(_, _)
           | Utf8 | LargeUtf8 | Utf8View
           | Binary | LargeBinary | BinaryView
           | Date32 | Date64 | Time32(_) | Timestamp(_, _)
           | Boolean
       )
   }
   ```
   
   ### Impact
   
   A wide multi-column `GROUP BY` where most columns are individually supported 
by `GroupValuesColumn` (Utf8 / Date / Boolean / Float) but one column is a 
nested type (e.g. `LargeList<Struct<...>>`) is forced onto the slow 
`arrow_row::Rows` path. Group key memory grows materially compared to the 
equivalent column-format storage, and per-group comparisons become byte-by-byte 
rather than vectorized.
   
   ### Describe the solution you'd like
   
   Add `GroupColumn` implementations for the missing nested types and extend 
`supported_type` accordingly. Equality is well-defined by Arrow's 
`make_equal`-style comparators; hashing of nested values has existing arrow 
utilities.
   
   Suggested incremental order, easiest first:
   
   1. `FixedSizeList<T>` where `T` is a `GroupColumn`-supported primitive 
(fixed-width per row, simplest)
   2. `List<T>` and `LargeList<T>` with primitive child (variable-length, 
single offset buffer)
   3. `Struct<...>` with all fields being `GroupColumn`-supported (composition 
of existing columns under a single null mask)
   4. `Map<K, V>` (most complex due to entry ordering semantics)
   
   Each can be its own PR. The existing `arrow_row::Rows` path remains the 
fallback for any nested type not yet supported, so this is purely an 
optimization.
   
   ### Describe alternatives you've considered
   
   - **arrow_row fallback (current state)**: correct, but significantly more 
memory and CPU than `GroupValuesColumn` when the grouping mostly consists of 
cheap columns and the nested column is one of many.
   - **Avoid grouping by the nested column at the SQL level**: works 
case-by-case (e.g. wrap in `ANY_VALUE`) but does not help when the nested 
column is the legitimate group key.
   
   ### Additional context
   
   Issue #22526 is a separate, related pathology on the emit side. This issue 
is about the input/intern side (which `GroupValues` impl is selected). The two 
are independent and can land in either order.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to