zhuqi-lucas opened a new pull request, #23128: URL: https://github.com/apache/datafusion/pull/23128
PR 2 of the EPIC #22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (#22751) to bring the first nested type into \`GroupValuesColumn\`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map). ## Which issue does this PR close? Part of #22682 (nested type coverage) and #22715 (EPIC). ## Rationale for this change Today a single \`FixedSizeList<primitive>\` column in a GROUP BY drags the whole grouping onto the byte-encoded \`GroupValuesRows\` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in \`GroupValuesColumn\`. With the recursive \`make_group_column\` factory + \`group_column_supported_type\` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries. ## What changes are included in this PR? - \`FixedSizeListGroupValueBuilder<T: ArrowPrimitiveType>\` in a new \`fixed_size_list\` submodule. Storage: outer null bitmap + a child \`PrimitiveGroupValueBuilder<T, true>\` that holds every element flat (length = \`outer_len * list_len\`). Element \`j\` of outer row \`i\` lives at child index \`i * list_len + j\`. - \`group_column_supported_type\` accepts \`FixedSizeList<primitive>\` for the primitive subset wired through the dispatcher (Int8..Int64, UInt8..UInt64, Float32, Float64, Date32, Date64). Composite children (Struct / List inside FSL) are deferred to a follow-up after their respective builders land. - \`make_group_column\` dispatches \`DataType::FixedSizeList(...)\` to the appropriate \`FixedSizeListGroupValueBuilder<T>\` via an \`instantiate_fsl!\` macro mirroring the existing \`instantiate_primitive!\` pattern. ## Are these changes tested? Yes. 12 new unit tests on \`FixedSizeListGroupValueBuilder\` plus extensions to the existing \`group_column_supported_type\` ⇔ \`make_group_column\` consistency fuzz. Per-builder: - append / build round trip with mixed outer-null and inner-null rows - \`equal_to\` for identical / different / outer-null / inner-null rows - \`take_n\` boundary cases (\`n=0\`, \`n=len\`, prefix containing null rows) - sliced input array (offset != 0) - \`vectorized_equal_to\` / \`vectorized_append\` match per-row reference - \`size()\` grows with appends - \`build\` on empty builder returns empty array - end-to-end dispatcher → \`GroupValuesColumn\` routing Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR. 127/127 aggregates tests pass, clippy clean, fmt clean. ## What follows in the EPIC - PR 3: \`Struct<...>\` builder + dispatcher. - PR 4: \`List<T>\` builder + dispatcher (recursive child via factory). - PR 5: \`LargeList<T>\` builder + dispatcher. - PR 6: Relax FSL child restriction to allow \`FSL<Struct>\` / \`FSL<List>\` once the prerequisite child builders are in. - PR 7: \`Map<K,V>\` (\`List<Struct<keys, values>>\` Arrow representation). ## Are there any user-facing changes? No behavior change for users whose GROUP BY did not previously contain a \`FixedSizeList<primitive>\` column. For users who did, the grouping now uses the column-native fast path instead of falling back to \`GroupValuesRows\` — same results, less memory and CPU. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
