nathanb9 opened a new pull request, #22322: URL: https://github.com/apache/datafusion/pull/22322
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Benchmark: multi-column GROUP BY performance

Adds a benchmark for `GroupValues` implementations to characterize when the vectorized per-column approach (`GroupValuesColumn`) outperforms the row-based approach (`GroupValuesRows`) and vice versa.

Background: For multi-column GROUP BY, DataFusion unconditionally uses a vectorized implementation that compares group keys column-by-column. However, when the number of distinct groups is small relative to input rows (low cardinality), the simpler row-based approach (single memcmp per row) is faster.

The vectorized path runs a multi-phase pipeline for every batch (building index vectors, calling per-column `equal_to`, checking boolean result buffers, handling remainders) regardless of whether most rows are just matching existing groups. When 99%+ of rows are hits against the same few hundred groups, that machinery adds cost without benefit. The row-based path simply does: hash → find bucket → memcmp → done.

## What changes are included in this PR?

How it works:

- Generates temporary Parquet files with Int32 columns of controlled cardinality so min/max statistics are available
- Uses `target_partitions(1)` to isolate single-threaded aggregation behavior
- Warms the OS page cache before measuring steady-state execution

Test cases:

- Fixed low group count, vary column count (2→10): shows vectorized overhead grows linearly with columns
- Fixed high group count (~1M), vary column count (2→10): confirms vectorized wins at scale
- Fixed 4 columns, vary group count (16→62B): identifies the crossover point (~10K groups)

## Are these changes tested?

- `cargo fmt --all`
- `cargo check -p datafusion --bench multi_group_by --features parquet`
- `cargo clippy --all-targets --all-features -- -D warnings`

## Are there any user-facing changes?

No. This adds a benchmark only.
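For readers unfamiliar with the two strategies, here is a minimal, std-only sketch of the ideas the description contrasts. It is not DataFusion's `GroupValuesRows` or `GroupValuesColumn` code and does not use Arrow. The names `make_columns`, `intern_rows`, and `intern_columnwise` are hypothetical, the columns are plain `Vec<i32>`, and the column-wise variant is a scalar stand-in that compares keys one column at a time rather than a batched, vectorized pipeline.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical generator: `num_cols` Int32 columns where every column cycles
/// through `cardinality` values, so the number of distinct multi-column keys
/// is min(num_rows, cardinality). Panics if `cardinality` is 0.
pub fn make_columns(num_rows: usize, num_cols: usize, cardinality: usize) -> Vec<Vec<i32>> {
    (0..num_cols)
        .map(|c| {
            (0..num_rows)
                .map(|r| (r % cardinality) as i32 + c as i32)
                .collect()
        })
        .collect()
}

/// Row-based sketch: pack each row's key into one byte buffer, then let the
/// hash map do "hash -> find bucket -> compare bytes" in a single lookup.
/// Returns the group id per row and the number of groups.
pub fn intern_rows(columns: &[Vec<i32>]) -> (Vec<usize>, usize) {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut groups: HashMap<Vec<u8>, usize> = HashMap::new();
    let mut group_ids = Vec::with_capacity(num_rows);
    let mut key: Vec<u8> = Vec::with_capacity(columns.len() * 4);
    for row in 0..num_rows {
        key.clear();
        for col in columns {
            key.extend_from_slice(&col[row].to_le_bytes());
        }
        let found = groups.get(key.as_slice()).copied();
        let id = match found {
            Some(id) => id,
            None => {
                let id = groups.len();
                groups.insert(key.clone(), id);
                id
            }
        };
        group_ids.push(id);
    }
    let num_groups = groups.len();
    (group_ids, num_groups)
}

/// Column-wise sketch: hash the row, then confirm a candidate group by
/// comparing the key one column at a time against the group's first row.
pub fn intern_columnwise(columns: &[Vec<i32>]) -> (Vec<usize>, usize) {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut first_rows: Vec<usize> = Vec::new(); // first row seen for each group
    let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
    let mut group_ids = Vec::with_capacity(num_rows);
    for row in 0..num_rows {
        let mut hasher = DefaultHasher::new();
        for col in columns {
            col[row].hash(&mut hasher);
        }
        let candidates = buckets.entry(hasher.finish()).or_default();
        let found = candidates
            .iter()
            .copied()
            .find(|&g| columns.iter().all(|col| col[first_rows[g]] == col[row]));
        let id = match found {
            Some(g) => g,
            None => {
                let g = first_rows.len();
                first_rows.push(row);
                candidates.push(g);
                g
            }
        };
        group_ids.push(id);
    }
    (group_ids, first_rows.len())
}
```

Both functions assign group ids in first-seen order, so for the same input they should produce identical id vectors. For example, `intern_rows(&make_columns(1000, 4, 16))` yields 16 groups. The sketch says nothing about relative performance, which is what the benchmark in this PR measures.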
