nathanb9 opened a new pull request, #22322:
URL: https://github.com/apache/datafusion/pull/22322

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   Benchmark: multi-column GROUP BY performance
   
   Adds a benchmark for GroupValues implementations to characterize when the 
vectorized per-column approach (GroupValuesColumn) outperforms the row-based 
approach (GroupValuesRows) and vice versa.
   
   Background: For multi-column GROUP BY, DataFusion unconditionally uses a 
vectorized implementation that compares group keys column-by-column. However, 
when the number of distinct groups is small relative to input rows (low 
cardinality), the simpler row-based approach (single memcmp per row) is faster. 
The vectorized path runs a multi-phase pipeline for every batch — building 
index vectors, calling per-column equal_to, checking boolean result buffers, 
handling remainders — regardless of whether most rows are just matching 
existing groups. When 99%+ of rows are hits against the same few hundred 
groups, that machinery adds cost without benefit. The row-based path simply 
does: hash → find bucket → memcmp → done.
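
To make the contrast above concrete, here is a minimal std-only sketch of the two comparison styles. It is not DataFusion's implementation: `encode_row`, `group_ids_row_based`, and `equal_to_columnwise` are hypothetical names, the fixed-width little-endian row encoding is an assumption, and a plain `HashMap` stands in for the real hash table.

```rust
use std::collections::HashMap;

/// Encode one row of Int32 key columns into a single byte row
/// (assumed encoding: fixed-width little-endian, columns concatenated).
fn encode_row(columns: &[Vec<i32>], row: usize) -> Vec<u8> {
    columns.iter().flat_map(|c| c[row].to_le_bytes()).collect()
}

/// Row-based style: hash the encoded row, find the bucket, compare bytes.
/// Returns one group id per input row, assigned in first-seen order.
fn group_ids_row_based(rows: &[Vec<u8>]) -> Vec<usize> {
    let mut groups: HashMap<&[u8], usize> = HashMap::new();
    let mut ids = Vec::with_capacity(rows.len());
    for row in rows {
        let next_id = groups.len();
        // Slice equality inside the map is the "single memcmp per row".
        let id = *groups.entry(row.as_slice()).or_insert(next_id);
        ids.push(id);
    }
    ids
}

/// Column-wise style: for input row i, check it against candidate group
/// candidates[i] one column at a time, narrowing a boolean mask.
fn equal_to_columnwise(
    input: &[Vec<i32>],
    group_keys: &[Vec<i32>],
    candidates: &[usize],
) -> Vec<bool> {
    let mut mask = vec![true; candidates.len()];
    for (in_col, key_col) in input.iter().zip(group_keys) {
        for (i, &g) in candidates.iter().enumerate() {
            mask[i] &= in_col[i] == key_col[g];
        }
    }
    mask
}
```

In the column-wise version, every key column adds another full pass over the batch and the mask, even when nearly every row matches an existing group. That is the per-column overhead the benchmark is meant to measure.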
   
   ## What changes are included in this PR?
   
   How it works:
   - Generates temporary Parquet files with Int32 columns of controlled 
cardinality so min/max statistics are available
   - Uses target_partitions(1) to isolate single-threaded aggregation behavior
   - Warms the OS page cache before measuring steady-state execution
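
One way to produce key columns with a controlled number of distinct groups is sketched below. This is an assumption about the approach, not the benchmark's actual generator: `gen_key_columns` and `count_groups` are hypothetical helpers, and the mixed-radix scheme (each column holds one "digit" of a cycling group id) is just one option. Writing the columns to Parquet is left out.

```rust
use std::collections::HashSet;

/// Build `num_cols` Int32 key columns of `num_rows` rows whose distinct
/// key tuples number exactly `num_groups` (when num_rows >= num_groups).
/// Assumes num_cols >= 1, num_groups >= 1, and values that fit in i32.
fn gen_key_columns(num_rows: usize, num_cols: usize, num_groups: usize) -> Vec<Vec<i32>> {
    assert!(num_cols >= 1 && num_groups >= 1);
    // Per-column cardinality: smallest c with c^num_cols >= num_groups.
    let mut card = 1usize;
    while card.pow(num_cols as u32) < num_groups {
        card += 1;
    }
    (0..num_cols)
        .map(|j| {
            (0..num_rows)
                .map(|i| {
                    // Row i belongs to group (i % num_groups); column j
                    // stores digit j of that group id in base `card`.
                    let gid = i % num_groups;
                    ((gid / card.pow(j as u32)) % card) as i32
                })
                .collect()
        })
        .collect()
}

/// Count distinct key tuples across all columns (for sanity checks only).
fn count_groups(columns: &[Vec<i32>]) -> usize {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let distinct: HashSet<Vec<i32>> = (0..num_rows)
        .map(|r| columns.iter().map(|c| c[r]).collect())
        .collect();
    distinct.len()
}
```

Because each column's values stay within `0..card`, every column has a narrow, known value range, which fits the stated goal of having meaningful min/max statistics.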
   
   Test cases:
   - Fixed low group count, vary column count (2→10) — shows vectorized 
overhead grows linearly with columns
   - Fixed high group count (~1M), vary column count (2→10) — confirms 
vectorized wins at scale
   - Fixed 4 columns, vary group count (16→62B) — identifies the crossover 
point (~10K groups)
   
   ## Are these changes tested?
   
   - cargo fmt --all
   - cargo check -p datafusion --bench multi_group_by --features parquet
   - cargo clippy --all-targets --all-features -- -D warnings
   
   ## Are there any user-facing changes?
   
   No. This adds a benchmark only.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

