nathanb9 opened a new pull request, #22479:
URL: https://github.com/apache/datafusion/pull/22479

   ## Summary
   
   - Adds `GroupValuesFlatPrimitive`, a direct-indexing `GroupValues` 
implementation for integer GROUP BY columns with bounded value ranges (no 
hashing, O(1) array lookup via `value - min`)
   - Adds a benchmark (`single_group_by_primitive`) comparing hash-based 
`GroupValuesPrimitive` vs `GroupValuesFlatPrimitive`
   - Makes `single_group_by` and `row` modules public for benchmark access
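
For readers unfamiliar with direct indexing, the sketch below shows the `value - min` lookup described above. It is a minimal, std-only illustration under stated assumptions: `FlatGroupIndex`, its fields, and its method signatures are hypothetical names, not the PR's `GroupValuesFlatPrimitive` or DataFusion's `GroupValues` trait, and it assumes the inclusive `[min, max]` range is known before any value is interned.

```rust
/// Hypothetical illustration of direct indexing; not the PR's actual type.
pub struct FlatGroupIndex {
    /// Smallest value the index can hold; subtracted to get a slot offset.
    min: i64,
    /// `slots[value - min]` holds the group id assigned to `value`, if any.
    slots: Vec<Option<usize>>,
    /// Distinct values in first-seen order; the position is the group id.
    values: Vec<i64>,
}

impl FlatGroupIndex {
    /// Assumes the inclusive range `[min, max]` is known up front.
    pub fn new(min: i64, max: i64) -> Self {
        let span = max.checked_sub(min).expect("range overflows i64");
        assert!(span >= 0, "min must not exceed max");
        Self {
            min,
            slots: vec![None; span as usize + 1],
            values: Vec::new(),
        }
    }

    /// Returns one group id per input value. No hashing: each lookup is a
    /// single array access at offset `value - min`.
    pub fn intern(&mut self, input: &[i64]) -> Vec<usize> {
        let mut group_ids = Vec::with_capacity(input.len());
        for &value in input {
            let offset = value.checked_sub(self.min).expect("offset overflows i64");
            assert!(
                offset >= 0 && (offset as usize) < self.slots.len(),
                "value outside [min, max]"
            );
            let slot = &mut self.slots[offset as usize];
            let id = match *slot {
                Some(id) => id,
                None => {
                    // First time this value is seen: allocate the next group id.
                    let id = self.values.len();
                    *slot = Some(id);
                    self.values.push(value);
                    id
                }
            };
            group_ids.push(id);
        }
        group_ids
    }

    /// Distinct values in group-id order.
    pub fn values(&self) -> &[i64] {
        &self.values
    }
}
```

The trade-off in this sketch is memory for speed: the slot array is sized by the value range rather than by the number of groups, so it only pays off when the range is bounded.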
   
   ### Benchmark design
   
   Uses the same `iter_batched_ref` methodology as the multi-column GROUP BY 
benchmark in #22322:
   - Construction is in setup (not timed)
   - Only `intern()` calls are measured
   - `black_box` prevents dead-code elimination
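
The three points above can be sketched with the standard library alone. This is not Criterion's `iter_batched_ref` and not the PR's benchmark code; `time_routine` is a hypothetical helper that only mirrors the same shape: fresh state is built outside the timed region, only the routine is measured, and `std::hint::black_box` keeps the optimizer from discarding the work.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `setup` untimed, then times only `routine`
/// on the freshly built state, once per iteration.
pub fn time_routine<S, R>(
    iters: u32,
    mut setup: impl FnMut() -> S,
    mut routine: impl FnMut(&mut S) -> R,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        // Construction happens here and is excluded from the measurement.
        let mut state = setup();
        let start = Instant::now();
        // Only this call is measured; black_box keeps input and output
        // observable so the call cannot be optimized away.
        black_box(routine(black_box(&mut state)));
        total += start.elapsed();
    }
    total
}
```

In a real benchmark the setup closure would build the group-values structure and the routine would call `intern()` on it; a benchmarking framework additionally handles warm-up and statistics, which this sketch omits.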
   
   Three experiments:
   1. **Group count sweep** (10 to 100,000 groups, 1M rows): measures scaling 
with cardinality
   2. **Density sweep** (10K groups, 10%–100% density): measures the impact of 
flat array sparsity
   3. **Row count scaling** (10K groups, 1M–10M rows): measures how per-row 
cost adds up as the input grows
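
To make the density sweep concrete, the arithmetic below estimates how many slots a flat array needs for a given group count and density. It assumes density means `groups / (max - min + 1)`; the PR description does not spell out its definition, and `flat_slots` is a hypothetical helper rather than anything in the benchmark.

```rust
/// Hypothetical helper: slots a flat array needs to hold `groups` distinct
/// values when they occupy `density_pct` percent of the value range.
/// Assumes density = groups / (max - min + 1).
pub fn flat_slots(groups: usize, density_pct: usize) -> usize {
    assert!((1..=100).contains(&density_pct), "density must be 1..=100");
    // Ceiling division: the range must be large enough to fit every group.
    (groups * 100 + density_pct - 1) / density_pct
}
```

Under that assumption, 10K groups at 100% density need 10,000 slots, while the same 10K groups at 10% density need 100,000 slots, so the sparsest case in the sweep touches an array ten times larger.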
   
   ### Local results (Apple Silicon, release mode)
   
   | Groups | Hash (`GroupValuesPrimitive`) | Flat (`GroupValuesFlatPrimitive`) | Speedup |
   |--------|------|------|---------|
   | 10 | 1.41ms | 0.96ms | **1.47x** |
   | 100 | 1.38ms | 0.97ms | **1.43x** |
   | 1,000 | 1.52ms | 0.98ms | **1.55x** |
   | 10,000 | 2.31ms | 0.97ms | **2.37x** |
   | 100,000 | 4.53ms | 1.52ms | **2.99x** |
   
   ## Related
   
   - Closes the benchmarking aspect of 
https://github.com/apache/datafusion/issues/19938
   - Inspired by `ArrayMap` in `joins/array_map.rs` (perfect hash join, PR 
#19411)
   - Complements multi-column benchmark in #22322
   
   ## Test plan
   
   - [x] `cargo test -p datafusion-physical-plan --lib flat_primitive` — 5 unit 
tests pass
   - [x] `cargo clippy -p datafusion-physical-plan --benches -- -D warnings` — 
clean
   - [x] `cargo bench -p datafusion-physical-plan --bench 
single_group_by_primitive` — runs successfully
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

