nathanb9 opened a new pull request, #22481: URL: https://github.com/apache/datafusion/pull/22481
## Summary

- Adds `GroupValuesFlatPrimitive`, a direct-indexing `GroupValues` implementation for integer GROUP BY columns with bounded value ranges (no hashing, O(1) array lookup via `value - min`)
- Adds a benchmark (`single_group_by_primitive`) comparing hash-based `GroupValuesPrimitive` vs `GroupValuesFlatPrimitive`
- Makes `single_group_by` and `primitive` modules public for benchmark access

Supersedes #22479 (rebased on clean main).

### Benchmark design

Uses the same `iter_batched_ref` methodology as the multi-column GROUP BY benchmark in #22322:

- Construction is in setup (not timed)
- Only `intern()` calls are measured
- `black_box` prevents dead-code elimination

Three experiments:

1. **Group count sweep** (10–100K groups, 1M rows): measures scaling with cardinality
2. **Density sweep** (10K groups, 10%–100% density): measures flat array sparsity impact
3. **Row count scaling** (10K groups, 1M–10M rows): measures per-row cost compounding

### Local results (Apple Silicon, release mode, `iter_batched_ref`)

| Groups | Hash | Flat | Speedup |
|--------|------|------|---------|
| 10 | 1.41ms | 0.96ms | **1.47x** |
| 100 | 1.38ms | 0.97ms | **1.43x** |
| 1,000 | 1.52ms | 0.98ms | **1.55x** |
| 10,000 | 2.31ms | 0.97ms | **2.37x** |
| 100,000 | 4.53ms | 1.52ms | **2.99x** |

## Related

- Part of https://github.com/apache/datafusion/issues/19938
- Inspired by `ArrayMap` in `joins/array_map.rs` (perfect hash join, PR #19411)
- Complements multi-column benchmark in #22322

## Test plan

- [x] `cargo test -p datafusion-physical-plan --lib flat_primitive`: 5 unit tests pass
- [x] `cargo clippy -p datafusion-physical-plan --benches -- -D warnings`: clean
- [x] `cargo bench -p datafusion-physical-plan --bench single_group_by_primitive`: runs successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)
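For readers unfamiliar with the direct-indexing idea in the Summary, here is a minimal standalone sketch of what a `value - min` lookup could look like. It is not the code in this PR: `FlatGroups`, its fields, and the `EMPTY` sentinel are hypothetical names, it works on plain `i64` slices rather than Arrow arrays, it ignores nulls, and it assumes the caller already knows inclusive bounds for the column.

```rust
/// Hypothetical sketch of direct-index group interning (not the PR's implementation).
/// Sentinel marking a slot that has not been assigned a group id yet.
const EMPTY: u32 = u32::MAX;

struct FlatGroups {
    /// Smallest value the column may contain (assumed known up front).
    min: i64,
    /// slots[value - min] holds the group id for `value`, or EMPTY.
    slots: Vec<u32>,
    /// Group id -> original value, in first-seen order.
    values: Vec<i64>,
}

impl FlatGroups {
    /// Allocates one slot per possible value in `min..=max`.
    /// Assumes `max - min` is small enough to allocate and does not overflow.
    fn new(min: i64, max: i64) -> Self {
        assert!(max >= min, "empty value range");
        let range = (max - min) as usize + 1;
        Self { min, slots: vec![EMPTY; range], values: Vec::new() }
    }

    /// Writes the group id of every input value into `groups`.
    /// No hashing: each lookup is a single array index.
    /// Panics on an out-of-bounds index if a value lies outside `min..=max`.
    fn intern(&mut self, input: &[i64], groups: &mut Vec<usize>) {
        groups.clear();
        for &v in input {
            let slot = (v - self.min) as usize;
            let mut id = self.slots[slot];
            if id == EMPTY {
                // First time this value is seen: assign the next group id.
                id = self.values.len() as u32;
                self.slots[slot] = id;
                self.values.push(v);
            }
            groups.push(id as usize);
        }
    }
}

fn main() {
    let mut g = FlatGroups::new(100, 109);
    let mut ids = Vec::new();
    g.intern(&[105, 100, 105, 109, 100], &mut ids);
    println!("{:?}", ids); // [0, 1, 0, 2, 1]
    println!("{:?}", g.values); // [105, 100, 109]
}
```

The trade-off in this sketch is memory for speed: the slot array is sized by the value range rather than by the number of distinct groups, which is presumably why the density sweep above looks at sparsely populated ranges.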
