Dandandan opened a new pull request, #21344: URL: https://github.com/apache/datafusion/pull/21344
## Which issue does this PR close?

Related to #15961

## Rationale for this change

Profiling `SELECT COUNT(DISTINCT "UserID") FROM hits` (ClickBench) showed `GroupValuesPrimitive::intern` as a hot spot: it and `hashbrown::raw::RawTable::reserve_rehash` dominated the flamegraph.

## What changes are included in this PR?

Two optimizations for the single-column primitive GROUP BY hot path:

1. **Vectorized hashing**: Split `intern` into two phases: batch hash computation via `with_hashes` (a tight loop with better CPU pipelining), followed by hash table probing with the pre-computed hashes. The original code interleaved hash computation with hash table probing on every row, which prevented the CPU from pipelining the hash computation.
2. **Inline values in hash table**: Store the actual value in each hash table entry, `(usize, T::Native)`, instead of `(usize, u64)` with an indirect lookup into a separate `values` vec. This eliminates one cache miss per probe (no pointer chase from the hash table entry to the values array) and removes the need to store the hash: the value can be rehashed from the inline copy when needed (rare, only during table growth).

## Are these changes tested?

Existing tests cover this code path.

## Are there any user-facing changes?

No, this is a performance optimization only.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
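
For readers who want the shape of the two changes without opening the diff, here is a minimal, std-only sketch. It is not the code in this PR: `InlineGroups`, `hash_u64`, and the linear-probing table are hypothetical stand-ins for `GroupValuesPrimitive`, the real hash function, and the hashbrown table, and the key type is fixed to `u64` rather than generic over `T::Native`. It only illustrates hashing a batch up front, probing with the pre-computed hashes, and rehashing from the inline value during growth.

```rust
/// Hypothetical mixer standing in for the real hash function.
fn hash_u64(v: u64) -> u64 {
    let h = v.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^ (h >> 32)
}

/// Toy open-addressing table. Each occupied slot stores
/// (group index, value) inline, so a probe never touches a second array.
struct InlineGroups {
    slots: Vec<Option<(usize, u64)>>,
    len: usize,
    hashes: Vec<u64>, // scratch buffer reused across batches
}

impl InlineGroups {
    fn new() -> Self {
        // Slot count stays a power of two so `hash & mask` picks a slot.
        Self { slots: vec![None; 16], len: 0, hashes: Vec::new() }
    }

    /// Returns the group index assigned to each input value.
    fn intern(&mut self, values: &[u64]) -> Vec<usize> {
        // Phase 1: hash the whole batch in one tight loop.
        self.hashes.clear();
        self.hashes.extend(values.iter().map(|&v| hash_u64(v)));

        // Phase 2: probe the table using the pre-computed hashes.
        let mut groups = Vec::with_capacity(values.len());
        for i in 0..values.len() {
            // Keep the table at most half full so probing always terminates.
            if (self.len + 1) * 2 > self.slots.len() {
                self.grow();
            }
            let (value, hash) = (values[i], self.hashes[i]);
            groups.push(self.probe_insert(value, hash));
        }
        groups
    }

    fn probe_insert(&mut self, value: u64, hash: u64) -> usize {
        let mask = self.slots.len() - 1;
        let mut pos = (hash as usize) & mask;
        loop {
            match self.slots[pos] {
                // The equality check reads the inline value directly.
                Some((group, stored)) if stored == value => return group,
                Some(_) => pos = (pos + 1) & mask,
                None => {
                    let group = self.len;
                    self.slots[pos] = Some((group, value));
                    self.len += 1;
                    return group;
                }
            }
        }
    }

    /// Doubles the table. No hashes are stored, so each value is
    /// rehashed from its inline copy (the rare path).
    fn grow(&mut self) {
        let old = std::mem::take(&mut self.slots);
        self.slots = vec![None; old.len() * 2];
        let mask = self.slots.len() - 1;
        for (group, value) in old.into_iter().flatten() {
            let mut pos = (hash_u64(value) as usize) & mask;
            while self.slots[pos].is_some() {
                pos = (pos + 1) & mask;
            }
            self.slots[pos] = Some((group, value));
        }
    }
}
```

With this sketch, interning `[7, 3, 7, 9, 3]` into a fresh `InlineGroups` assigns group indices in first-seen order, giving `[0, 1, 0, 2, 1]`. The trade-off it illustrates is the one described above: slots carry the value instead of a stored hash, at the cost of rehashing during growth.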
