Dandandan opened a new pull request, #20958:
URL: https://github.com/apache/datafusion/pull/20958

   ## Summary
   - Replace `ahash` with `foldhash` in `datafusion-common` hashing (`with_hashes`/`create_hashes`)
   - Use `SeedableRandomState` for rehash paths: fold the existing hash into the hasher's initial state, eliminating the separate `combine_hashes` step
   - Add a `hash_write` method to the `HashValue` trait for writing values into an existing hasher
   - Use the `valid_indices()` iterator for null paths instead of per-element `is_null()` checks
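
A minimal sketch of the rehash idea described in the bullets above, using only the standard library. `SeededHasher`, `HashWrite`, and `rehash_column` are hypothetical names for illustration; the toy FNV-1a-style mixing stands in for foldhash and is not its algorithm, and leaving the hash of a null row untouched is an assumption about the null path rather than something this description states.

```rust
use std::hash::Hasher;

/// Hypothetical toy hasher whose initial state is derived from a seed.
/// It stands in for a seedable hasher; this is NOT foldhash's algorithm.
struct SeededHasher {
    state: u64,
}

impl SeededHasher {
    fn with_seed(seed: u64) -> Self {
        // Fold the seed (here: the row's existing hash) into the initial state.
        Self { state: seed ^ 0xcbf2_9ce4_8422_2325 }
    }
}

impl Hasher for SeededHasher {
    fn write(&mut self, bytes: &[u8]) {
        // FNV-1a style mixing: xor each byte in, then multiply by an odd prime.
        for &b in bytes {
            self.state ^= u64::from(b);
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// Hypothetical analogue of a `hash_write` method: write a value into an
/// existing hasher instead of producing a standalone hash.
trait HashWrite {
    fn hash_write<H: Hasher>(&self, hasher: &mut H);
}

impl HashWrite for i64 {
    fn hash_write<H: Hasher>(&self, hasher: &mut H) {
        hasher.write_i64(*self);
    }
}

/// Rehash one column into `hashes`. For each non-null row, the hasher is
/// seeded with the row's existing hash, so no separate combine step is needed.
/// Assumption: null rows are skipped and keep their previous hash.
fn rehash_column(values: &[Option<i64>], hashes: &mut [u64]) {
    // Visit only the valid (non-null) indices rather than branching per element
    // inside the hashing loop.
    let valid = values
        .iter()
        .enumerate()
        .filter_map(|(i, v)| v.map(|v| (i, v)));

    for (i, v) in valid {
        let mut hasher = SeededHasher::with_seed(hashes[i]);
        v.hash_write(&mut hasher);
        hashes[i] = hasher.finish();
    }
}

fn main() {
    let col_a = [Some(1i64), None, Some(3)];
    let col_b = [Some(10i64), Some(20), None];

    let mut hashes = vec![0u64; 3];
    rehash_column(&col_a, &mut hashes); // first column
    rehash_column(&col_b, &mut hashes); // second column folds in the first
    println!("{:?}", hashes);
}
```

Seeding the hasher with the previous hash means each additional column costs one hasher construction and one write per row, rather than a hash followed by a separate combine pass.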
   
   ## Benchmark results (int64, 8192 rows, Apple M1)
   
   | Benchmark | Before (ahash) | After (foldhash) | Improvement |
   |---|---|---|---|
   | single array, no nulls | 5.65 µs | 3.30 µs | **-42%** |
   | multiple arrays, no nulls | 22.15 µs | 11.19 µs | **-49%** |
   | single array, nulls | 11.94 µs | 9.47 µs | **-21%** |
   | multiple arrays, nulls | 36.92 µs | 29.80 µs | **-19%** |
   
   String view improvements (utf8_view, 8192 rows):
   
   | Benchmark | Improvement |
   |---|---|
   | single, no nulls | **-13%** |
   | multiple, no nulls | **-28%** |
   | small strings, single | **-55%** |
   | small strings, multiple | **-60%** |
   
   ## Test plan
   - [x] All 36 `hash_utils` unit tests pass
   - [ ] Run full CI suite
   - [ ] Verify downstream crates still compile (they use `ahash::RandomState` directly; this PR only changes `datafusion-common`, so further migration is needed)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

