Dandandan opened a new pull request, #21347: URL: https://github.com/apache/datafusion/pull/21347
## Which issue does this PR close?

N/A - performance optimization

## Rationale for this change

Profiling ClickBench queries showed `__bzero` (from `buffer.resize(n, 0)`) as ~1-2% of CPU time in `with_hashes` / `create_hashes`. The zero-fill is unnecessary when hash functions write all positions including nulls.

## What changes are included in this PR?

Hash functions now write all buffer positions when `rehash=false` (first column), using a consistent null sentinel hash (`random_state.hash_one(1u8)`) for null positions. This allows `with_hashes` to skip the zero-fill entirely.

- `with_hashes`: use `set_len` instead of `resize(n, 0)`, which avoids the memset
- `hash_array_primitive` / `hash_array`: fill with null sentinel, then overwrite valid positions via `valid_indices()`
- `hash_string_view_array_inner`: write null sentinel instead of `continue` for null positions
- `hash_dictionary_inner`: write null sentinel for null keys/values
- `hash_run_array_inner`: fill null run ranges with sentinel
- `create_hashes`: zero-fill only for complex types (struct, list, map, union) whose hash functions always combine with existing values

Benchmark results (`with_hashes` bench, int64 single column):

- No nulls: 3.3µs → 2.8µs (**~17% faster**)
- With nulls: ~9.6µs (unchanged, since fill + iterate costs the same as resize + iterate)

## Are these changes tested?

Existing tests updated to expect non-zero null sentinel hash values.

## Are there any user-facing changes?

No. Null positions now get a consistent non-zero hash instead of 0, but this is an internal implementation detail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
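For readers who want to see the idea behind the PR description in isolation, here is a minimal sketch of the "fill with a null sentinel, then overwrite valid positions" pattern. It is not the DataFusion implementation. The function names `hash_first_column` and `hash_first_column_no_nulls` are hypothetical, `Option<i64>` stands in for an Arrow array with a null bitmap, and the standard library's `RandomState` is used in place of whatever hasher the real code uses. It also deliberately swaps the PR's `set_len` approach for safe code (`resize` with the sentinel, or `extend`), which still avoids a separate zero-fill pass.

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

/// Hypothetical helper: hash the first column into `buffer`, writing every
/// position so that no prior zero-fill of the buffer is required.
fn hash_first_column(values: &[Option<i64>], state: &RandomState, buffer: &mut Vec<u64>) {
    // Consistent sentinel for nulls, mirroring `random_state.hash_one(1u8)`
    // from the PR description.
    let null_hash = state.hash_one(1u8);

    // Fill with the sentinel instead of zero (assumption: a safe stand-in
    // for the `set_len` call described in the PR).
    buffer.clear();
    buffer.resize(values.len(), null_hash);

    // Overwrite only the valid (non-null) positions.
    for (slot, value) in buffer.iter_mut().zip(values) {
        if let Some(v) = value {
            *slot = state.hash_one(*v);
        }
    }
}

/// Hypothetical helper for the no-nulls case: each slot is written exactly
/// once, with no fill pass at all.
fn hash_first_column_no_nulls(values: &[i64], state: &RandomState, buffer: &mut Vec<u64>) {
    buffer.clear();
    buffer.extend(values.iter().map(|v| state.hash_one(*v)));
}
```

Calling `hash_first_column(&[Some(1), None, Some(3)], &state, &mut buf)` leaves `buf` with three entries, the middle one equal to `state.hash_one(1u8)`. Under these assumptions the nullable path still makes two passes over the buffer, which is consistent with the "With nulls" benchmark being unchanged.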
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
