Dandandan opened a new pull request, #21604: URL: https://github.com/apache/datafusion/pull/21604
## Which issue does this PR close?

N/A - Performance optimization

## Rationale for this change

Profiling ClickBench query 5 (`SELECT COUNT(DISTINCT "SearchPhrase") FROM hits`) shows ~29% of time in `GroupValuesBytesView` (hash table probing) and ~13% in `create_hashes`. In this workload ~90% of `SearchPhrase` values are empty strings, meaning most hash table probes find an already-existing entry for the same value as the previous row.

## What changes are included in this PR?

Adds a single-entry "last value" cache in `ArrowBytesViewMap::insert_if_new_inner`. Before probing the hash table, the loop checks whether the current `view_u128` matches the previous row's view. If so, it reuses the cached payload and skips the hash table `find()` entirely.

This is correct for all string lengths:

- **Inline strings (≤12 bytes):** the u128 view deterministically encodes the complete value
- **Non-inline strings (>12 bytes):** within the same input array, matching views means identical `buffer_index + offset + length`, so they reference the exact same bytes

The cost is one `u128` comparison per row (~1 cycle, register/L1). The saving is the hash table `find()` (random memory access pattern) for every consecutive duplicate.

## Are these changes tested?

Existing tests in `binary_view_map::tests` pass (8/8). The optimization is transparent: same semantics, same output, just fewer hash table probes.

## Are there any user-facing changes?

No. This is a performance improvement only.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
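For illustration, the "last value" cache described above can be sketched roughly as follows. This is a simplified model, not the code in the PR: `ViewDedup`, `inline_view`, `insert_batch`, and the `probes` counter are hypothetical names, a `std::collections::HashMap` keyed by the raw view stands in for the real hash table, and only inline strings (≤12 bytes) are modeled so that view equality and value equality coincide.

```rust
use std::collections::HashMap;

/// Hypothetical helper: pack a short string (<= 12 bytes) into a 128-bit
/// view, with the length in the first 4 bytes and the data in the rest.
fn inline_view(s: &[u8]) -> u128 {
    assert!(s.len() <= 12, "only inline strings are modeled in this sketch");
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s);
    u128::from_le_bytes(bytes)
}

/// Hypothetical stand-in for the view map: view -> payload (group index).
struct ViewDedup {
    table: HashMap<u128, usize>,
    /// Single-entry cache holding the previous row's view and payload.
    last: Option<(u128, usize)>,
    /// Number of hash table lookups, tracked only to show the saving.
    probes: usize,
}

impl ViewDedup {
    fn new() -> Self {
        Self { table: HashMap::new(), last: None, probes: 0 }
    }

    /// Returns one payload per input view.
    fn insert_batch(&mut self, views: &[u128]) -> Vec<usize> {
        // Reset per batch: for non-inline strings, view equality is only
        // meaningful within a single input array.
        self.last = None;
        let mut out = Vec::with_capacity(views.len());
        for &view in views {
            // Fast path: one u128 comparison against the previous row.
            if let Some((prev_view, payload)) = self.last {
                if prev_view == view {
                    out.push(payload);
                    continue;
                }
            }
            // Slow path: probe the table, inserting a new payload if absent.
            self.probes += 1;
            let next = self.table.len();
            let payload = *self.table.entry(view).or_insert(next);
            self.last = Some((view, payload));
            out.push(payload);
        }
        out
    }
}
```

Feeding this a batch such as `["", "", "", "a", "", ""]` takes the slow path only when the value changes from the previous row, which is the effect the PR targets for the mostly-empty `SearchPhrase` column.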
