Dandandan opened a new pull request, #21348:
URL: https://github.com/apache/datafusion/pull/21348

   ## Which issue does this PR close?
   
   N/A - performance optimization
   
   ## Rationale for this change
   
   Profiling `SELECT COUNT(DISTINCT "SearchPhrase") FROM hits` (ClickBench) 
showed `ArrowBytesViewMap::insert_if_new_inner` as a hot spot, with 
`_platform_memcmp` at 5% and `append_value` at 7% of CPU time.
   
   ## What changes are included in this PR?
   
   Three optimizations for the BytesView hash map hot path:
   
   1. **Direct value bytes access**: Replace `values.value(i).as_ref()`, which 
goes through the `GenericByteViewArray::value()` accessor (bounds check, view 
decode, buffer lookup), with direct pointer arithmetic on `input_views` + 
`input_buffers`. This avoids the accessor overhead on every hash table probe 
for strings longer than 12 bytes.
   
   2. **Skip append for inline strings**: For strings ≤12 bytes, the input view 
is self-contained (length + data encoded in the u128). Instead of decoding to 
`&[u8]` and re-encoding via `append_value` → `make_view`, push the input view 
directly. This avoids a decode-encode round trip for the most common case 
(empty/short strings).
   
   3. **Simplify `make_payload_fn`**: Change signature from 
`FnMut(Option<&[u8]>) -> V` to `FnMut() -> V` since no caller uses the value 
bytes parameter. This eliminates unnecessary value decoding on the insert path.
   
   ## Are these changes tested?
   
   Existing tests pass. Test updated to match simplified `make_payload_fn` 
signature.
   
   ## Are there any user-facing changes?
   
   The `make_payload_fn` parameter of `ArrowBytesViewMap::insert_if_new` 
changes from `FnMut(Option<&[u8]>) -> V` to `FnMut() -> V`. This is a breaking 
API change for downstream users of this internal API.
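
For downstream callers, the migration should amount to dropping the unused closure argument. The sketch below illustrates the shape of the change with a toy map; `ToyMap`, its `insert_if_new` signature, and `assign_group_ids` are hypothetical stand-ins and do not reproduce the real `ArrowBytesViewMap` API. Only the `FnMut() -> V` closure shape is taken from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the map, keyed by owned byte strings.
struct ToyMap<V> {
    entries: HashMap<Vec<u8>, V>,
}

impl<V: Copy> ToyMap<V> {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// New-style signature: the payload factory takes no arguments, so the
    /// map never has to decode the value bytes just to call it.
    /// Returns the payload associated with each input value.
    fn insert_if_new<MP>(&mut self, values: &[&[u8]], mut make_payload_fn: MP) -> Vec<V>
    where
        MP: FnMut() -> V,
    {
        values
            .iter()
            .map(|value| {
                // The factory only runs when the value was not seen before.
                *self
                    .entries
                    .entry(value.to_vec())
                    .or_insert_with(&mut make_payload_fn)
            })
            .collect()
    }
}

/// Example caller: assign a dense id to each distinct value.
fn assign_group_ids(values: &[&[u8]]) -> Vec<usize> {
    let mut map = ToyMap::new();
    let mut next_id = 0usize;
    // Old style would have been: |_value: Option<&[u8]>| { ... }
    map.insert_if_new(values, || {
        let id = next_id;
        next_id += 1;
        id
    })
}
```

A caller that did rely on the value bytes inside the closure would need to obtain them some other way, though the PR states that no existing caller does.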
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

