RyanJamesStewart opened a new pull request, #22172:
URL: https://github.com/apache/datafusion/pull/22172

   ## Which issue does this PR close?
   
   Closes #20054.
   
   ## Rationale for this change
   
   `ArrowBytesViewMap::insert_if_new_inner` is the hot loop behind `COUNT 
DISTINCT` and `GROUP BY` over `StringViewArray` and `BinaryViewArray` columns. 
The new-value branch calls `values.value(i)` to reconstruct bytes from the 
views buffer, but for inline strings (len <= 12) those bytes are already packed 
in the local `view_u128`. The Arrow inline ByteView format stores `[len:u32 
LE][data:12 bytes zero-padded]` inside the u128, so `values.value(i)` performs 
an avoidable buffer deref per first-seen distinct inline value.
   
   Dandandan flagged the same gap in the PR #19975 review thread that spawned 
this issue: "We should avoid this for inlined values (and it's the same as 
above `input_value`, so now it does it twice." The fix has not landed in the 
five months since.
   
   The equivalence is bit-level: for an inline input, `make_view(value, 0, 0)` 
produces a u128 byte-identical to the input `view_u128` (same length in low 4 
bytes, same data in high 12 bytes, zero-padded the same way). Pushing 
`view_u128` directly into `self.views` produces the same stored state as the 
round trip through `values.value(i)` and `append_value`.
   
   AI-assistance disclosure: this PR was developed with AI assistance. The 
equivalence argument, the choice to defer the non-inline K-collision case to a 
separate PR, and the decision to add a path-specific bench (since the existing 
`aggregate_vectorized.rs` exercises `ByteViewGroupValueBuilder`, not this map) 
are decisions I understand end-to-end and can justify in review. No unknowns 
are masked.
   
   ## What changes are included in this PR?
   
   - New helper `fn append_inline_view(&mut self, view: u128) -> u128` in 
`binary_view_map.rs`.
   - `insert_if_new_inner` new-value path branches on `len <= 12`: inline case 
extracts bytes from `view_u128.to_le_bytes()[4..4 + len as usize]`, calls the 
new helper, and skips the `values.value(i)` plus `make_view` round trip. The 
non-inline path is unchanged.
   - New bench at 
`datafusion/physical-expr-common/benches/binary_view_map_insert.rs` exercising 
the patched code path on `StringViewArray` with all-distinct inline strings, 
plus a registration entry in the crate's `Cargo.toml`.
   
   The non-inline K-collision case (where `values.value(i)` is called K+1 times 
when there are K hash and prefix collisions) is a separate concern and is 
deferred to a future PR if it proves worth pursuing.
   
   ## Are these changes tested?
   
   `cargo test -p datafusion-physical-expr-common --lib binary_view_map`: 8 of 
8 existing tests pass. The existing suite covers inline, non-inline, binary, 
and non-UTF-8 paths; any byte-order or off-by-one error in the inline view 
extraction would surface as a wrong hash bucket or a wrong lookup result.
   
   `cargo clippy -p datafusion-physical-expr-common --all-targets -- -D 
warnings`: clean.
   
   The new bench measures `ArrowBytesViewSet_insert_if_new` on all-distinct 
`StringViewArray` inputs at len=8 (worst case: every row hits the new-value 
branch). Results on a Ryzen 9 9950X:
   
   | n | baseline | patched | delta |
   |---|----------|---------|-------|
   | 1,000 | 11.73 µs | 10.53 µs | -8.9% |
   | 10,000 | 150.6 µs | 135.4 µs | -10.1% |
   | 100,000 | 1697.6 µs | 1620.1 µs | -4.6% |
   
   Criterion reports "Performance has improved" at all three sizes with p = 
0.00. The delta narrows at 100K because the hash map itself starts dominating 
when all values are distinct; production workloads with repeated values benefit 
proportionally to the first-seen-distinct count.
   
   ## Are there any user-facing changes?
   
   No public API changes. `append_inline_view` is private to the crate; 
`insert_if_new_inner` keeps its signature. The `api change` label is not 
required.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to