RyanJamesStewart opened a new pull request, #22172: URL: https://github.com/apache/datafusion/pull/22172
## Which issue does this PR close? Closes #20054. ## Rationale for this change `ArrowBytesViewMap::insert_if_new_inner` is the hot loop behind `COUNT DISTINCT` and `GROUP BY` over `StringViewArray` and `BinaryViewArray` columns. The new-value branch calls `values.value(i)` to reconstruct bytes from the views buffer, but for inline strings (len <= 12) those bytes are already packed in the local `view_u128`. The Arrow inline ByteView format stores `[len:u32 LE][data:12 bytes zero-padded]` inside the u128, so `values.value(i)` performs an avoidable buffer deref per first-seen distinct inline value. Dandandan flagged the same gap in the PR #19975 review thread that spawned this issue: "We should avoid this for inlined values (and it's the same as above `input_value`, so now it does it twice." The fix has not landed in the five months since. The equivalence is bit-level: for an inline input, `make_view(value, 0, 0)` produces a u128 byte-identical to the input `view_u128` (same length in low 4 bytes, same data in high 12 bytes, zero-padded the same way). Pushing `view_u128` directly into `self.views` produces the same stored state as the round trip through `values.value(i)` and `append_value`. AI-assistance disclosure: this PR was developed with AI assistance. The equivalence argument, the choice to defer the non-inline K-collision case to a separate PR, and the decision to add a path-specific bench (since the existing `aggregate_vectorized.rs` exercises `ByteViewGroupValueBuilder`, not this map) are decisions I understand end-to-end and can justify in review. No unknowns are masked. ## What changes are included in this PR? - New helper `fn append_inline_view(&mut self, view: u128) -> u128` in `binary_view_map.rs`. - `insert_if_new_inner` new-value path branches on `len <= 12`: inline case extracts bytes from `view_u128.to_le_bytes()[4..4 + len as usize]`, calls the new helper, and skips the `values.value(i)` plus `make_view` round trip. The non-inline path is unchanged. - New bench at `datafusion/physical-expr-common/benches/binary_view_map_insert.rs` exercising the patched code path on `StringViewArray` with all-distinct inline strings, plus a registration entry in the crate's `Cargo.toml`. The non-inline K-collision case (where `values.value(i)` is called K+1 times when there are K hash and prefix collisions) is a separate concern and is deferred to a future PR if it proves worth pursuing. ## Are these changes tested? `cargo test -p datafusion-physical-expr-common --lib binary_view_map`: 8 of 8 existing tests pass. The existing suite covers inline, non-inline, binary, and non-UTF-8 paths; any byte-order or off-by-one error in the inline view extraction would surface as a wrong hash bucket or a wrong lookup result. `cargo clippy -p datafusion-physical-expr-common --all-targets -- -D warnings`: clean. The new bench measures `ArrowBytesViewSet_insert_if_new` on all-distinct `StringViewArray` inputs at len=8 (worst case: every row hits the new-value branch). Results on a Ryzen 9 9950X: | n | baseline | patched | delta | |---|----------|---------|-------| | 1,000 | 11.73 µs | 10.53 µs | -8.9% | | 10,000 | 150.6 µs | 135.4 µs | -10.1% | | 100,000 | 1697.6 µs | 1620.1 µs | -4.6% | Criterion reports "Performance has improved" at all three sizes with p = 0.00. The delta narrows at 100K because the hash map itself starts dominating when all values are distinct; production workloads with repeated values benefit proportionally to the first-seen-distinct count. ## Are there any user-facing changes? No public API changes. `append_inline_view` is private to the crate; `insert_if_new_inner` keeps its signature. The `api change` label is not required. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
