rabenhorst commented on PR #9908:
URL: https://github.com/apache/arrow-rs/pull/9908#issuecomment-4381537330

   Thanks @etseidl @alamb — sharing the cleanest before/after we have. Rather than a synthetic microbenchmark, we captured two 30-minute CPU profiles against the same workload: once with stock `arrow-array` 58.1.0 (the latest published release at the time, which has the bug), and once with this PR's branch pulled in via `[patch.crates-io]`.
   
   ### Setup
   
   Workload: an Arrow producer encoding OTel span data into dict-encoded `StringDictionaryBuilder` columns, with ~50–500 unique values per column per batch and batches of ~1k rows — exactly the regime where the unpatched dedup `HashTable` resizes 0 → 4 → 8 → 16 → 32 → 64 on the first inserts.
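   
   For concreteness, the hot path has roughly the shape below. This is a sketch, not the real producer code: `encode_column`, the capacity hints, and the row/value counts are illustrative placeholders.
   
   ```rust
   use arrow_array::builder::StringDictionaryBuilder;
   use arrow_array::types::Int32Type;
   use arrow_array::DictionaryArray;
   
   /// Illustrative only: one dict-encoded string column per batch, ~1k rows,
   /// a few hundred distinct values. The real producer code differs, but these
   /// builder calls are the ones the profiles below are measuring.
   fn encode_column(rows: &[&str]) -> DictionaryArray<Int32Type> {
       // Capacity hints: keys ≈ rows per batch, values/data ≈ expected distinct
       // values and their total byte length (the numbers here are placeholders).
       let mut builder =
           StringDictionaryBuilder::<Int32Type>::with_capacity(rows.len(), 512, 512 * 16);
       for &value in rows {
           builder.append_value(value);
       }
       builder.finish()
   }
   ```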
   
   Profile sample counts are comparable: 49,535 (unpatched) vs 50,718 (patched).
   
   ### Results (% of total CPU)
   
   | Frame | Unpatched 58.1.0 | This PR | Δ (relative) |
   |---|---|---|---|
   | `GenericByteDictionaryBuilder::append` (cumulative) | 7.24% | **5.11%** | **−29%** |
   | `GenericByteDictionaryBuilder::append` (self time) | 3.29% | **0.72%** | **−78%** |
   | └─ `RawVec::grow_one` (child of `append`) | 0.28% | 0.16% | −43% |
   | └─ `HashTable::entry` (child of `append`) | 1.79% | 2.71% | +51% |
   | └─ `ahash::RandomState::hash_one` (child of `append`) | 0.73% | 1.17% | +60% |
   
   ### Reading
   
   - The clearest signal is `append`'s **self time** dropping 78% (3.29% → 0.72% of total CPU). That self time was the resize+rehash bookkeeping happening in `append`'s own frame on the unpatched build — exactly what pre-sizing the dedup table eliminates (see the sketch after this list).
   - `RawVec::grow_one` drops 43%. The residual is likely the values builder's 
own buffer growth (`data_capacity`), which this PR doesn't touch.
   - `HashTable::entry` and `hash_one` go **up** in share of total CPU because, with the resize churn gone, the same CPU budget processes more events, so steady-state probe + hash work grows proportionally. Within `append`, the two go from ~35% to ~76% of its cumulative time — the "nothing left to optimize cheaply" shape; what remains is the unavoidable SIMD-accelerated probe + hash.
   - Net: ~2 percentage points of total CPU recovered on this workload (7.24% → 5.11% for `append` cumulative).
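   
   As a standalone illustration of the resize+rehash chain from the first bullet, the snippet below uses `hashbrown::HashTable` directly (the builder's internal dedup table differs in detail, and this is a sketch, not code from this PR; it assumes `hashbrown` and `ahash` are available as dependencies):
   
   ```rust
   use std::hash::BuildHasher;
   
   use hashbrown::HashTable;
   
   fn main() {
       let hasher = ahash::RandomState::new();
   
       // Cold table: every capacity jump rehashes everything already inserted,
       // which is the bookkeeping that showed up as `append` self time.
       let mut cold: HashTable<String> = HashTable::new();
       let mut last = cold.capacity();
       for i in 0..500u32 {
           let v = format!("value-{i}");
           let h = hasher.hash_one(&v);
           cold.insert_unique(h, v, |s| hasher.hash_one(s));
           if cold.capacity() != last {
               println!("grew {last} -> {}", cold.capacity());
               last = cold.capacity();
           }
       }
   
       // Pre-sized table: one allocation up front, no resize/rehash chain.
       let mut warm: HashTable<String> = HashTable::with_capacity(500);
       for i in 0..500u32 {
           let v = format!("value-{i}");
           let h = hasher.hash_one(&v);
           warm.insert_unique(h, v, |s| hasher.hash_one(s));
       }
       assert!(warm.capacity() >= 500);
   }
   ```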
   
   Happy to add a criterion microbenchmark too if it would help — the asymptotic story is straightforward (pre-sizing eliminates the O(value_capacity) resize+rehash chain on the first batch through each builder), but a synthetic bench would make the result reproducible alongside the existing arrow-array bench suite; something like the sketch below.
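   
   A rough shape for such a bench (a sketch only; the bench name, row count, and value cardinality are placeholders, and it assumes the usual criterion harness setup rather than an existing bench in the suite):
   
   ```rust
   use std::hint::black_box;
   
   use arrow_array::builder::StringDictionaryBuilder;
   use arrow_array::types::Int32Type;
   use criterion::{criterion_group, criterion_main, Criterion};
   
   // Build one dict-encoded column per iteration in the ~1k rows /
   // few-hundred-distinct-values regime from the profiles above.
   fn dict_builder_first_batch(c: &mut Criterion) {
       let rows: Vec<String> = (0..1024).map(|i| format!("span-{}", i % 500)).collect();
   
       c.bench_function("string_dict_builder_first_batch", |b| {
           b.iter(|| {
               let mut builder = StringDictionaryBuilder::<Int32Type>::with_capacity(
                   rows.len(),
                   512,
                   512 * 16,
               );
               for v in &rows {
                   builder.append_value(v);
               }
               black_box(builder.finish())
           })
       });
   }
   
   criterion_group!(benches, dict_builder_first_batch);
   criterion_main!(benches);
   ```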

