rabenhorst commented on PR #9908: URL: https://github.com/apache/arrow-rs/pull/9908#issuecomment-4381537330
Thanks @etseidl @alamb — sharing the cleanest before/after we have. Rather than a synthetic microbenchmark, we captured two 30-minute CPU profiles against the same workload: once with stock `arrow-array` 58.1.0 (the latest published release at the time, which has the bug), and once with this PR's branch pulled in via `[patch.crates-io]`.

### Setup

Workload: an Arrow producer encoding OTel span data into dict-encoded `StringDictionaryBuilder` columns. ~50–500 unique values per column per batch, batches of ~1k rows — exactly the regime where the unpatched dedup `HashTable` resizes 0 → 4 → 8 → 16 → 32 → 64 on the first inserts. Profile sample counts are comparable: 49,535 (unpatched) vs 50,718 (patched).

### Results (% of total CPU)

| Frame | Unpatched 58.1.0 | This PR | Δ |
|---|---|---|---|
| `GenericByteDictionaryBuilder::append` cumulative | 7.24% | **5.11%** | **−29%** |
| `GenericByteDictionaryBuilder::append` flat (self) | 3.29% | **0.72%** | **−78%** |
| └─ `RawVec::grow_one` (child of append) | 0.28% | 0.16% | −43% |
| └─ `HashTable::entry` (child of append) | 1.79% | 2.71% | +51% |
| └─ `ahash::RandomState::hash_one` (child) | 0.73% | 1.17% | +60% |

### Reading

- The clearest signal is `append`'s **flat self-time** dropping 78% (3.29% → 0.72% of total CPU). That self-time was the resize+rehash bookkeeping happening in `append`'s own frame on the unpatched build — exactly what pre-sizing the dedup table eliminates.
- `RawVec::grow_one` drops 43%. The residual is likely the values builder's own buffer growth (`data_capacity`), which this PR doesn't touch.
- `HashTable::entry` and `hash_one` go **up** in share of total CPU because, with the resize churn gone, the same CPU budget processes more events, so steady-state probe + hash work grows proportionally. Within `append`, the two go from ~35% → ~76% of cumulative time — the "nothing left to optimize cheaply" shape; what remains is the unavoidable SIMD-accelerated probe + hash.
- Net: roughly 2 percentage points of total CPU recovered on this workload.
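The pre-sizing effect above can be sketched with std's `HashMap` standing in for the builder's dedup `HashTable` — a minimal illustration, not the PR's actual code; `presized_dedup` and the capacity of 64 are illustrative names/values:

```rust
use std::collections::HashMap;

/// Build a dedup map pre-sized for `expected` distinct values, so the
/// first batch never walks the incremental resize/rehash chain that a
/// default-capacity table goes through on its first inserts.
fn presized_dedup(expected: usize) -> HashMap<String, u32> {
    HashMap::with_capacity(expected)
}

fn main() {
    let mut dedup = presized_dedup(64);
    let start_cap = dedup.capacity();
    for i in 0..64u32 {
        // Every key is distinct, so each insert hits the dedup path.
        dedup.insert(format!("value-{i}"), i);
    }
    // The pre-sized map never reallocated during the batch; a
    // HashMap::new() starting at capacity 0 would have resized repeatedly.
    assert_eq!(dedup.capacity(), start_cap);
    println!("capacity held steady at {start_cap}");
}
```

The same "reserve once, insert many" shape is what the PR applies to the builder's internal table.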
Happy to add a criterion microbenchmark too if it would help — the asymptotic story is straightforward (pre-sizing eliminates the O(value_capacity) resize+rehash chain on the first batch through each builder), but a synthetic bench could make it reproducible against the existing arrow-array bench suite.
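For reference, the bench would have roughly this shape — a std-only sketch using `Instant`; a real version would drive `StringDictionaryBuilder` itself under criterion, and `first_batch` is an illustrative name, not existing code:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Time the first batch of `n` distinct inserts into a dedup map,
/// optionally pre-sized. Returns (elapsed, number of distinct keys).
/// The HashMap is a stand-in for the builder's internal dedup table.
fn first_batch(presize: Option<usize>, n: usize) -> (Duration, usize) {
    let start = Instant::now();
    let mut map: HashMap<String, u32> = match presize {
        Some(cap) => HashMap::with_capacity(cap),
        None => HashMap::new(),
    };
    for i in 0..n {
        // Distinct keys, so every insert exercises the dedup path.
        map.insert(format!("v{i}"), i as u32);
    }
    (start.elapsed(), map.len())
}

fn main() {
    // 500 matches the upper end of the distinct-value range in the profile.
    let n = 500;
    let (default_t, _) = first_batch(None, n);
    let (presized_t, _) = first_batch(Some(n), n);
    println!("default-capacity: {default_t:?}, pre-sized: {presized_t:?}");
}
```

A criterion version would just wrap the two variants in `bench_function` closures so the comparison is reproducible in CI.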
