rabenhorst opened a new issue, #9907:
URL: https://github.com/apache/arrow-rs/issues/9907
**Describe the bug**
`GenericByteDictionaryBuilder::with_capacity` does not pre-size the internal
`dedup` HashTable, even when callers pass a non-zero `value_capacity`. The
dedup table starts at capacity 0 and resizes 0 → 4 → 8 → 16 → 32 → 64 → … on
the first inserts, with each resize being an allocation + full rehash of all
existing entries.
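For scale, doubling growth costs roughly one extra rehash per inserted entry. A minimal simulation (`rehashed_entries` is a hypothetical helper, not arrow-rs code; hashbrown actually resizes a bit before the table is full, so real numbers differ slightly):

```rust
// Simulate capacity-doubling growth from 4 and count how many existing
// entries get rehashed across all resizes (a sketch, not arrow-rs code).
fn rehashed_entries(inserts: usize) -> usize {
    let mut cap = 0usize;
    let mut moved = 0usize;
    for len in 0..inserts {
        if len == cap {
            moved += len; // a resize rehashes every existing entry
            cap = if cap == 0 { 4 } else { cap * 2 };
        }
    }
    moved
}

fn main() {
    // For 500 unique values (the upper end of the workload profiled below),
    // growth 0→4→8→…→512 rehashes 4+8+…+256 = 508 entries — roughly
    // doubling the insert work. A pre-sized table rehashes none.
    println!("{}", rehashed_entries(500)); // prints 508
}
```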
`new()` already does the right thing (sizes `dedup` from
`keys_builder.capacity()`), but `with_capacity()` — which is intended to
amortize allocations — silently regresses to default capacity.
**To Reproduce**
`arrow-array/src/builder/generic_bytes_dictionary_builder.rs` — current
`main`:
```rust
// new() — correctly sized
pub fn new() -> Self {
    let keys_builder = PrimitiveBuilder::new();
    let values_builder = GenericByteBuilder::<T>::new();
    Self {
        state: Default::default(),
        dedup: HashTable::with_capacity(keys_builder.capacity()), // <-- pre-sized
        keys_builder,
        values_builder,
    }
}

// with_capacity() — NOT sized, ignores value_capacity for dedup
pub fn with_capacity(
    keys_capacity: usize,
    value_capacity: usize,
    data_capacity: usize,
) -> Self {
    Self {
        state: Default::default(),
        dedup: Default::default(), // <-- starts at capacity 0
        keys_builder: PrimitiveBuilder::with_capacity(keys_capacity),
        values_builder: GenericByteBuilder::<T>::with_capacity(value_capacity, data_capacity),
    }
}
```
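The default-vs-pre-sized difference is easy to see with std's `HashMap`, which wraps the same hashbrown table (an illustration only, not arrow-rs code — the builder's `dedup` field is private):

```rust
use std::collections::HashMap;

fn main() {
    // A default-constructed map starts at capacity 0 and must grow
    // (allocate + rehash) on the first inserts...
    let empty: HashMap<u32, u32> = HashMap::new();
    assert_eq!(empty.capacity(), 0);

    // ...while with_capacity reserves room for at least the requested
    // number of entries up front, so no resize happens until it is exceeded.
    let sized: HashMap<u32, u32> = HashMap::with_capacity(100);
    assert!(sized.capacity() >= 100);

    println!("default: {}, pre-sized: {}", empty.capacity(), sized.capacity());
}
```

This is exactly the guarantee `with_capacity(_, value_capacity, _)` should extend to the dedup table.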
**Expected behavior**
`with_capacity(_, value_capacity, _)` should size `dedup` so it can hold at
least `value_capacity` entries without resizing.
**Profiling evidence**
Hit while profiling a high-throughput Arrow producer (Kafka collector →
dictionary-encoded string columns, batches of ~1k rows, ~50–500 unique values
per column). CPU profile (~50k samples):
| Function | flat % | cum % |
|---|---|---|
| `GenericByteDictionaryBuilder::append` | 3.29% | 7.24% |
Children of `append`:
| Child | % of `append` cum |
|---|---|
| `hashbrown::HashTable::entry` | 24.71% |
| `ahash::RandomState::hash_one` | 10.07% |
| `RawVec::grow_one` | 3.91% |
| libc realloc/memcpy | ~7.8% |
The combined `entry` + `hash_one` + `grow_one` cost matches the
resize+rehash signature — work scaling with HashTable growth, not with
steady-state inserts. Once the table is sized to the working set, `entry` is
SIMD-accelerated and cheap.
**Proposed fix**
One-line change in `with_capacity`:
```rust
dedup: HashTable::with_capacity(value_capacity),
```
I'll open a PR with this fix and a regression test asserting
`dedup.capacity() >= value_capacity` after `with_capacity()`.
**Additional context**
- Affects all `GenericByteDictionaryBuilder` callers using `with_capacity`
(`StringDictionaryBuilder`, `BinaryDictionaryBuilder`, and the `Large*`
variants via `LargeStringDictionaryBuilder` / `LargeBinaryDictionaryBuilder`).
- `PrimitiveDictionaryBuilder` may have the same shape — worth checking in a
follow-up.
---
This issue and the accompanying PR are AI-assisted (Claude). The fix and
regression test were reviewed and verified locally — the test fails with
capacity 0 on unpatched `main` and passes after the one-line change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]