rabenhorst opened a new issue, #9907:
URL: https://github.com/apache/arrow-rs/issues/9907

   **Describe the bug**
   
   `GenericByteDictionaryBuilder::with_capacity` does not pre-size the internal `dedup` HashTable, even when callers pass a non-zero `value_capacity`. The dedup table starts at capacity 0 and resizes 0 → 4 → 8 → 16 → 32 → 64 → … during the first inserts, and each resize is an allocation plus a full rehash of all existing entries.
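
   For illustration, the growth churn can be reproduced standalone against hashbrown's `HashTable` (this sketch is not arrow-rs code, and the exact capacity jumps depend on hashbrown's load factor):

   ```rust
   use std::collections::hash_map::RandomState;
   use std::hash::BuildHasher;

   use hashbrown::HashTable;

   fn main() {
       let s = RandomState::new();
       // Same starting point as `dedup: Default::default()`: capacity 0.
       let mut table: HashTable<u64> = HashTable::new();
       let mut last = table.capacity();
       for i in 0..64u64 {
           table.insert_unique(s.hash_one(i), i, |v| s.hash_one(*v));
           if table.capacity() != last {
               // Every jump here is an allocation plus a full rehash of all entries.
               println!("{} inserts: capacity {} -> {}", i + 1, last, table.capacity());
               last = table.capacity();
           }
       }
   }
   ```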
   
   `new()` already does the right thing (it sizes `dedup` from `keys_builder.capacity()`), but `with_capacity()`, the constructor intended to amortize allocations, silently falls back to a default-capacity table.
   
   **To Reproduce**
   
   `arrow-array/src/builder/generic_bytes_dictionary_builder.rs` on current `main`:
   
   ```rust
   // new(): dedup correctly pre-sized
   pub fn new() -> Self {
       let keys_builder = PrimitiveBuilder::new();
       let values_builder = GenericByteBuilder::<T>::new();
       Self {
           state: Default::default(),
           dedup: HashTable::with_capacity(keys_builder.capacity()), // <-- pre-sized
           keys_builder,
           values_builder,
       }
   }

   // with_capacity(): NOT pre-sized; ignores value_capacity for dedup
   pub fn with_capacity(
       keys_capacity: usize,
       value_capacity: usize,
       data_capacity: usize,
   ) -> Self {
       Self {
           state: Default::default(),
           dedup: Default::default(), // <-- starts at capacity 0
           keys_builder: PrimitiveBuilder::with_capacity(keys_capacity),
           values_builder: GenericByteBuilder::<T>::with_capacity(value_capacity, data_capacity),
       }
   }
   ```
   
   **Expected behavior**
   
   `with_capacity(_, value_capacity, _)` should size `dedup` so it can hold at least `value_capacity` entries without resizing.
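
   As a hypothetical caller sketch (the numbers are invented), this is the pattern the second argument is supposed to serve:

   ```rust
   use arrow_array::builder::StringDictionaryBuilder;
   use arrow_array::types::Int32Type;

   fn main() {
       // ~1k rows per batch, ~512 distinct values, ~8 KiB of string data.
       let mut builder = StringDictionaryBuilder::<Int32Type>::with_capacity(1024, 512, 8192);
       for i in 0..1024 {
           builder.append_value(format!("value-{}", i % 512));
       }
       // Expected: the loop above triggers no dedup-table resizes.
       let _array = builder.finish();
   }
   ```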
   
   **Profiling evidence**
   
   This was hit while profiling a high-throughput Arrow producer (a Kafka collector writing dictionary-encoded string columns, batches of ~1k rows, ~50–500 unique values per column). CPU profile, ~50k samples:
   
   | Function | flat % | cumulative % |
   |---|---|---|
   | `GenericByteDictionaryBuilder::append` | 3.29% | 7.24% |
   
   Children of `append`:
   
   | Child | % of `append` cumulative time |
   |---|---|
   | `hashbrown::HashTable::entry` | 24.71% |
   | `ahash::RandomState::hash_one` | 10.07% |
   | `RawVec::grow_one` | 3.91% |
   | libc realloc/memcpy | ~7.8% |
   
   The combined `entry` + `hash_one` + `grow_one` cost matches a resize-and-rehash signature: work that scales with HashTable growth rather than with steady-state inserts. Once the table is sized to the working set, `entry` is SIMD-accelerated and cheap.
   
   **Proposed fix**
   
   One-line change in `with_capacity`:
   
   ```rust
   dedup: HashTable::with_capacity(value_capacity),
   ```
   
   I'll open a PR with this fix and a regression test asserting `dedup.capacity() >= value_capacity` after `with_capacity()`.
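
   Roughly the following (a sketch; the test has to live in the builder's module because `dedup` is a private field):

   ```rust
   #[test]
   fn test_with_capacity_presizes_dedup() {
       let builder =
           GenericByteDictionaryBuilder::<Int32Type, GenericStringType<i32>>::with_capacity(
               1024, 512, 8192,
           );
       // Fails with capacity 0 on unpatched main; passes after the one-line change.
       assert!(builder.dedup.capacity() >= 512);
   }
   ```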
   
   **Additional context**
   
   - Affects all `GenericByteDictionaryBuilder` callers that use `with_capacity`: `StringDictionaryBuilder`, `BinaryDictionaryBuilder`, and the `LargeStringDictionaryBuilder` / `LargeBinaryDictionaryBuilder` variants.
   - `PrimitiveDictionaryBuilder` may have the same shape; worth checking in a follow-up.
   
   ---
   
   This issue and the accompanying PR are AI-assisted (Claude). The fix and regression test were reviewed and verified locally: the test fails with capacity 0 on unpatched `main` and passes after the one-line change.

