rabenhorst opened a new pull request, #9908:
URL: https://github.com/apache/arrow-rs/pull/9908

   # Which issue does this PR close?
   
   - Closes #9907.
   
   # Rationale for this change
   
   `GenericByteDictionaryBuilder::with_capacity` left the internal `dedup` `HashTable` at default capacity (0), forcing a 0 → 4 → 8 → 16 → 32 → 64 → … chain of resize+rehash cycles on the first inserts, even when callers passed a non-zero `value_capacity`.
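
   To make the cost concrete, here is a minimal standalone sketch (assuming a direct dependency on `hashbrown`, the crate behind the builder's dedup table; the exact capacity sequence may differ by version) that counts regrowths of an unsized `HashTable` over 128 inserts, versus none when pre-sized:

   ```rust
   use hashbrown::HashTable;

   fn main() {
       // Unsized table (what `with_capacity` built before this PR): every
       // capacity change below is a resize plus a rehash of all entries.
       let mut table: HashTable<u64> = HashTable::new();
       let (mut last, mut regrowths) = (table.capacity(), 0);
       for i in 0..128u64 {
           // Identity "hash" keeps the sketch self-contained.
           table.insert_unique(i, i, |&v| v);
           if table.capacity() != last {
               regrowths += 1;
               last = table.capacity();
           }
       }
       println!("unsized: {regrowths} regrowths, final capacity {last}");

       // Pre-sized table (what `with_capacity` builds after this PR): the
       // same number of inserts triggers no regrowth at all.
       let presized: HashTable<u64> = HashTable::with_capacity(128);
       assert!(presized.capacity() >= 128);
   }
   ```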
   
   `new()` already sizes `dedup` from `keys_builder.capacity()`; `with_capacity` (the variant intended to amortize allocations up front) should do the same from `value_capacity`. As it stands, callers reaching for `with_capacity` to avoid allocation churn get the opposite of what they asked for on the dedup path.
   
   See #9907 for the full context, including a CPU profile showing `GenericByteDictionaryBuilder::append` at 7.24% cumulative, roughly half of which is resize+rehash cost on the first batch.
   
   # What changes are included in this PR?
   
   - `arrow-array/src/builder/generic_bytes_dictionary_builder.rs`: change `dedup: Default::default()` to `dedup: HashTable::with_capacity(value_capacity)` in `with_capacity`. This is a one-line behavioral change, sketched below.
   - Regression test `test_with_capacity_presizes_dedup` asserting 
`dedup.capacity() >= value_capacity` after construction.
   
   This fixes `StringDictionaryBuilder`, `BinaryDictionaryBuilder`, and the 
`Large*` variants (all type aliases over `GenericByteDictionaryBuilder`).
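
   For reviewers reading this without the diff open, the shape of the change is sketched here. This is a hypothetical, stripped-down stand-in (the real builder has more fields; the actual code lives in the file named above):

   ```rust
   use hashbrown::HashTable;

   // Hypothetical minimal stand-in for `GenericByteDictionaryBuilder`; only
   // the dedup field is relevant to this PR.
   struct Builder {
       dedup: HashTable<usize>,
   }

   impl Builder {
       fn with_capacity(value_capacity: usize) -> Self {
           Self {
               // Before: `dedup: Default::default()`, i.e. capacity 0.
               // After: pre-size from `value_capacity`, mirroring what
               // `new()` already does from `keys_builder.capacity()`.
               dedup: HashTable::with_capacity(value_capacity),
           }
       }
   }

   fn main() {
       assert!(Builder::with_capacity(128).dedup.capacity() >= 128);
   }
   ```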
   
   # Are these changes tested?
   
   Yes. New test in 
`arrow-array/src/builder/generic_bytes_dictionary_builder.rs::tests`:
   
   ```rust
   #[test]
   fn test_with_capacity_presizes_dedup() {
       // `with_capacity` must size the dedup `HashTable` from `value_capacity`,
       // otherwise the first inserts force a chain of resize+rehash cycles.
       let value_capacity = 128;
       let builder =
           GenericByteDictionaryBuilder::<Int32Type, GenericStringType<i32>>::with_capacity(
               256,
               value_capacity,
               value_capacity * 32,
           );
       assert!(
           builder.dedup.capacity() >= value_capacity,
           "dedup HashTable not pre-sized: got capacity {}, expected >= {}",
           builder.dedup.capacity(),
           value_capacity,
       );
   }
   ```
   
   Verified the test fails on unpatched `main` (`got capacity 0, expected >= 
128`) and passes with the fix. Full `generic_bytes_dictionary_builder` test 
suite (17 tests) still passes. `cargo clippy --lib -- -D warnings` is clean.
   
   I did not add a benchmark: the win here is eliminating the O(value_capacity) resize+rehash work on first use, and the existing arrow-rs benchmark suite already exercises dictionary builders. Happy to add one if reviewers prefer.
   
   # Are there any user-facing changes?
   
   No public API changes. The behavior change is strictly an allocation/perf improvement: callers of `with_capacity` who previously got a zero-capacity dedup table now get one sized to `value_capacity`, trading a slightly larger upfront allocation for avoiding the resize+rehash chain. Callers who intentionally pass very large `value_capacity` values to hint the values buffer (without expecting to insert that many distinct values) will now also pre-allocate that much dedup capacity; this matches the documented contract of the parameter (`"the number of distinct dictionary values"`).
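
   As a usage illustration, here is a small hedged example (the capacity numbers are arbitrary; the three arguments are key capacity, distinct-value capacity, and value-data byte capacity, as exercised by the test above):

   ```rust
   use arrow_array::builder::StringDictionaryBuilder;
   use arrow_array::types::Int32Type;
   use arrow_array::Array;

   fn main() {
       // 10_000 keys, 1_000 distinct values, ~32 bytes per distinct value.
       // With this PR the dedup table starts with capacity >= 1_000, so the
       // distinct values below are inserted without an intermediate rehash.
       let mut builder =
           StringDictionaryBuilder::<Int32Type>::with_capacity(10_000, 1_000, 32_000);
       for i in 0..10_000 {
           builder.append_value(format!("value-{:03}", i % 1_000));
       }
       let array = builder.finish();
       assert_eq!(array.len(), 10_000);
   }
   ```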
   
   ---
   
   ## AI-assisted submission disclosure
   
   Per `CONTRIBUTING.md` — this PR was prepared with AI assistance (Claude). 
The bug was found while profiling a downstream Arrow producer; the fix and 
regression test were reviewed and verified locally. The change is one line plus 
a focused test, and I am happy to debug and own follow-ups.

