sandy-sachin7 opened a new pull request, #10165:
URL: https://github.com/apache/arrow-rs/pull/10165

   ## Which issue does this PR close?
   
   Closes #10160.
   
   ## Rationale for this change
   
   When concatenating (or interleaving) dictionary arrays with different 
backing arrays, the dictionary values were naively concatenated — potentially 
producing duplicate entries. Downstream consumers like pandas reject this 
because they require unique dictionary categories.
   
   The old heuristic in `should_merge_dictionary_values` only triggered 
dictionary merging when `total_values >= total_entries`, which missed cases 
where small dictionaries with overlapping values were concatenated.
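
To illustrate the gap, here is a minimal sketch of the two gating rules as described above. The function and parameter names are hypothetical, and the sketch ignores any other conditions the real `should_merge_dictionary_values` may check; it is not the actual arrow-rs code.

```rust
/// Old rule as described in this PR (simplified, hypothetical names):
/// merge only when the inputs use different backing dictionaries AND the
/// combined dictionary size is at least the number of entries.
fn should_merge_old(single_dictionary: bool, total_values: usize, total_entries: usize) -> bool {
    !single_dictionary && total_values >= total_entries
}

/// New rule as described in this PR (simplified): merge whenever the inputs
/// do not all share a single backing dictionary.
fn should_merge_new(single_dictionary: bool) -> bool {
    !single_dictionary
}
```

For example, two arrays with 10 entries each and dictionaries `["a", "b"]` and `["b", "c"]` give `total_values = 4` and `total_entries = 20`, so the old rule skips merging and the duplicated `"b"` survives, while the new rule merges.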
   
   ## What changes are included in this PR?
   
   1. **`arrow-select/src/dictionary.rs`**: Changed 
`should_merge_dictionary_values` to always return `true` for merging when 
dictionaries have different backing arrays (`!single_dictionary`). Removed the 
`values_exceed_length` heuristic that previously gated merging. Removed the 
now-unused `len` parameter.
   
   2. **`arrow-select/src/concat.rs`**: Updated `concat_dictionaries` to match 
the new signature. Updated `test_string_dictionary_array` to expect 6 merged 
unique values instead of 7 naively concatenated values. Added the 
`concat_dictionary_batches_deduplicates_values` test, which reproduces the exact 
scenario from the issue.
   
   3. **`arrow-select/src/interleave.rs`**: Updated `interleave_dictionaries` 
to match the new signature. Updated `test_interleave_dictionary` to expect 3 
merged unique values instead of 5.
   
   ## Are these changes tested?
   
   Yes — all 379 existing tests pass, plus the new reproducing test.
   
   ## Are there any user-facing changes?
   
   Dictionary arrays produced by `concat`, `concat_batches`, and `interleave` 
will now always have deduplicated dictionary values when the input arrays have 
different backing dictionaries. This may reduce the size of the resulting 
dictionary values array, but the logical data (key → value mappings) remains 
identical.
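
As a rough illustration of why the logical data is unchanged, the sketch below deduplicates values and remaps keys using plain `Vec`s and a `HashMap`. It assumes string values and `usize` keys, keeps every dictionary value even if no key references it, and uses hypothetical helper names (`concat_dedup`, `decode`); the real implementation works on Arrow arrays and may differ in detail.

```rust
use std::collections::HashMap;

/// Concatenate dictionary-encoded inputs, given as (values, keys) pairs,
/// deduplicating the values and remapping each key into the merged dictionary.
fn concat_dedup<'a>(inputs: &[(Vec<&'a str>, Vec<usize>)]) -> (Vec<&'a str>, Vec<usize>) {
    let mut merged_values: Vec<&'a str> = Vec::new();
    let mut index_of: HashMap<&'a str, usize> = HashMap::new();
    let mut merged_keys: Vec<usize> = Vec::new();

    for (values, keys) in inputs {
        // For this input, map each old key to its slot in the merged dictionary,
        // appending the value only the first time it is seen.
        let remap: Vec<usize> = values
            .iter()
            .map(|v| {
                *index_of.entry(*v).or_insert_with(|| {
                    merged_values.push(*v);
                    merged_values.len() - 1
                })
            })
            .collect();
        merged_keys.extend(keys.iter().map(|&k| remap[k]));
    }
    (merged_values, merged_keys)
}

/// Resolve keys back to their values, i.e. the logical contents of the array.
fn decode<'a>(values: &[&'a str], keys: &[usize]) -> Vec<&'a str> {
    keys.iter().map(|&k| values[k]).collect()
}
```

With inputs `(["a", "b"], [0, 1, 0])` and `(["b", "c"], [0, 1])`, naive concatenation would produce the values `["a", "b", "b", "c"]`, whereas this sketch produces `["a", "b", "c"]` with keys `[0, 1, 0, 1, 2]`. Both decode to the same sequence `a, b, a, b, c`.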

