sandy-sachin7 opened a new pull request, #10165: URL: https://github.com/apache/arrow-rs/pull/10165
## Which issue does this PR close?

Closes #10160.

## Rationale for this change

When concatenating (or interleaving) dictionary arrays with different backing arrays, the dictionary values were naively concatenated, potentially producing duplicate entries. Downstream consumers like pandas reject this because they require unique dictionary categories.

The old heuristic in `should_merge_dictionary_values` only triggered dictionary merging when `total_values >= total_entries`, which missed cases where small dictionaries with overlapping values were concatenated.

## What changes are included in this PR?

1. **`arrow-select/src/dictionary.rs`**: Changed `should_merge_dictionary_values` to always return `true` for merging when dictionaries have different backing arrays (`!single_dictionary`). Removed the `values_exceed_length` heuristic that previously gated merging. Removed the now-unused `len` parameter.
2. **`arrow-select/src/concat.rs`**: Updated `concat_dictionaries` to pass the new signature. Updated `test_string_dictionary_array` to expect 6 merged unique values instead of 7 naively concatenated values. Added a `concat_dictionary_batches_deduplicates_values` test reproducing the exact issue scenario.
3. **`arrow-select/src/interleave.rs`**: Updated `interleave_dictionaries` to pass the new signature. Updated `test_interleave_dictionary` to expect 3 merged unique values instead of 5.

## Are these changes tested?

Yes. All 379 existing tests pass, plus the new reproducing test.

## Are there any user-facing changes?

Dictionary arrays produced by `concat`, `concat_batches`, and `interleave` will now always have deduplicated dictionary values when the input arrays have different backing dictionaries. This may reduce the size of the resulting dictionary values array, but the logical data (key → value mappings) remains identical.
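
To illustrate the difference between naive concatenation and merged concatenation of dictionary values, here is a minimal, std-only sketch. It does not use arrow-rs and is not the implementation in this PR: `ToyDict`, `concat_naive`, and `concat_merged` are hypothetical names made up for this example, and the hash-map interning is an assumed stand-in for whatever the real merge logic does.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded string array: `keys[i]` indexes into `values`.
/// Hypothetical stand-in for a dictionary array; NOT the arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct ToyDict {
    values: Vec<String>,
    keys: Vec<usize>,
}

impl ToyDict {
    fn new(values: &[&str], keys: &[usize]) -> Self {
        ToyDict {
            values: values.iter().map(|s| s.to_string()).collect(),
            keys: keys.to_vec(),
        }
    }

    /// Decode to the logical sequence of strings.
    fn decode(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

/// Naive concatenation: append every values array and offset the keys.
/// Values shared between inputs end up stored more than once.
fn concat_naive(inputs: &[ToyDict]) -> ToyDict {
    let mut values = Vec::new();
    let mut keys = Vec::new();
    for d in inputs {
        let offset = values.len();
        values.extend(d.values.iter().cloned());
        keys.extend(d.keys.iter().map(|&k| k + offset));
    }
    ToyDict { values, keys }
}

/// Merged concatenation: store each distinct value once and remap the keys.
fn concat_merged(inputs: &[ToyDict]) -> ToyDict {
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::new();
    for d in inputs {
        // remap[old_key] is the key of the same value in the merged dictionary.
        let remap: Vec<usize> = d
            .values
            .iter()
            .map(|v| {
                *index.entry(v.clone()).or_insert_with(|| {
                    values.push(v.clone());
                    values.len() - 1
                })
            })
            .collect();
        keys.extend(d.keys.iter().map(|&k| remap[k]));
    }
    ToyDict { values, keys }
}

fn main() {
    let a = ToyDict::new(&["a", "b"], &[0, 1, 0]);
    let b = ToyDict::new(&["b", "c"], &[1, 0]);

    let naive = concat_naive(&[a.clone(), b.clone()]);
    let merged = concat_merged(&[a, b]);

    // Both encodings decode to the same logical data.
    assert_eq!(naive.decode(), merged.decode());

    println!("naive values:  {:?}", naive.values); // "b" is stored twice
    println!("merged values: {:?}", merged.values); // each value stored once
}
```

In this toy model the naive result carries a duplicate `"b"` in its values, which is the kind of output a consumer requiring unique categories would reject, while the merged result keeps the same decoded sequence with a smaller values array.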
