asubiotto opened a new issue, #9939: URL: https://github.com/apache/arrow-rs/issues/9939
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** `interleave_dictionaries` checks whether input dictionaries should be merged/GCed, but `should_merge_dictionary_values` has limited type support because of the byte keyed interner. For other value types, and in the general fallback case where the heuristic returns false, `interleave_fallback_dictionary` concatenates all the source `values` slices, leaving lots of bloat when the interleave selection is small relative to total values. This is common in DataFusion multi-partition sorts on dictionary columns: I observed a real world sort where the output `dict.values` was 99%+ unreferenced. **Describe the solution you'd like** In `interleave_fallback_dictionary`, gate on a cheap heuristic, `indices.len() < sum(all_values.len()). When it triggers, `take` only the referenced positions per Arc distinct group before concatenating. Output is logically equivalent; `values.len()` shrinks to the count of distinct referenced positions. No public API changes. On the motivating workload this dropped runtime from ~20 minutes to ~7 minutes. **Describe alternatives you've considered** * Letting downstream `compact_dict`/`gc` strip the bloat after the fact. The problem is that the concat cost dominates, so this is still expensive. * Always projecting: caused 80%+ regressions on existing `interleave_kernels.rs` dict microbenchmarks where every position is genuinely referenced. **Additional context** I will open a PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
