asubiotto opened a new issue, #9939:
URL: https://github.com/apache/arrow-rs/issues/9939

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   `interleave_dictionaries` checks whether input dictionaries should be 
merged/GCed, but `should_merge_dictionary_values` has limited type support 
because of the byte keyed interner. For other value types, and in the general 
fallback case where the heuristic returns false, 
`interleave_fallback_dictionary` concatenates all the source `values` slices, 
leaving lots of bloat when the interleave selection is small relative to total 
values. This is common in DataFusion multi-partition sorts on dictionary 
columns: I observed a real world sort where the output `dict.values` was 99%+ 
unreferenced.
   
   **Describe the solution you'd like**
   
   In `interleave_fallback_dictionary`, gate on a cheap heuristic, 
`indices.len() < sum(all_values.len()). When it triggers, `take` only the 
referenced positions per Arc distinct group before concatenating. Output is 
logically equivalent; `values.len()` shrinks to the count of distinct 
referenced positions. No public API changes.
   
   On the motivating workload this dropped runtime from ~20 minutes to ~7 
minutes.
   
   **Describe alternatives you've considered**
   
   * Letting downstream `compact_dict`/`gc` strip the bloat after the fact. The 
problem is that the concat cost dominates, so this is still expensive.
   * Always projecting: caused 80%+ regressions on existing 
`interleave_kernels.rs` dict microbenchmarks where every position is genuinely 
referenced.
   
   **Additional context**
   
   I will open a PR
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to