Rich-T-kid commented on PR #21589: URL: https://github.com/apache/datafusion/pull/21589#issuecomment-4246074228
### Optimization 1: Hash caching

Instead of hashing each row's value individually, pre-compute the hashes for all `d` distinct values in the dictionary's values array once per batch and cache them in a `Vec<u64>` indexed by dictionary key. For a batch of `n` rows with `d` distinct values, this reduces the number of hash computations from O(n) to O(d).

### Optimization 2: Eliminate ScalarValue

Replace `HashMap<ScalarValue, usize>` with a structure that stores raw hashes and raw string slices pointing directly into the Arrow buffer. This eliminates the per-row heap allocation (`ScalarValue::try_from_array`) and deallocation (`drop_in_place`), which the profiler shows together account for ~60% of intern time.

### Optimization 3: Pre-allocate using occupancy

Use `dict_array.occupancy().count_set_bits()` to determine upfront the number of truly distinct non-null values in the batch, and pre-allocate internal storage accordingly. This avoids incremental `Vec` growth during intern.
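
To make the three ideas concrete, here is a minimal std-only sketch of how they could fit together for a single batch. It is not the code in this PR: `DictBatch`, `occupancy`, `hash_str` and `intern_batch` are hypothetical names, `DictBatch` is a plain stand-in for an Arrow dictionary array (no `arrow` crate involved), and the borrowed `&str` slices are only valid for the lifetime of that one batch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a string dictionary array:
/// `values` are the dictionary entries, `keys` hold one optional index per row.
pub struct DictBatch<'a> {
    pub values: Vec<&'a str>,
    pub keys: Vec<Option<usize>>,
}

fn hash_str(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// One flag per dictionary slot: true if at least one non-null key references it.
/// (Plays the role that `occupancy()` plays in the comment above.)
fn occupancy(batch: &DictBatch) -> Vec<bool> {
    let mut used = vec![false; batch.values.len()];
    for k in batch.keys.iter().flatten() {
        used[*k] = true;
    }
    used
}

/// Returns the distinct groups (borrowed slices) and one group id per row.
pub fn intern_batch<'a>(batch: &DictBatch<'a>) -> (Vec<&'a str>, Vec<Option<usize>>) {
    let used = occupancy(batch);
    // Number of referenced slots; an upper bound on the distinct values
    // if the dictionary itself contains duplicate entries.
    let referenced = used.iter().filter(|u| **u).count();

    // Optimization 1: hash each referenced dictionary value once (O(d)),
    // then look hashes up by key for every row. Unreferenced slots get a
    // placeholder that is never read.
    let hashes: Vec<u64> = batch
        .values
        .iter()
        .zip(&used)
        .map(|(v, u)| if *u { hash_str(v) } else { 0 })
        .collect();

    // Optimization 3: size the storage upfront instead of growing it.
    let mut groups: Vec<&'a str> = Vec::with_capacity(referenced);
    // Optimization 2: the table maps a raw hash to candidate group ids;
    // the groups are borrowed slices, so no owned scalar is built per row.
    let mut table: HashMap<u64, Vec<usize>> = HashMap::with_capacity(referenced);

    let mut group_ids = Vec::with_capacity(batch.keys.len());
    for key in &batch.keys {
        let k = match *key {
            Some(k) => k,
            None => {
                group_ids.push(None);
                continue;
            }
        };
        let (hash, value) = (hashes[k], batch.values[k]);
        let bucket = table.entry(hash).or_default();
        // Compare the actual strings so hash collisions and duplicate
        // dictionary entries both resolve to the right group.
        let found = bucket.iter().copied().find(|&g| groups[g] == value);
        let id = match found {
            Some(g) => g,
            None => {
                groups.push(value);
                let g = groups.len() - 1;
                bucket.push(g);
                g
            }
        };
        group_ids.push(Some(id));
    }
    (groups, group_ids)
}
```

A real implementation would need to own or copy the string bytes for groups that outlive the batch, and would likely use a raw hash table rather than `HashMap<u64, Vec<usize>>`; the sketch only illustrates where the per-row hashing and per-row allocation disappear.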
