Rich-T-kid commented on PR #21589:
URL: https://github.com/apache/datafusion/pull/21589#issuecomment-4246074228

   ### Optimization 1 — Hash caching
   
   Instead of hashing each row's value individually, pre-compute the hashes of
all d distinct values in the dictionary's values array once per batch and cache
them in a `Vec<u64>` indexed by dictionary key. Each row then resolves its hash
with a single lookup, so for a batch of n rows with d distinct values the number
of hash computations drops from O(n) to O(d).
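
A minimal sketch of the caching idea, using plain slices in place of Arrow arrays so it stays self-contained. The names `hash_value`, `cache_value_hashes`, and `row_hashes` are hypothetical, and `DefaultHasher` is only a stand-in for whatever hasher the PR actually uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single string value (stand-in hasher, an assumption of this sketch).
fn hash_value(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hash every dictionary value once: d hash computations per batch.
/// The result is indexed by dictionary key.
fn cache_value_hashes(values: &[&str]) -> Vec<u64> {
    values.iter().map(|v| hash_value(v)).collect()
}

/// Resolve per-row hashes by table lookup only; null keys yield `None`.
/// Keys are assumed to be in range for `cache`.
fn row_hashes(keys: &[Option<usize>], cache: &[u64]) -> Vec<Option<u64>> {
    keys.iter().map(|&k| k.map(|i| cache[i])).collect()
}
```

With this shape, the per-row work is an index into `cache`, and the hasher runs only once per dictionary entry.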
   
   ### Optimization 2 — Eliminate ScalarValue
   
   Replace `HashMap<ScalarValue, usize>` with a structure that stores raw hashes
and raw string slices pointing directly into the Arrow buffer. This eliminates
the per-row heap allocation (`ScalarValue::try_from_array`) and deallocation
(`drop_in_place`), which the profiler shows together account for ~60% of intern
time.
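
One way such a structure could look, as a sketch only. `SliceInterner` is a hypothetical name, and a std `HashMap` keyed by the precomputed hash stands in for whatever table the PR uses. The point it illustrates is that group values are borrowed `&str` slices rather than owned `ScalarValue`s, so a row whose value was already seen allocates nothing:

```rust
use std::collections::HashMap;

/// Hypothetical interner: maps a precomputed hash to candidate group ids and
/// stores each group's value as a slice borrowed from the input buffer.
struct SliceInterner<'a> {
    buckets: HashMap<u64, Vec<usize>>,
    groups: Vec<&'a str>,
}

impl<'a> SliceInterner<'a> {
    fn new() -> Self {
        Self { buckets: HashMap::new(), groups: Vec::new() }
    }

    /// Return the group id for `value`, creating a new group if it is unseen.
    /// `hash` is assumed to have been computed by the caller.
    fn intern(&mut self, hash: u64, value: &'a str) -> usize {
        let groups = &mut self.groups;
        let candidates = self.buckets.entry(hash).or_default();
        // Compare the raw slices to resolve hash collisions.
        if let Some(&id) = candidates.iter().find(|&&id| groups[id] == value) {
            return id;
        }
        let id = groups.len();
        groups.push(value); // stores a pointer and length, not a copy of the string
        candidates.push(id);
        id
    }
}
```

The lifetime `'a` ties the interner to the buffer it borrows from, so a real implementation would need to decide how groups outlive a single batch; that question is outside this sketch.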
   
   ### Optimization 3 — Pre-allocate using occupancy
   
   Use `dict_array.occupancy().count_set_bits()` to determine the number of truly
distinct non-null values in the batch upfront and pre-allocate internal storage
accordingly. This avoids incremental `Vec` growth during intern.
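
The sizing step can be sketched without Arrow as follows. Here `occupancy_mask` and `preallocated_storage` are hypothetical helpers that mimic the idea on plain slices (a mask over the dictionary values, set where some non-null key references the value); they are not the arrow-rs API itself:

```rust
/// Mark which dictionary values are referenced by at least one non-null key.
/// Keys are assumed to be in range for `num_values`.
fn occupancy_mask(keys: &[Option<usize>], num_values: usize) -> Vec<bool> {
    let mut used = vec![false; num_values];
    for &key in keys.iter().flatten() {
        used[key] = true;
    }
    used
}

/// Count the set entries in the mask.
fn count_set(mask: &[bool]) -> usize {
    mask.iter().filter(|&&bit| bit).count()
}

/// Allocate storage sized for the referenced values, so later pushes
/// up to that count do not trigger a reallocation.
fn preallocated_storage<T>(keys: &[Option<usize>], num_values: usize) -> Vec<T> {
    let mask = occupancy_mask(keys, num_values);
    Vec::with_capacity(count_set(&mask))
}
```

Unreferenced dictionary entries and null keys do not contribute to the count, so the reservation tracks what the batch actually uses rather than the full dictionary size.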


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

