Rich-T-kid opened a new issue, #22078:
URL: https://github.com/apache/datafusion/issues/22078

   ### Is your feature request related to a problem or challenge?
   
   Currently the DictionaryGroupValues path is faster than GroupValuesRows, but 
there is still room for improvement. seen_elements is a Vec of Vec<u8>, so the 
raw bytes of each element get their own heap allocation. These frequent 
allocations are minor but do show up as CPU spend in intern(). The current 
collision handling also forces a copy: the same bytes are stored in both 
seen_elements and unique_dict_value_mapping.
   
   ### Describe the solution you'd like
   
   This can be resolved by storing intermediate bytes in a single contiguous 
buffer, then tracking offsets and lengths instead of raw bytes. We'd introduce 
a new field on the struct that holds the buffer, and seen_elements / 
unique_dict_value_mapping would only need to store an offset and length per 
entry. This would replace a potentially large byte copy with two i32s.
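
   The layout described above could be sketched roughly as follows. This is a 
minimal illustration, not the DataFusion implementation: SpanInterner, Span, 
by_hash, and bytes_of are hypothetical names, the hash map stands in for 
whatever collision handling unique_dict_value_mapping actually does, and u32 is 
used for the offset and length (the issue mentions i32) on the assumption that 
both are non-negative and fit in 32 bits.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Location of one distinct value inside the shared buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Span {
    offset: u32,
    len: u32,
}

/// Hypothetical interner: every distinct value lives in one contiguous buffer,
/// and all other structures refer to it by (offset, len) only.
#[derive(Default)]
struct SpanInterner {
    buffer: Vec<u8>,                 // all distinct values, back to back
    spans: Vec<Span>,                // group id -> location in `buffer`
    by_hash: HashMap<u64, Vec<u32>>, // hash -> candidate group ids
}

impl SpanInterner {
    /// Borrow the bytes of an already interned group without copying them.
    fn bytes_of(&self, group: u32) -> &[u8] {
        let span = self.spans[group as usize];
        let start = span.offset as usize;
        &self.buffer[start..start + span.len as usize]
    }

    /// Return the group id for `value`, appending it to the buffer if unseen.
    fn intern(&mut self, value: &[u8]) -> u32 {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();

        // Collision handling compares against slices of the shared buffer,
        // so no second copy of the bytes is kept anywhere.
        if let Some(candidates) = self.by_hash.get(&hash) {
            for &group in candidates {
                if self.bytes_of(group) == value {
                    return group;
                }
            }
        }

        // New value: one append to the buffer, then record its span.
        // Assumes the buffer stays below 4 GiB so the casts do not truncate.
        let group = self.spans.len() as u32;
        let offset = self.buffer.len() as u32;
        self.buffer.extend_from_slice(value);
        self.spans.push(Span { offset, len: value.len() as u32 });
        self.by_hash.entry(hash).or_default().push(group);
        group
    }
}
```

   In this sketch, calling intern() on a value that was already seen returns 
the existing group id and leaves the buffer untouched. The buffer grows with 
amortized reallocation, so a new value typically costs an append rather than a 
fresh allocation.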
   
   ### Describe alternatives you've considered
   
   The alternative is to change nothing. Benchmarks show that even with the 
current approach, it's faster than the default GroupValuesRows approach.
   
   ### Additional context
   
   See #21765.
   
   <img width="832" height="370" alt="Image" src="https://github.com/user-attachments/assets/c25c4f95-5aef-45a9-b6bc-01080e254ff9" />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

