Rich-T-kid opened a new issue, #22078: URL: https://github.com/apache/datafusion/issues/22078
### Is your feature request related to a problem or challenge?

Currently the `DictionaryGroupValues` path is faster than `GroupValuesRows`, but there is still room for improvement. `seen_elements` stores the raw bytes of each element as a `Vec<u8>` inside an outer `Vec`, so every new element costs a separate heap allocation. The cost of these allocations is small, but it does show up as CPU time in `intern()`. The current collision handling also forces a copy, because the same bytes are stored in both `seen_elements` and `unique_dict_value_mapping`.

### Describe the solution you'd like

This can be resolved by storing the intermediate bytes in a single contiguous buffer and tracking offsets and lengths instead of raw bytes. We'd introduce a new field on the struct that holds the buffer, and `seen_elements` / `unique_dict_value_mapping` would only need to store an offset and a length per entry. This would replace a potentially large byte copy with two `i32`s.

### Describe alternatives you've considered

The alternative is to not change anything. Benchmarks show that even the current approach is faster than the default `GroupValuesRows` approach.

### Additional context

See #21765.

<img width="832" height="370" alt="Image" src="https://github.com/user-attachments/assets/c25c4f95-5aef-45a9-b6bc-01080e254ff9" />
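
For illustration, the idea described under "Describe the solution you'd like" could look roughly like the sketch below. It is not the existing DataFusion code: `ByteInterner`, `Span`, `buffer`, `spans`, and `buckets` are hypothetical names, the hash table is a plain `std` `HashMap` chosen for brevity, and the offsets are `u32` here where the issue mentions `i32` (both are 8 bytes per entry). The point is only that each distinct value is copied once into a shared buffer, and lookups compare against slices of that buffer.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Location of one distinct value inside the shared byte buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Span {
    offset: u32,
    len: u32,
}

/// Hypothetical interner (names are illustrative, not DataFusion's).
#[derive(Default)]
struct ByteInterner {
    /// All distinct values, stored back to back in one allocation.
    buffer: Vec<u8>,
    /// Group id -> where that group's bytes live in `buffer`.
    spans: Vec<Span>,
    /// Hash -> group ids sharing that hash (collision candidates).
    /// Simplification: a production table would avoid this inner Vec.
    buckets: HashMap<u64, Vec<usize>>,
}

impl ByteInterner {
    /// Borrow the bytes of a group without copying them.
    fn bytes(&self, group_id: usize) -> &[u8] {
        let span = self.spans[group_id];
        let start = span.offset as usize;
        &self.buffer[start..start + span.len as usize]
    }

    /// Return the group id for `value`, appending it to the buffer if new.
    fn intern(&mut self, value: &[u8]) -> usize {
        let mut hasher = DefaultHasher::new();
        hasher.write(value);
        let hash = hasher.finish();

        // Resolve collisions by comparing against slices of the buffer.
        if let Some(candidates) = self.buckets.get(&hash) {
            for &id in candidates {
                if self.bytes(id) == value {
                    return id;
                }
            }
        }

        // New value: one append to the shared buffer, no per-value Vec.
        assert!(
            self.buffer.len() + value.len() <= u32::MAX as usize,
            "buffer would exceed u32 offsets"
        );
        let span = Span {
            offset: self.buffer.len() as u32,
            len: value.len() as u32,
        };
        self.buffer.extend_from_slice(value);

        let id = self.spans.len();
        self.spans.push(span);
        self.buckets.entry(hash).or_default().push(id);
        id
    }
}
```

With this layout, interning `b"apple"`, `b"banana"`, and `b"apple"` again stores 11 bytes in `buffer` and two `Span` entries, and the second `b"apple"` resolves to the same group id as the first. Whether this actually reduces time spent in `intern()` would need to be confirmed with the existing benchmarks.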
