Rich-T-kid commented on PR #21765: URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4401926750
Benchmarks

<img width="941" height="1230" alt="Image 5-7-26 at 7 20 PM" src="https://github.com/user-attachments/assets/50125d91-f674-41e0-a030-8979170ad79c" /> <img width="1131" height="1308" alt="Image 5-7-26 at 7 20 PM (1)" src="https://github.com/user-attachments/assets/1596ff90-ac19-454a-91ba-b00f4642173f" /> <img width="955" height="565" alt="Image 5-7-26 at 7 20 PM (2)" src="https://github.com/user-attachments/assets/6eff0054-68c2-46bb-b24d-e692521caf59" />

The benchmarks in `physical-plan/benches/dictionary_group_values.rs` and `datafusion/benchmarks/dict.rs` both show a meaningful improvement, but I think there is still room to make this more efficient. One idea is to store the intermediate bytes in a single buffer rather than a vector of bytes, which would remove the double memory allocation that currently happens in `intern`. Another improvement is to cache the value hashes that are computed.
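
To make the two follow-up ideas concrete, here is a rough sketch of what they might look like. It is not the code in this PR: `FlatInterner`, `intern_dictionary`, and `hash_bytes` are hypothetical names, and it uses a plain `HashMap` from the standard library rather than whatever table the real implementation uses. It assumes "one buffer" means concatenated value bytes plus an offsets array, and that "caching" means hashing each dictionary value once per batch and reusing that hash for every key that points at it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Hypothetical interner: every distinct value is appended to one shared buffer.
pub struct FlatInterner {
    /// Concatenated bytes of all distinct values (a single growing allocation).
    bytes: Vec<u8>,
    /// Group `i` occupies `bytes[offsets[i]..offsets[i + 1]]`; always starts with 0.
    offsets: Vec<usize>,
    /// Hash -> group ids sharing that hash (a simplification for this sketch).
    table: HashMap<u64, Vec<usize>>,
}

/// Hash raw bytes with the standard library hasher (an assumption for the sketch).
fn hash_bytes(value: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(value);
    hasher.finish()
}

impl FlatInterner {
    pub fn new() -> Self {
        Self { bytes: Vec::new(), offsets: vec![0], table: HashMap::new() }
    }

    /// Number of distinct values (groups) seen so far.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Bytes of group `id`, sliced out of the shared buffer.
    pub fn get(&self, id: usize) -> &[u8] {
        &self.bytes[self.offsets[id]..self.offsets[id + 1]]
    }

    /// Look up or insert `value`, using a hash the caller already computed.
    fn intern_with_hash(&mut self, value: &[u8], hash: u64) -> usize {
        if let Some(candidates) = self.table.get(&hash) {
            for &id in candidates {
                // Compare bytes to guard against hash collisions.
                if self.get(id) == value {
                    return id;
                }
            }
        }
        // New value: append to the shared buffer instead of allocating a new Vec<u8>.
        let id = self.len();
        self.bytes.extend_from_slice(value);
        self.offsets.push(self.bytes.len());
        self.table.entry(hash).or_default().push(id);
        id
    }

    /// Intern one dictionary-encoded batch: `keys[row]` indexes into `values`.
    /// Each dictionary value is hashed once, however many rows reference it.
    pub fn intern_dictionary(&mut self, values: &[&[u8]], keys: &[usize]) -> Vec<usize> {
        let value_hashes: Vec<u64> = values.iter().map(|v| hash_bytes(v)).collect();
        keys.iter()
            .map(|&k| self.intern_with_hash(values[k], value_hashes[k]))
            .collect()
    }
}
```

With this layout a new group costs at most an amortized append to `bytes` and `offsets`, and the per-row work for a dictionary batch is a hash lookup plus a byte comparison rather than a fresh hash computation. Whether that pays off in practice would need to be checked against the same benchmarks.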
