Dandandan opened a new issue, #18376: URL: https://github.com/apache/datafusion/issues/18376
### Is your feature request related to a problem or challenge?

If the build side of the join is large, building the hash table can be a significant bottleneck. We can explore some opportunities to improve the performance of building this map.

### Describe the solution you'd like

**Core Idea**

The slowest part of building the hash map is finding and then inserting the items (hash + offset) into the map for each element. We should be able to test the following:

* Sort the items by hash (and offset) so that duplicate hashes become adjacent and can be deduplicated (this introduces some overhead, but the hope is we win it back while building the hash map)
* Use [`insert_unique`](https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.insert_unique) rather than [`entry`](https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.entry) for the first entry of each hash
* Keep reusing the previous entry for duplicated elements

If this doesn't introduce any regressions, there are further opportunities to improve performance by using the sorted property to improve the "chain" data structure as well.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
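To make the idea concrete, here is a minimal, self-contained sketch of the proposed build strategy. All names (`build_join_map`, the `next` chain vector) are illustrative, not DataFusion's actual code, and `std::collections::HashMap` stands in for hashbrown's `HashTable` (where the first-occurrence insert would be `insert_unique`, skipping the equality probe that `entry` performs):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: `hashes[i]` is the hash of build-side row `i`.
/// Returns a map from hash to chain head (1-based row index, 0 = empty)
/// plus a `next` vector linking duplicate rows, mirroring the "chain"
/// layout commonly used in hash joins.
fn build_join_map(hashes: &[u64]) -> (HashMap<u64, u64>, Vec<u64>) {
    // 1. Sort (hash, offset) pairs so duplicate hashes become adjacent.
    let mut pairs: Vec<(u64, u64)> = hashes
        .iter()
        .enumerate()
        .map(|(i, &h)| (h, i as u64))
        .collect();
    pairs.sort_unstable();

    // `next[i]` points (1-based) at the previous row with the same hash;
    // 0 terminates the chain.
    let mut next = vec![0u64; hashes.len()];
    let mut map: HashMap<u64, u64> = HashMap::with_capacity(hashes.len());

    let mut iter = pairs.iter().peekable();
    while let Some(&(hash, offset)) = iter.next() {
        // First occurrence of this hash: exactly one map insert.
        // (With hashbrown's HashTable this is where `insert_unique`
        // would replace `entry`.)
        let mut head = offset + 1;
        // Duplicates follow immediately thanks to the sort: extend the
        // chain without touching the map again.
        while let Some(&&(h2, off2)) = iter.peek() {
            if h2 != hash {
                break;
            }
            iter.next();
            next[off2 as usize] = head;
            head = off2 + 1;
        }
        map.insert(hash, head);
    }
    (map, next)
}

fn main() {
    let hashes = [10, 20, 10, 30, 10];
    let (map, next) = build_join_map(&hashes);
    // Chain for hash 10 walks rows 4 -> 2 -> 0 (stored 1-based).
    assert_eq!(map[&10], 5);
    assert_eq!(next, vec![0, 0, 1, 0, 3]);
    println!("map={:?} next={:?}", map, next);
}
```

The key property is that each distinct hash touches the map exactly once; all duplicates are handled by the `next` chain, which is also where the "further opportunities" from sortedness would apply.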
