Dandandan opened a new issue, #18376: URL: https://github.com/apache/datafusion/issues/18376
### Is your feature request related to a problem or challenge?

If the build side of the join is large, building the hash table can be a significant bottleneck. We can explore some opportunities to improve the performance of building this map.

### Describe the solution you'd like

**Core Idea**

The slowest part of building the hash map is finding and then inserting the items (hash + offset) into the map for each element. We should be able to test the following:

* Sort the items by hash (and offset) so that duplicate hashes become adjacent and can be deduplicated (this introduces some overhead, but the hope is we win it back while building the hash map)
* Use [`insert_unique`](https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.insert_unique) rather than [`entry`](https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.entry) for the first entry of each hash
* Keep reusing the previous entry for duplicated elements

If this doesn't introduce any regressions, there are further opportunities to improve performance by using the sorted property to improve the "chain" data structure as well.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
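To make the idea concrete, here is a minimal, self-contained sketch of the proposed build strategy. All names (`build_join_map`, the `next` chain vector) are illustrative, not DataFusion's actual code, and `std::collections::HashMap` stands in for hashbrown's `HashTable` (where the first-occurrence insert would be `insert_unique`, skipping the equality probe that `entry` performs):

```rust
use std::collections::HashMap;

/// Hypothetical sketch: `hashes[i]` is the hash of build-side row `i`.
/// Returns a map from hash to chain head (1-based row index, 0 = empty)
/// plus a `next` vector linking duplicate rows, mirroring the "chain"
/// layout commonly used in hash joins.
fn build_join_map(hashes: &[u64]) -> (HashMap<u64, u64>, Vec<u64>) {
    // 1. Sort (hash, offset) pairs so duplicate hashes become adjacent.
    let mut pairs: Vec<(u64, u64)> = hashes
        .iter()
        .enumerate()
        .map(|(i, &h)| (h, i as u64))
        .collect();
    pairs.sort_unstable();

    // `next[i]` points (1-based) at the previous row with the same hash;
    // 0 terminates the chain.
    let mut next = vec![0u64; hashes.len()];
    let mut map: HashMap<u64, u64> = HashMap::with_capacity(hashes.len());

    let mut iter = pairs.iter().peekable();
    while let Some(&(hash, offset)) = iter.next() {
        // First occurrence of this hash: exactly one map insert.
        // (With hashbrown's HashTable this is where `insert_unique`
        // would replace `entry`.)
        let mut head = offset + 1;
        // Duplicates follow immediately thanks to the sort: extend the
        // chain without touching the map again.
        while let Some(&&(h2, off2)) = iter.peek() {
            if h2 != hash {
                break;
            }
            iter.next();
            next[off2 as usize] = head;
            head = off2 + 1;
        }
        map.insert(hash, head);
    }
    (map, next)
}

fn main() {
    let hashes = [10, 20, 10, 30, 10];
    let (map, next) = build_join_map(&hashes);
    // Chain for hash 10 walks rows 4 -> 2 -> 0 (stored 1-based).
    assert_eq!(map[&10], 5);
    assert_eq!(next, vec![0, 0, 1, 0, 3]);
    println!("map={:?} next={:?}", map, next);
}
```

The key property is that each distinct hash touches the map exactly once; all duplicates are handled by the `next` chain, which is also where the "further opportunities" from sortedness would apply.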
