rkrishn7 commented on issue #17171: URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3313002604
Wanted to bring in some discussion here around the tradeoffs in how exactly we should push down hashes, building on https://github.com/apache/datafusion/pull/17529#discussion_r2353830615.

On the build side, each partition maintains its own hash map, which is important since the probe side for that partition uses the same hash map to discover matches. When pushing down these hashes to the scan node, _all_ of the hashes must be available to check against, because partitioning only happens later in the query plan (namely, during re-partitioning). That is, a table scan can yield rows belonging to any partition that is established later for the join columns.

I think there are a few ways to go about it:

1. Push down a list of hash maps, one per partition. Since we know the repartition seed, we can determine which hash map in the list corresponds to a given row in O(1), then check that map for the hash.
   - This approach uses the same amount of memory, but I found it to perform substantially worse than the next option. My guess is that hashbrown hash maps are cache friendly _within_ a single map, and the pointer indirection and scattered memory access that come from looking up any one of N maps cause the performance hit.
2. As build maps complete, dump their hashes into a single hash set that contains all valid hashes for the probe side.
   - This consumes more memory (~ total # of build rows * sizeof(u64)). However, I found it to be much more performant, even when testing across different scale factors.
3. Alter the join hash map to carry partition information. Build sides still operate independently; once complete, they drain into a separate map indexed by (hash, partition).
   - My gut reaction is that this is probably not worth it, since we'd still be storing an extra usize per hash/partition, and I think it would introduce some contention on the probe side.

Currently, I'm thinking we should go with option 2 (which is what I've done in https://github.com/apache/datafusion/pull/17529). Since this feature would be opt-in, we could also provide a max memory limit when building the separate hash set, to mitigate the effects of bad join orders and large build sides.

Would appreciate other ideas and/or thoughts from folks here. cc @adriangb @Dandandan
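For concreteness, here is a minimal sketch of what option 2 could look like. All names here are hypothetical illustrations, not DataFusion APIs; in particular the byte cap is the naive "payload bytes only" accounting, ignoring hash-table overhead:

```rust
use std::collections::HashSet;

/// Hypothetical sketch of option 2: as each partition's build side
/// completes, its hashes are drained into one shared set that the
/// scan can probe regardless of which partition a row later hashes to.
struct ProbeFilter {
    hashes: HashSet<u64>,
    max_bytes: usize, // opt-in memory cap for the pushed-down set
    disabled: bool,   // set when the cap is exceeded
}

impl ProbeFilter {
    fn new(max_bytes: usize) -> Self {
        Self { hashes: HashSet::new(), max_bytes, disabled: false }
    }

    /// Called once per partition when its build-side hash map finishes.
    fn absorb_partition(&mut self, partition_hashes: impl IntoIterator<Item = u64>) {
        if self.disabled {
            return;
        }
        for h in partition_hashes {
            self.hashes.insert(h);
            // ~8 bytes of payload per u64 hash (table overhead ignored)
            if self.hashes.len() * std::mem::size_of::<u64>() > self.max_bytes {
                // Cap exceeded (e.g. bad join order, huge build side):
                // drop the set and fall back to scanning everything.
                self.hashes.clear();
                self.disabled = true;
                return;
            }
        }
    }

    /// Probe/scan side: keep a row only if its join-key hash might
    /// match some build row. When disabled, keep everything.
    fn might_match(&self, hash: u64) -> bool {
        self.disabled || self.hashes.contains(&hash)
    }
}
```

The probe side then makes a single lookup in one contiguous table per row, which is where the cache-friendliness advantage over the per-partition-map routing of option 1 would come from.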