rkrishn7 commented on issue #17171: URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3313002604
Wanted to bring in some discussion here around the tradeoffs in how exactly we should push down hashes, building on https://github.com/apache/datafusion/pull/17529#discussion_r2353830615.

On the build side, each partition maintains its own hash map, which is important since the probe side for that partition uses the same hash map to discover matches. When pushing down these hashes to the scan node, _all_ of the hashes must be available to check against, because partitioning only happens later in the query plan (namely, during re-partitioning). That is, a table scan can yield rows belonging to any partition that is established later for the join columns.

I think there are a few ways to go about it:

1. Push down a list of hash maps, one per partition. Since we know the repartition seed, we can determine which hash map in the list corresponds to a given row in O(1), then check that map for the hash.
   - This approach uses the same amount of memory, but I found it to perform substantially worse than the next option. My guess is that hashbrown hash maps are cache friendly _within_ a single map, and the pointer indirection and scattered memory access that come from looking up any one of N maps cause the performance hit.
2. As build maps complete, dump their hashes into a single hash set that contains all valid hashes for the probe side.
   - This consumes more memory (~ total # of build rows * sizeof(u64)). However, I found it to be much more performant, even when testing across different scale factors.
3. Alter the join hash map to carry partition information. Build sides still operate independently; once complete, they drain into a separate map indexed by (hash, partition).
   - My gut reaction is that this is probably not worth it, since we'd still be storing an extra usize per hash/partition, and I think it would introduce some contention on the probe side.

Currently, I'm thinking we should go with option 2 (which is what I've done in https://github.com/apache/datafusion/pull/17529). Since this feature would be opt-in, we could also provide a max memory limit when building the separate hash set, to mitigate the effects of bad join orders and large build sides.

Would appreciate other ideas and/or thoughts from folks here. cc @adriangb @Dandandan
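For concreteness, here is a minimal sketch of what option 2 could look like. All names here are hypothetical illustrations, not DataFusion APIs; in particular the byte cap is the naive "payload bytes only" accounting, ignoring hash-table overhead:

```rust
use std::collections::HashSet;

/// Hypothetical sketch of option 2: as each partition's build side
/// completes, its hashes are drained into one shared set that the
/// scan can probe regardless of which partition a row later hashes to.
struct ProbeFilter {
    hashes: HashSet<u64>,
    max_bytes: usize, // opt-in memory cap for the pushed-down set
    disabled: bool,   // set when the cap is exceeded
}

impl ProbeFilter {
    fn new(max_bytes: usize) -> Self {
        Self { hashes: HashSet::new(), max_bytes, disabled: false }
    }

    /// Called once per partition when its build-side hash map finishes.
    fn absorb_partition(&mut self, partition_hashes: impl IntoIterator<Item = u64>) {
        if self.disabled {
            return;
        }
        for h in partition_hashes {
            self.hashes.insert(h);
            // ~8 bytes of payload per u64 hash (table overhead ignored)
            if self.hashes.len() * std::mem::size_of::<u64>() > self.max_bytes {
                // Cap exceeded (e.g. bad join order, huge build side):
                // drop the set and fall back to scanning everything.
                self.hashes.clear();
                self.disabled = true;
                return;
            }
        }
    }

    /// Probe/scan side: keep a row only if its join-key hash might
    /// match some build row. When disabled, keep everything.
    fn might_match(&self, hash: u64) -> bool {
        self.disabled || self.hashes.contains(&hash)
    }
}
```

The probe side then makes a single lookup in one contiguous table per row, which is where the cache-friendliness advantage over the per-partition-map routing of option 1 would come from.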