Re: [I] Push down entire hash table from HashJoinExec into scans [datafusion]

via GitHub Thu, 11 Sep 2025 23:35:11 -0700


Dandandan commented on issue #17171:
URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3280484021


   So my recommendation here would be:
   
   * Start out by sharing the hash table as is and filtering on that. For a 
small build side (at most a couple of MBs, which is roughly the current setting 
for "CollectLeft" mode) this should be fast: there is no cost of creating a 
bloom filter, there are no false positives, the memory is reused between the 
filter and the join (so it might stay in cache), and a lookup should take only 
a small number of instructions per value.
   * Look into cases where it isn't as fast. Maybe we can skip the equality 
check on a hash match (that is, accept some false positives)? I am not sure it 
is really needed though, since generally this is not super hot.
   * For larger build sides (not fitting in CPU caches), check whether we 
should use bloom filters to compress the table.
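
   To make the trade-off above concrete, here is a minimal sketch using only the 
Rust standard library. It is not DataFusion code: `BuildSide`, 
`filter_with_shared_table` and `Bloom` are hypothetical names, a plain 
`HashSet<i64>` stands in for the join's real hash table, and the two-probe 
Bloom filter is just one simple way to build a compact approximate filter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical stand-in for the build-side hash table of a join.
pub struct BuildSide {
    pub keys: HashSet<i64>,
}

/// Exact filter: reuses the shared build-side table, so nothing extra is
/// built and there are no false positives.
pub fn filter_with_shared_table(table: &Arc<BuildSide>, probe: &[i64]) -> Vec<bool> {
    probe.iter().map(|k| table.keys.contains(k)).collect()
}

/// Approximate filter: a tiny Bloom filter with two probes per key.
/// It is smaller than the table but may report false positives.
pub struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
}

impl Bloom {
    pub fn from_keys(keys: &HashSet<i64>, num_bits: u64) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        let mut bloom = Bloom { bits: vec![0; words], num_bits };
        for k in keys {
            for pos in bloom.positions(*k) {
                bloom.bits[(pos / 64) as usize] |= 1u64 << (pos % 64);
            }
        }
        bloom
    }

    /// Two bit positions derived from a single 64-bit hash of the key.
    fn positions(&self, key: i64) -> [u64; 2] {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let h1 = hasher.finish();
        let h2 = h1.rotate_left(32) | 1;
        [h1 % self.num_bits, h1.wrapping_add(h2) % self.num_bits]
    }

    /// Never returns false for a key that was inserted.
    pub fn might_contain(&self, key: i64) -> bool {
        self.positions(key)
            .iter()
            .all(|&pos| self.bits[(pos / 64) as usize] & (1u64 << (pos % 64)) != 0)
    }
}
```

   In this sketch the exact filter costs one hash lookup per probe value and 
needs no setup, while the Bloom filter has to be built once from the keys and 
only pays off when the table itself is too large to stay in cache.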


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


