Dandandan commented on issue #17171: URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3280484021
So my recommendation here would be:

* Start out by sharing the hash table as is, and filter on that. For a small build side (at most a couple of MB, which is roughly the current threshold for "CollectLeft" mode) this should be fast: there is no cost of creating a bloom filter, there are no false positives, the memory is reused between the filter and the join (so it might stay in cache), and it should take only a small number of instructions per value.
* Check for cases where it isn't as fast. Maybe we can skip the hash equality check there (i.e. accept some false positives)? I am not sure it is really needed though, as generally this is not super hot.
* For larger build sides (not fitting in CPU caches), check whether we should use bloom filters to compress the table.
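To make the trade-off concrete, here is a minimal sketch of the two options, using only the Rust standard library. All names (`filter_with_table`, `Bloom`, `may_contain`) are hypothetical and are not DataFusion APIs; a `HashSet<i64>` stands in for the join hash table, and the Bloom filter sizing and the two-position hashing scheme are assumptions chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Option 1 (hypothetical helper): reuse the build-side key set as the
/// probe-side filter. Exact membership, so no false positives, and no
/// extra structure has to be built.
fn filter_with_table(build_keys: &HashSet<i64>, probe: &[i64]) -> Vec<i64> {
    probe
        .iter()
        .copied()
        .filter(|k| build_keys.contains(k))
        .collect()
}

/// Option 2 (hypothetical type): a tiny Bloom filter that compresses the
/// build-side keys into a bit array. It may report false positives but
/// never false negatives.
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
}

impl Bloom {
    /// Rounds the requested size up to a whole number of 64-bit words.
    fn new(num_bits: u64) -> Self {
        let words = (num_bits.max(64) + 63) / 64;
        Bloom {
            bits: vec![0; words as usize],
            num_bits: words * 64,
        }
    }

    /// Derives two bit positions from a single 64-bit hash of the key.
    fn positions(&self, key: i64) -> [u64; 2] {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let h1 = hasher.finish();
        let h2 = h1.rotate_left(32) | 1;
        [h1 % self.num_bits, h1.wrapping_add(h2) % self.num_bits]
    }

    fn insert(&mut self, key: i64) {
        for p in self.positions(key) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// Returns false only if the key was definitely never inserted.
    fn may_contain(&self, key: i64) -> bool {
        self.positions(key)
            .iter()
            .all(|p| self.bits[(*p / 64) as usize] & (1u64 << (*p % 64)) != 0)
    }
}
```

With option 1 the filter costs nothing beyond the table the join already needs. Option 2 pays a build cost and admits false positives in exchange for a much smaller footprint, which is why it is only worth checking once the build side no longer fits in CPU caches.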