adriangb commented on issue #17171: URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3266394775
I'm keen to try bloom filters (hence https://github.com/apache/datafusion/issues/16435). I don't have numbers to back it up (it's not implemented) but: 1. Like Andrew said several other systems use bloom filters for precisely this purpose. 2. From the CMU lecture above: "Create a Bloom Filter during the build phase when the key is likely to not exist in the hash table [4]. Threads check the filter before probing the hash table. This will be faster since the filter will fit in CPU caches."; this is not a benchmark on our system but the logic is sound to me, bloom filters are very efficient data structures. 3. Bloom filters can be serialized across the wire and can be represented in a platform agnostic way so they are compatible with distributed users of DataFusion. I think the main argument *against* bloom filters and *in favor of* pushing down a reference to the hash table itself is that building a bloom filter can be expensive (I remember @Dandandan making this point). My counter argument to this would be that this is only a problem if the size of your build side ≈ the size of your probe side, but if that's the case you already probably have a suboptimal or slow query (and maybe a hash join wasn't even the right choice). If your build side is small then the overhead of building a bloom filter should be small relative to the speedup you get if lookups are even 15% faster than a hash table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org