adriangb commented on issue #17171:
URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3266394775

   I'm keen to try bloom filters (hence 
https://github.com/apache/datafusion/issues/16435). I don't have numbers to 
back it up (it's not implemented) but:
   1. Like Andrew said several other systems use bloom filters for precisely 
this purpose.
   2. From the CMU lecture above: "Create a Bloom Filter during the build phase 
when the key is likely to not exist in the hash
   table [4]. Threads check the filter before probing the hash table. This will 
be faster since the filter will fit in
   CPU caches."; this is not a benchmark on our system but the logic is sound 
to me, bloom filters are very efficient data structures.
   3. Bloom filters can be serialized across the wire and can be represented in 
a platform agnostic way so they are compatible with distributed users of 
DataFusion.
   
   I think the main argument *against* bloom filters and *in favor of* pushing 
down a reference to the hash table itself is that building a bloom filter can 
be expensive (I remember @Dandandan making this point). My counter argument to 
this would be that this is only a problem if the size of your build side ≈ the 
size of your probe side, but if that's the case you already probably have a 
suboptimal or slow query (and maybe a hash join wasn't even the right choice). 
If your build side is small then the overhead of building a bloom filter should 
be small relative to the speedup you get if lookups are even 15% faster than a 
hash table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to