kudhru commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-861695720


   > The implementation here seems to be obviously looking correct, but could 
we add some unit test for it? (e.g. in `BloomFilterSuite.scala`)
   > 
   
   Added a test as suggested.
   
   > In addition, it looks useful, but just wondering what's the motivation to 
add this? Is there any future change in Spark depending on this? I just checked 
[Guava's BloomFilter does not support this bitwise add as 
well](https://guava.dev/releases/20.0/api/docs/com/google/common/hash/BloomFilter.html).
   
   I was implementing [this 
paper](https://dl.acm.org/doi/10.1145/3267809.3267834) on filtering out the 
non-overlapping keys from the tables participating in a SQL join, but I 
could not find an AND function for combining the Bloom filters 
belonging to each joining table. I also looked at how the authors of the paper 
[implemented this 
filtering](https://github.com/approxjoin/benchmarks/blob/master/micro-benchs/src/main/scala/ApproxJoinFlitering.scala)
 and found that they used a [third-party Bloom filter 
library](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/util/BloomFilter.scala).
 Hence, I thought it would be worthwhile to have this functionality in the 
Spark repo itself. Let me know if this sounds OK or if you have any doubts or 
concerns.
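   To illustrate what the bitwise AND buys you, here is a minimal, self-contained sketch. This is *not* Spark's `org.apache.spark.util.sketch.BloomFilter`; the class and method names (`SimpleBloomFilter`, `intersect`) are illustrative only. The key property is one-sided: a key inserted into **both** input filters is guaranteed to pass the intersected filter (no false negatives), while keys present in only one table are filtered out up to the usual false-positive rate.

   ```scala
   import java.util.BitSet

   // Illustrative toy Bloom filter; not Spark's implementation.
   class SimpleBloomFilter(val numBits: Int, val numHashes: Int) {
     val bits = new BitSet(numBits)

     // Derive k bit positions from two base hashes
     // (the standard double-hashing scheme).
     private def positions(item: Long): Seq[Int] = {
       val h1 = java.lang.Long.hashCode(item)
       val h2 = java.lang.Long.hashCode(java.lang.Long.reverse(item)) | 1
       (0 until numHashes).map(i => math.abs((h1 + i * h2) % numBits))
     }

     def put(item: Long): Unit = positions(item).foreach(i => bits.set(i))

     def mightContain(item: Long): Boolean =
       positions(item).forall(i => bits.get(i))

     // Bitwise AND: the result only keeps bits set in BOTH filters,
     // so it approximates the set intersection of the two key sets.
     def intersect(other: SimpleBloomFilter): SimpleBloomFilter = {
       require(numBits == other.numBits && numHashes == other.numHashes,
         "filters must have identical parameters")
       val result = new SimpleBloomFilter(numBits, numHashes)
       result.bits.or(this.bits)
       result.bits.and(other.bits)
       result
     }
   }
   ```

   In the join-filtering use case from the paper, each side of the join builds a filter over its join keys, and probing the AND of the two filters lets either side drop rows whose keys cannot appear in the other table before the shuffle.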
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



