kudhru commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-861695720
> The implementation here seems to be obviously looking correct, but could we add some unit test for it? (e.g. in `BloomFilterSuite.scala`)

Added a test as suggested.

> In addition, it looks useful, but just wondering what's the motivation to add this? Is there any future change in Spark depending on this? I just checked [Guava's BloomFilter does not support this bitwise AND as well](https://guava.dev/releases/20.0/api/docs/com/google/common/hash/BloomFilter.html).

I was implementing [this paper](https://dl.acm.org/doi/10.1145/3267809.3267834) on filtering out the non-overlapping keys from the tables participating in a SQL join operation, but I could not find the AND function needed to combine the Bloom filters built for each joining table. I also looked at how the authors of this paper [implemented this filtering](https://github.com/approxjoin/benchmarks/blob/master/micro-benchs/src/main/scala/ApproxJoinFlitering.scala) and found that they used a [third-party Bloom filter library](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/util/BloomFilter.scala). Hence, I thought it would be worthwhile to have this functionality in the Spark repo itself. Let me know if this sounds ok or if you have any doubts or concerns.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
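For readers unfamiliar with the idea, the bitwise AND discussed above can be sketched with a toy Bloom filter. This is hypothetical illustration code, not Spark's actual `org.apache.spark.util.sketch.BloomFilter` API: intersecting the bit arrays of two filters (built with the same size and hash configuration) yields a filter that only reports "might contain" for keys present in both inputs, which is what lets each side of a join prune rows whose keys cannot match.

```java
import java.util.BitSet;

// Toy Bloom filter (hypothetical; names like ToyBloomFilter and
// intersectInPlace are illustrative, not Spark's API).
class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(long item, int i) {
        int h1 = Long.hashCode(item);
        int h2 = Long.hashCode(item * 0x9E3779B97F4A7C15L);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(long item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    boolean mightContain(long item) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(item, i))) return false;
        return true;
    }

    // Bitwise AND of the two bit arrays. Afterwards this filter can only
    // report "might contain" for items whose bits were set in BOTH
    // filters; items inserted into both are still guaranteed to pass.
    void intersectInPlace(ToyBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes)
            throw new IllegalArgumentException("incompatible filters");
        bits.and(other.bits);
    }

    public static void main(String[] args) {
        // Keys of two hypothetical join sides: {1,2,3} and {2,3,4}.
        ToyBloomFilter left = new ToyBloomFilter(1024, 3);
        ToyBloomFilter right = new ToyBloomFilter(1024, 3);
        left.put(1L); left.put(2L); left.put(3L);
        right.put(2L); right.put(3L); right.put(4L);

        left.intersectInPlace(right);
        // Keys in both sides always pass; keys in only one side are
        // very likely (but not guaranteed) to be filtered out.
        System.out.println("2 might match: " + left.mightContain(2L));
        System.out.println("3 might match: " + left.mightContain(3L));
    }
}
```

With the intersected filter, each table can drop rows whose join keys fail `mightContain` before the shuffle, at the cost of the usual Bloom-filter false positives (a non-matching key may still pass).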
