[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674532#comment-15674532 ]
Reynold Xin commented on SPARK-18252:
-------------------------------------

I'm not sure it is worth fixing this:

1. We already compress data before sending it across the network.
2. This is not a backward-compatible change and would require separate versioning.
3. It brings an extra dependency into a package that currently has zero external dependencies.
4. We will very likely implement vectorized probing for the Bloom filter used in Spark SQL joins, and using a roaring bitmap would make that a lot harder to do.

> Improve serialized BloomFilter size
> -----------------------------------
>
>                 Key: SPARK-18252
>                 URL: https://issues.apache.org/jira/browse/SPARK-18252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aleksey Ponkin
>            Priority: Minor
>
> Since version 2.0, Spark has a BloomFilter implementation,
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
> implementation uses the custom class org.apache.spark.util.sketch.BitArray
> for its bit vector, which allocates memory for the whole filter no matter how
> many elements are set. Since a BloomFilter can be serialized and sent over the
> network during the distribution stage, maybe we should use some kind of
> compressed Bloom filter? For example,
> [RoaringBitmap|https://github.com/RoaringBitmap/RoaringBitmap] or
> [javaewah|https://github.com/lemire/javaewah].

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)
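To make the size argument in the ticket concrete, here is a minimal, hypothetical sketch (not Spark's actual BitArray or RoaringBitmap code) comparing a dense bit-array encoding, which costs one bit per slot regardless of how many bits are set, against a naive sparse encoding that stores only the indices of set bits. The constants and the sparse wire format are illustrative assumptions.

```python
# Illustrative only: dense vs. sparse serialization of a mostly-empty bit vector.
import struct

NUM_BITS = 1_000_000            # filter sized for many expected insertions
set_bits = [7, 4242, 99_999]    # ...but only a few elements were actually added

# Dense encoding, like a plain bit array: one byte per 8 bits,
# independent of how many bits are set.
dense_size = NUM_BITS // 8      # 125000 bytes

# Hypothetical sparse encoding: a 4-byte count followed by one
# 4-byte index per set bit.
sparse = struct.pack(f"<I{len(set_bits)}I", len(set_bits), *set_bits)
sparse_size = len(sparse)       # 16 bytes

print(dense_size, sparse_size)
```

Real compressed bitmap libraries such as RoaringBitmap use container-based encodings that also stay compact for dense regions, but the sketch shows why the sparse case motivates the ticket, and why (per comment 4) a variable-length format complicates fixed-stride vectorized probing.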