[
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674767#comment-15674767
]
Reynold Xin commented on SPARK-18252:
-------------------------------------
For 3, the sketch package has no external dependencies. It was created
explicitly this way so that Bloom filters built in Spark can be used in other
applications without having to worry about dependency conflicts.
For 4, it is much easier to just create a vectorized version of the probing
code when all we are dealing with is a simple for loop.
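
To illustrate the "simple for loop" point, here is a minimal sketch of a
Bloom filter probe over a plain long[] bit vector. It is not the code in
org.apache.spark.util.sketch: the class name ProbeSketch, the mix() hash,
and the double-hashing scheme h1 + i * h2 are assumptions made for the
example. The point is only that each probe is one index computation plus one
bit test, which is the kind of loop that is straightforward to vectorize.

```java
// Illustrative sketch only; names and hash function are hypothetical,
// not taken from Spark's sketch package.
final class ProbeSketch {
    private final long[] words;   // dense bit vector, 64 bits per word
    private final long numBits;
    private final int numHashes;

    ProbeSketch(int numWords, int numHashes) {
        this.words = new long[numWords];
        this.numBits = (long) numWords * 64;
        this.numHashes = numHashes;
    }

    // Hypothetical 64-bit mixer standing in for a real hash function.
    private static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    // Derive the i-th bit position from two base hashes.
    private long bitIndex(long h1, long h2, int i) {
        return Long.remainderUnsigned(h1 + i * h2, numBits);
    }

    void put(long item) {
        long h1 = mix(item);
        long h2 = mix(h1);
        for (int i = 1; i <= numHashes; i++) {
            long idx = bitIndex(h1, h2, i);
            words[(int) (idx >>> 6)] |= 1L << (idx & 63);
        }
    }

    // The probing loop: one index computation and one bit test per hash.
    boolean mightContain(long item) {
        long h1 = mix(item);
        long h2 = mix(h1);
        for (int i = 1; i <= numHashes; i++) {
            long idx = bitIndex(h1, h2, i);
            if ((words[(int) (idx >>> 6)] & (1L << (idx & 63))) == 0) {
                return false;   // a required bit is clear: definitely absent
            }
        }
        return true;            // all bits set: possibly present
    }
}
```

With a compressed bitmap behind the filter, the bit test in mightContain
would become a lookup into a container structure instead of a single array
read, which is what makes the loop harder to vectorize.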
> Improve serialized BloomFilter size
> -----------------------------------
>
> Key: SPARK-18252
> URL: https://issues.apache.org/jira/browse/SPARK-18252
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.1
> Reporter: Aleksey Ponkin
> Priority: Minor
>
> Since version 2.0, Spark has had a BloomFilter implementation,
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
> implementation uses a custom class, org.apache.spark.util.sketch.BitArray,
> for the bit vector, which allocates memory for the whole filter no matter how
> many elements are set. Since a BloomFilter can be serialized and sent over the
> network during the distribution stage, maybe we need to use some kind of
> compressed Bloom filter? For example,
> RoaringBitmap (https://github.com/RoaringBitmap/RoaringBitmap) or
> javaewah (https://github.com/lemire/javaewah).
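
The size concern in the description can be shown with a small sketch. It
assumes a dense long[] bit vector written out word by word with a length
prefix; this layout and the name DenseSizeSketch are hypothetical and are not
Spark's actual serialization format. Under that assumption, the serialized
size depends only on the number of bits allocated, not on how many are set.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; the layout is an assumption, not Spark's format.
final class DenseSizeSketch {
    // Serialize a dense bit vector as: word count (int), then every word (long).
    static int serializedSize(long[] words) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(words.length);
            for (long w : words) {
                out.writeLong(w);   // written whether or not any bit is set
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        long[] empty = new long[1 << 14];    // no bits set
        long[] sparse = new long[1 << 14];   // a single bit set
        sparse[3] |= 1L << 5;
        System.out.println(serializedSize(empty));
        System.out.println(serializedSize(sparse));
    }
}
```

Both calls report the same size, which is the overhead a compressed bitmap
would aim to reduce for sparsely populated filters.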