[
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Ponkin updated SPARK-18252:
-----------------------------------
Description: Since version 2.0, Spark has included a BloomFilter implementation,
org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
implementation uses the custom class org.apache.spark.util.sketch.BitArray for
its bit vector, which allocates memory for the whole filter no matter how many
elements are set. Since a BloomFilter can be serialized and sent over the
network during the distribution stage, maybe we need some kind of compressed
bloom filter? For example, RoaringBitmap
(https://github.com/RoaringBitmap/RoaringBitmap) or javaewah
(https://github.com/lemire/javaewah). (was: Since version 2.0, Spark has
included a BloomFilter implementation,
org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
implementation uses the custom class org.apache.spark.util.sketch.BitArray,
which allocates memory for the filter at the beginning. So even filters with a
small number of inserted elements will be pretty large when they need to be
serialized. Is there any interest in using RoaringBitmap
(https://github.com/RoaringBitmap/RoaringBitmap) or javaewah
(https://github.com/lemire/javaewah) to compress bloom filters, or maybe in
compressing them during the serialization stage?)
> Improve serialized BloomFilter size
> -----------------------------------
>
> Key: SPARK-18252
> URL: https://issues.apache.org/jira/browse/SPARK-18252
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.1
> Reporter: Aleksey Ponkin
> Priority: Minor
>
> Since version 2.0, Spark has included a BloomFilter implementation,
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
> implementation uses the custom class org.apache.spark.util.sketch.BitArray
> for its bit vector, which allocates memory for the whole filter no matter how
> many elements are set. Since a BloomFilter can be serialized and sent over
> the network during the distribution stage, maybe we need some kind of
> compressed bloom filter? For example, RoaringBitmap
> (https://github.com/RoaringBitmap/RoaringBitmap) or javaewah
> (https://github.com/lemire/javaewah).
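
To make the size concern concrete, here is a rough sketch. It is not Spark's
BitArray or its serialization code: the class BloomSizeSketch and all of its
methods are hypothetical names made up for illustration, and the JDK's
java.util.zip.Deflater stands in for a real compressed bitmap such as
RoaringBitmap or javaewah. It only assumes that the bit vector is a dense
long[] sized up front, and compares the raw payload with a compressed one when
few bits are set.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Hypothetical illustration only; not part of Spark.
final class BloomSizeSketch {

    // Dense layout: one long per 64 bits, allocated up front
    // regardless of how many bits will ever be set.
    static long[] newDenseBits(long numBits) {
        return new long[(int) ((numBits + 63) / 64)];
    }

    // Set a single bit in the dense word array.
    static void set(long[] words, long index) {
        words[(int) (index >>> 6)] |= 1L << (int) (index & 63);
    }

    // Raw payload size if every word is written out: 8 bytes per word.
    static int denseBytes(long[] words) {
        return words.length * Long.BYTES;
    }

    // Payload size after DEFLATE compression of the same words.
    // Deflater is a stand-in here, not a proposal from the issue.
    static int deflatedBytes(long[] words) {
        ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
        for (long w : words) {
            buf.putLong(w);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(buf.array());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        long numBits = 1L << 20;            // a filter with about one million bits
        long[] words = newDenseBits(numBits);
        for (long i = 0; i < 100; i++) {    // only a handful of bits are set
            set(words, (i * 7919L) % numBits);
        }
        System.out.println("dense bytes:    " + denseBytes(words));
        System.out.println("deflated bytes: " + deflatedBytes(words));
    }
}
```

With 2^20 bits the dense payload is always 131072 bytes, however few bits are
set, while the compressed payload for a nearly empty filter is much smaller.
The gap narrows as the filter fills up, so any real change would need
measurements on realistically loaded filters.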