[
https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025647#comment-16025647
]
Hongbo Xu commented on HADOOP-11829:
------------------------------------
YES
> Improve the vector size of Bloom Filter from int to long, and storage from
> memory to disk
> -----------------------------------------------------------------------------------------
>
> Key: HADOOP-11829
> URL: https://issues.apache.org/jira/browse/HADOOP-11829
> Project: Hadoop Common
> Issue Type: Improvement
> Components: util
> Reporter: Hongbo Xu
> Assignee: Hongbo Xu
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> org.apache.hadoop.util.bloom.BloomFilter(int vectorSize, int nbHash, int hashType)
> This filter can hold at most roughly 900 million objects at a false-positive
> probability of 0.0001, and it needs about 2.1 GB of RAM.
> In my project I needed to build a filter with a capacity of 2 billion objects,
> which needs about 4.7 GB of RAM; the required vector size is 38340233509,
> which is outside the range of int, and I do not have that much RAM on a single
> machine. So I rebuilt a big Bloom filter whose vector size is a long, split the
> bit data into several files on disk, and distributed those files to the worker
> nodes; the performance is very good.
> I think I can contribute this code to Hadoop Common, together with a 128-bit
> hash function (MurmurHash).
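
For anyone skimming the issue, below is a minimal, self-contained sketch of the approach described above: a Bloom filter whose vector size is a long, with the bit vector split into fixed-size segments and the k bit positions derived from a single 128-bit hash. This is not the patch itself; the class name LongBloomFilterSketch, the segment size, and the use of the JDK's MD5 in place of the 128-bit MurmurHash are illustrative assumptions, and the segments stay in memory here instead of being written to files on disk as the proposal describes.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Minimal sketch of a Bloom filter whose vector size is a long. The bit
 * vector is split into fixed-size segments; in the actual proposal each
 * segment would live in its own file on disk, here they are in-memory
 * long[] arrays so the example stays self-contained.
 */
public class LongBloomFilterSketch {

    // Bits per segment (2^30 bits = 128 MB); an illustrative choice, not from the patch.
    private static final long SEGMENT_BITS = 1L << 30;

    private final long vectorSize;   // total bits, may exceed Integer.MAX_VALUE
    private final int nbHash;        // number of hash functions
    private final long[][] segments; // packed bits, one long[] per segment

    public LongBloomFilterSketch(long vectorSize, int nbHash) {
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
        int segmentCount = (int) ((vectorSize + SEGMENT_BITS - 1) / SEGMENT_BITS);
        this.segments = new long[segmentCount][];
        for (int i = 0; i < segmentCount; i++) {
            long bits = Math.min(SEGMENT_BITS, vectorSize - (long) i * SEGMENT_BITS);
            segments[i] = new long[(int) ((bits + 63) / 64)];
        }
    }

    /** m = ceil(-n * ln(p) / (ln 2)^2): the standard Bloom filter sizing formula. */
    public static long optimalVectorSize(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    public void add(byte[] key) {
        long[] h = hash128(key);
        for (int i = 0; i < nbHash; i++) {
            setBit(index(h, i));
        }
    }

    public boolean membershipTest(byte[] key) {
        long[] h = hash128(key);
        for (int i = 0; i < nbHash; i++) {
            if (!getBit(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    /** Derive the i-th bit position from a 128-bit hash by double hashing. */
    private long index(long[] h, int i) {
        return ((h[0] + (long) i * h[1]) & Long.MAX_VALUE) % vectorSize;
    }

    /**
     * 128-bit hash of the key. The JDK's MD5 stands in here for the 128-bit
     * MurmurHash mentioned in the issue, purely to avoid extra dependencies.
     */
    private static long[] hash128(byte[] key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key);
            long h1 = 0, h2 = 0;
            for (int i = 0; i < 8; i++) {
                h1 = (h1 << 8) | (d[i] & 0xFF);
                h2 = (h2 << 8) | (d[i + 8] & 0xFF);
            }
            return new long[] { h1, h2 };
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private void setBit(long bit) {
        segments[(int) (bit / SEGMENT_BITS)][(int) ((bit % SEGMENT_BITS) / 64)] |= 1L << (bit % 64);
    }

    private boolean getBit(long bit) {
        return (segments[(int) (bit / SEGMENT_BITS)][(int) ((bit % SEGMENT_BITS) / 64)]
                & (1L << (bit % 64))) != 0;
    }

    public static void main(String[] args) {
        // Sizing for the numbers in the issue: 2 billion keys at p = 0.0001.
        long m = optimalVectorSize(2_000_000_000L, 0.0001);
        System.out.println(m + " bits needed, ~" + (m / 8.0 / (1L << 30)) + " GB; int max is " + Integer.MAX_VALUE);

        // Tiny demo filter so the example runs without several GB of RAM.
        LongBloomFilterSketch f = new LongBloomFilterSketch(1_000_000L, 7);
        f.add("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(f.membershipTest("hello".getBytes(StandardCharsets.UTF_8)));  // true
        System.out.println(f.membershipTest("world".getBytes(StandardCharsets.UTF_8)));  // almost certainly false
    }
}

One design note on this sketch: keeping each segment under 2^31 bits means a segment can still be addressed with int offsets internally (and written to a single file), while the total vector size and all external indexing use long.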
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]