Hongbo Xu created HADOOP-11829:
----------------------------------

             Summary: Improve the vector size of Bloom Filter from int to long, 
and storage from memory to disk
                 Key: HADOOP-11829
                 URL: https://issues.apache.org/jira/browse/HADOOP-11829
             Project: Hadoop Common
          Issue Type: Improvement
          Components: util
            Reporter: Hongbo Xu
            Assignee: Hongbo Xu
            Priority: Minor


org.apache.hadoop.util.bloom.BloomFilter(int vectorSize, int nbHash, int 
hashType) 
This filter can hold almost 900 million objects when the false positive 
probability is 0.0001, and it needs 2.1G of RAM.
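For context, the vector size for a given capacity and false positive probability follows the standard Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2. A quick check (a hypothetical helper, not part of this issue's code) reproduces the numbers above:

```java
// Standard Bloom filter sizing formulas; not part of the proposed patch,
// just a check of the figures quoted in this issue.
public class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2 bits for n items at false positive rate p.
    public static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2 hash functions.
    public static int optimalHashes(long n, long m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }
}
```

For n = 2 billion and p = 0.0001 this gives roughly 38.34 billion bits (about 4.8G of memory), which overflows a Java int vector size.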
In my project, I needed to build a filter with a capacity of 2 billion 
objects, which needs 4.7G of RAM; the required vector size is 38,340,233,509, 
which is out of the range of int, and I do not have that much RAM. So I 
rebuilt a big Bloom filter whose vector size is typed as long, split the bit 
data into several files on disk, and then distributed the files to the worker 
nodes; the performance is very good.
I think I can contribute this code to Hadoop Common, along with a 128-bit 
hash function (MurmurHash).
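The core idea — a bit vector addressed by a long index rather than an int — can be sketched as below. This is a minimal in-memory illustration under my own naming (the actual patch splits the bits across files on disk and uses 128-bit MurmurHash; here the segments are in-memory arrays and a simple double-hashing scheme stands in for MurmurHash):

```java
// Hypothetical sketch of a Bloom filter with a long vector size.
// Java arrays are int-indexed, so the bit vector is split into fixed-size
// segments, mirroring how the proposed patch splits bits across disk files.
public class LongBloomFilter {
    private static final int SEGMENT_BITS = 1 << 30; // bits per segment
    private final long[][] segments;                 // each long holds 64 bits
    private final long vectorSize;
    private final int nbHash;

    public LongBloomFilter(long vectorSize, int nbHash) {
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
        int nSegments = (int) ((vectorSize + SEGMENT_BITS - 1) / SEGMENT_BITS);
        this.segments = new long[nSegments][];
        for (int s = 0; s < nSegments; s++) {
            long bitsInSeg = Math.min(SEGMENT_BITS,
                    vectorSize - (long) s * SEGMENT_BITS);
            this.segments[s] = new long[(int) ((bitsInSeg + 63) / 64)];
        }
    }

    private void setBit(long bit) {
        int seg = (int) (bit / SEGMENT_BITS);
        int off = (int) (bit % SEGMENT_BITS);
        segments[seg][off / 64] |= 1L << (off % 64);
    }

    private boolean getBit(long bit) {
        int seg = (int) (bit / SEGMENT_BITS);
        int off = (int) (bit % SEGMENT_BITS);
        return (segments[seg][off / 64] & (1L << (off % 64))) != 0;
    }

    // Simple double hashing; the real patch would use 128-bit MurmurHash.
    private long hash(byte[] key, int i) {
        long h1 = 0, h2 = 0;
        for (byte b : key) {
            h1 = h1 * 31 + b;
            h2 = h2 * 131 + b;
        }
        return Math.floorMod(h1 + (long) i * h2, vectorSize);
    }

    public void add(byte[] key) {
        for (int i = 0; i < nbHash; i++) setBit(hash(key, i));
    }

    public boolean membershipTest(byte[] key) {
        for (int i = 0; i < nbHash; i++) {
            if (!getBit(hash(key, i))) return false;
        }
        return true;
    }
}
```

Replacing each in-memory segment with a memory-mapped or streamed file gives the disk-backed variant described above.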



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
