Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: 
https://github.com/apache/commons-collections/pull/83#issuecomment-549147466
 
 
   I built a test suite to see how fast the various options are.  I used the 
GeoNames database as a datasource and built hashed the names of the first 10^6 
entries.
   I took a sample comprising every 1000th entry.
   
   I ran populations of 10^2 thru 10^6 entries stored in the bloom filter.
   The filter was defined as having 10^4 entries with a collision probability 
of 5x10^4.
   An MD5 hash was used.
   
   All hashes were created outside the timing loops.
   The values we are looking for are called "target" below.  The filter we are 
comparing with is called "candidate" below.
   
   There were 4 tests:
   **Int** - Where an array of integers representing the ON bits for the target 
where checked against a bitSet built from the candidate.  Only the comparison 
was timed, bitset and array of ints were created outside of the timing loop.  
Duplicate values were removed from the target array.
   **long** - Where arrays of longs representing the ON bits for the target 
were compared against an array of longs for the candidate.  Only the comparison 
was timed, long arrays were created outside of the timing loop.
   **hasher** - Each target was encoded into a StaticHasher (basically a set of 
integers) and then the hasher was compared with the candidate Bloom filter 
using the BloomFilter.contains( Hasher ) method.
   **filter** - Each target Bloom filter was constructed using the 
BitSetBloomFilter and then the BloomFilter.contains( BloomFilter ) was called 
on the candidate.  The construction of the target Bloom filter was outside the 
timing loop.
   
   Each test used the 1000 samples 5 times and an average taken.  The result is 
that The int and the long checks are so fast that they rarely take 1 
millisecond.  The hasher check is slightly slower but still less than 1 
millisecond on average.  The full filter creation is slower but never took more 
than 94 milliseconds on average.
   
   I have attached csv files of the data from the runs and the summary in a zip 
file.  If you want the test code please let me know.
   
   
[speedTest.zip](https://github.com/apache/commons-collections/files/3801701/speedTest.zip)
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to