Claude Warren created COLLECTIONS-821:
-----------------------------------------

             Summary: BloomFilter: replace ArrayTracker
                 Key: COLLECTIONS-821
                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-821
             Project: Commons Collections
          Issue Type: Improvement
          Components: Collection
    Affects Versions: 4.5
            Reporter: Claude Warren


h3. 
!https://avatars.githubusercontent.com/u/886334?s=48&v=4|width=24,height=24!  
*[aherbert|https://github.com/aherbert]*  [on 27 
Feb|https://github.com/apache/commons-collections/pull/258#discussion_r813388387]
The Hasher.ArrayTracker class could be replaced with an open addressed hash 
table. I have created an implementation that is very efficient and can handle 
up to 2^29 unique indices (with a load factor of 0.5). It is approximately 7x 
faster than TreeSet for add operations, even without explicitly setting the 
upper capacity. It is only outperformed by a BitMap type array when the density 
of indices exceeds 2-4 bits per long which matches up to how the IndexFilter is 
choosing the implementation type based on size. I added the simple array 
tracker to the benchmark as it is still faster when the number of hash 
functions is very small (e.g. 5).

The same int hash table can be used for the SparseBloomFilter but it would not 
support a saturated filter if the number of bits is above 2^29. The second 
caveat is that iteration of indices in order is slow, they are naturally 
unsorted. Iteration in unsorted raw form is fast (as fast as TreeMap) but 
sorted requires extracting the indices and sorting them which is about 4-6x 
slower than TreeMap iteration. The sorted form is required to support the 
BitMapProducer interface. This then becomes a question of whether to optimise a 
sparse filter for merge and contains operations or for bitmap iteration 
(required for set operations).

Currently this and the BitMapTracker are public. I would move them to package 
private implementation details (i.e. out of this class). This allows them to be 
changed later. All access should be through {{{}IndexTracker.create{}}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to