Claude Warren created COLLECTIONS-821:
-----------------------------------------
Summary: BloomFilter: replace ArrayTracker
Key: COLLECTIONS-821
URL: https://issues.apache.org/jira/browse/COLLECTIONS-821
Project: Commons Collections
Issue Type: Improvement
Components: Collection
Affects Versions: 4.5
Reporter: Claude Warren
h3.
!https://avatars.githubusercontent.com/u/886334?s=48&v=4|width=24,height=24!
*[aherbert|https://github.com/aherbert]* [on 27
Feb|https://github.com/apache/commons-collections/pull/258#discussion_r813388387]
The Hasher.ArrayTracker class could be replaced with an open addressed hash
table. I have created an implementation that is very efficient and can handle
up to 2^29 unique indices (with a load factor of 0.5). It is approximately 7x
faster than TreeSet for add operations, even without explicitly setting the
upper capacity. It is only outperformed by a BitMap type array when the density
of indices exceeds 2-4 bits per long which matches up to how the IndexFilter is
choosing the implementation type based on size. I added the simple array
tracker to the benchmark as it is still faster when the number of hash
functions is very small (e.g. 5).
The same int hash table can be used for the SparseBloomFilter but it would not
support a saturated filter if the number of bits is above 2^29. The second
caveat is that iteration of indices in order is slow, they are naturally
unsorted. Iteration in unsorted raw form is fast (as fast as TreeMap) but
sorted requires extracting the indices and sorting them which is about 4-6x
slower than TreeMap iteration. The sorted form is required to support the
BitMapProducer interface. This then becomes a question of whether to optimise a
sparse filter for merge and contains operations or for bitmap iteration
(required for set operations).
Currently this and the BitMapTracker are public. I would move them to package
private implementation details (i.e. out of this class). This allows them to be
changed later. All access should be through {{{}IndexTracker.create{}}}.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)