[ https://issues.apache.org/jira/browse/COLLECTIONS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Claude Warren updated COLLECTIONS-824: -------------------------------------- Summary: BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change (was: BloomFilter: change name of SimpleHasher) > BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change > ---------------------------------------------------------------------------- > > Key: COLLECTIONS-824 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-824 > Project: Commons Collections > Issue Type: Improvement > Components: Collection > Affects Versions: 4.5 > Reporter: Claude Warren > Priority: Major > Labels: bloom-filter > > {noformat} > public boolean forEachIndex(IntPredicate consumer) { > Objects.requireNonNull(consumer, "consumer"); > int bits = shape.getNumberOfBits(); > /* > * Essentially this is computing a wrapped modulus from a > start point and an > * increment. So actually you only need two modulus > operations before the loop. > * This avoids any modulus operation inside the while loop. > It uses a long index > * to avoid overflow. > */ > long index = mod(initial, bits); > int inc = mod(increment, bits); > for (int functionalCount = 0; functionalCount < > shape.getNumberOfHashFunctions(); functionalCount++) { > if (!consumer.test((int) index)) { > return false; > } > index += inc; > index = index >= bits ? index - bits : index; > } > return true; > } > {noformat} > This can be fixed to be non zero by using a smaller modulus: > {noformat} > // Ensure non-zero increment > int inc = 1 + mod(increment, bits - 1); > {noformat} > This will prevent the 1/bits occurrence that the increment is zero and the > hash is poor (it will be a single index). > If bits == 1 then this will throw an exception. However if bits==1 this is an > edge case that can be handled immediately after obtaining the number of bits > as: > {noformat} > if (bits == 1) { > // No need to send k duplicates, the filter shape has > been constructed incorrectly. > return consumer.test(0); > } > {noformat} > https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144 > Here we have index as a long. However if the increment (which is always > positive) is subtracted we can use an int and compare to zero: > {noformat} > index -= inc; > index = index < 0 ? index + bits : index; > {noformat} > This does break a lot of unit tests. These tests have been created using > SimpleHasher knowing what the hash modulus will be. The tests should be > updated to use a FixedHasher in the test package: > {noformat} > /** > * To be used for testing only. The fixed indices must fit within the > filter shape. > */ > class FixedHasher implements Hasher { > private final int[] indices; > private int[] distinct; > FixedHasher(int... indices) { > this.indices = indices.clone(); > } > @Override > public IndexProducer indices(Shape shape) { > checkIndices(shape); > return IndexProducer.fromIndexArray(indices); > } > @Override > public IndexProducer uniqueIndices(Shape shape) { > checkIndices(shape); > int[] d = distinct; > if (d == null) { > distinct = d = Arrays.stream(indices).distinct().toArray(); > } > return IndexProducer.fromIndexArray(d); > } > private void checkIndices(Shape shape) { > // Check the hasher is OK for the shape > int bits = shape.getNumberOfBits(); > for (int i : indices) { > Assertions.assertTrue(i < bits); > } > } > } > {noformat} > This hasher will allow filters to be constructed and manipulated. Or change > it to use % bits and drop the assertion in checkIndices. > The SimpleHasher should just be tested that it outputs random indices. I > would suggest creating a few thousand randomly seeded hashers, outputting 5 > indices from each and building a histogram of the counts of each index. This > should be uniform when tested with a Chi-square test. This can be repeated > for a few different shape sizes. > Updating the filter in this way will still implement a Krisch and > Mitzenmacher hasher where: > {noformat} > g_i(x) = (h1(x) + i h2(x)) mod m > {noformat} > The only difference is the subtraction rather than addition. > The wikipedia article on [Double Hashing > (variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that > this type of hasher when used in Bloom filters will result in a collision > rate that is twice as likely as intended. > An alternative is to provide enhanced double hashing which changes the > increment each loop iteration. This would require some additional work on > wrapping the modulus of the increment. > It may be better to change the name to KMHasher and more explicitly state > that the hash functions are: > > {noformat} > g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k] > {noformat} > -- This message was sent by Atlassian Jira (v8.20.7#820007)