[jira] [Commented] (COLLECTIONS-824) BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

Claude Warren (Jira) Sat, 09 Jul 2022 03:18:12 -0700


    [ 
https://issues.apache.org/jira/browse/COLLECTIONS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564564#comment-17564564
 ]


Claude Warren commented on COLLECTIONS-824:
-------------------------------------------

I have modified Shape so that it will not accept a number of functions greater 
than the number of bits.

I have changed the forEachIndex to be:

 
{code:java}
           @Override
            public boolean forEachIndex(IntPredicate consumer) {
                Objects.requireNonNull(consumer, "consumer");
                final int bits = shape.getNumberOfBits();
                // Enhanced double hashing:
                //   hash[i] = ( h1(x) + i*h2(x) + (i*i*i - i)/6 ) mod bits
                // See: 
https://en.wikipedia.org/wiki/Double_hashing#Enhanced_double_hashing
                //
                // Essentially this is computing a wrapped modulus from a start 
point and an
                // increment and an additional term as a tetrahedral number.
                // You only need two modulus operations before the loop. Within 
the loop
                // the modulus is handled using the sign bit to detect wrapping 
to ensure:
                // 0 <= index < bits
                // 0 <= inc < bits
                // The final hash is:
                //   hash[i] = ( h1(x) - i*h2(x) - (i*i*i - i)/6 ) wrapped in 
[0, bits)

                int index = mod(initial, bits);
                int inc = mod(increment, bits);

                final int k = shape.getNumberOfHashFunctions();
                for (int i = 0; i < k; i++) {
                    if (!consumer.test(index)) {
                        return false;
                    }
                    // Update index and handle wrapping
                    index -= inc;
                    index = index < 0 ? index + bits : index;

                    // Incorporate the counter into the increment to create a
                    // tetrahedral number additional term, and handle wrapping.
                    inc -= i;
                    inc = inc < 0 ? inc + bits : inc;
                }
                return true;
            }
 {code}
But I think the comments are wrong.  As I understand the code and the wikipedia 
article you cited, this is adding a cubic term rather than a tetrahedral 
number.  If you agree I will modify the comments appropriately.

> BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change
> ----------------------------------------------------------------------------
>
>                 Key: COLLECTIONS-824
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-824
>             Project: Commons Collections
>          Issue Type: Improvement
>          Components: Collection
>    Affects Versions: 4.5
>            Reporter: Claude Warren
>            Assignee: Claude Warren
>            Priority: Major
>              Labels: bloom-filter
>
> {noformat}
>             public boolean forEachIndex(IntPredicate consumer) {
>                 Objects.requireNonNull(consumer, "consumer");
>                 int bits = shape.getNumberOfBits();
>                 /*
>                  * Essentially this is computing a wrapped modulus from a 
> start point and an
>                  * increment. So actually you only need two modulus 
> operations before the loop.
>                  * This avoids any modulus operation inside the while loop. 
> It uses a long index
>                  * to avoid overflow.
>                  */
>                 long index = mod(initial, bits);
>                 int inc = mod(increment, bits);
>                 for (int functionalCount = 0; functionalCount < 
> shape.getNumberOfHashFunctions(); functionalCount++) {
>                     if (!consumer.test((int) index)) {
>                         return false;
>                     }
>                     index += inc;
>                     index = index >= bits ? index - bits : index;
>                 }
>                 return true;
>             }
> {noformat}
> This can be fixed to be non zero by using a smaller modulus:
> {noformat}
>                 // Ensure non-zero increment
>                 int inc = 1 + mod(increment, bits - 1);
> {noformat}
> This will prevent the 1/bits occurrence that the increment is zero and the 
> hash is poor (it will be a single index).
> If bits == 1 then this will throw an exception. However if bits==1 this is an 
> edge case that can be handled immediately after obtaining the number of bits 
> as:
> {noformat}
>                 if (bits == 1) {
>                     // No need to send k duplicates, the filter shape has 
> been constructed incorrectly.
>                     return consumer.test(0);
>                 }
> {noformat}
> https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144
> Here we have index as a long. However if the increment (which is always 
> positive) is subtracted we can use an int and compare to zero:
> {noformat}
> index -= inc;
> index = index < 0 ? index + bits : index;
> {noformat}
> This does break a lot of unit tests. These tests have been created using 
> SimpleHasher knowing what the hash modulus will be. The tests should be 
> updated to use a FixedHasher in the test package:
> {noformat}
>     /**
>      * To be used for testing only. The fixed indices must fit within the 
> filter shape.
>      */
>     class FixedHasher implements Hasher {
>         private final int[] indices;
>         private int[] distinct;
>         FixedHasher(int... indices) {
>             this.indices = indices.clone();
>         }
>         @Override
>         public IndexProducer indices(Shape shape) {
>             checkIndices(shape);
>             return IndexProducer.fromIndexArray(indices);
>         }
>         @Override
>         public IndexProducer uniqueIndices(Shape shape) {
>             checkIndices(shape);
>             int[] d = distinct;
>             if (d == null) {
>                 distinct = d = Arrays.stream(indices).distinct().toArray();
>             }
>             return IndexProducer.fromIndexArray(d);
>         }
>         private void checkIndices(Shape shape) {
>             // Check the hasher is OK for the shape
>             int bits = shape.getNumberOfBits();
>             for (int i : indices) {
>                 Assertions.assertTrue(i < bits);
>             }
>         }
>     }
> {noformat}
> This hasher will allow filters to be constructed and manipulated. Or change 
> it to use % bits and drop the assertion in checkIndices.
> The SimpleHasher should just be tested that it outputs random indices. I 
> would suggest creating a few thousand randomly seeded hashers, outputting 5 
> indices from each and building a histogram of the counts of each index. This 
> should be uniform when tested with a Chi-square test. This can be repeated 
> for a few different shape sizes.
> Updating the filter in this way will still implement a Krisch and 
> Mitzenmacher hasher where:
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m
> {noformat}
> The only difference is the subtraction rather than addition.
> The wikipedia article on [Double Hashing 
> (variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that 
> this type of hasher when used in Bloom filters will result in a collision 
> rate that is twice as likely as intended.
> An alternative is to provide enhanced double hashing which changes the 
> increment each loop iteration. This would require some additional work on 
> wrapping the modulus of the increment.
> It may be better to change the name to KMHasher and more explicitly state 
> that the hash functions are:
>  
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (COLLECTIONS-824) BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

Reply via email to