[jira] [Updated] (COLLECTIONS-824) BloomFilter: change name of SimpleHasher

Claude Warren (Jira) Wed, 15 Jun 2022 01:45:06 -0700


     [ 
https://issues.apache.org/jira/browse/COLLECTIONS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Claude Warren updated COLLECTIONS-824:
--------------------------------------
    Description: 

{noformat}

            public boolean forEachIndex(IntPredicate consumer) {
                Objects.requireNonNull(consumer, "consumer");
                int bits = shape.getNumberOfBits();
                /*
                 * Essentially this is computing a wrapped modulus from a start 
point and an
                 * increment. So actually you only need two modulus operations 
before the loop.
                 * This avoids any modulus operation inside the while loop. It 
uses a long index
                 * to avoid overflow.
                 */
                long index = mod(initial, bits);
                int inc = mod(increment, bits);


                for (int functionalCount = 0; functionalCount < 
shape.getNumberOfHashFunctions(); functionalCount++) {


                    if (!consumer.test((int) index)) {
                        return false;
                    }
                    index += inc;
                    index = index >= bits ? index - bits : index;
                }
                return true;
            }

{noformat}


This can be fixed to be non zero by using a smaller modulus:

{noformat}

                // Ensure non-zero increment
                int inc = 1 + mod(increment, bits - 1);
{noformat}


This will prevent the 1/bits occurrence that the increment is zero and the hash 
is poor (it will be a single index).

If bits == 1 then this will throw an exception. However if bits==1 this is an 
edge case that can be handled immediately after obtaining the number of bits as:

{noformat}

                if (bits == 1) {
                    // No need to send k duplicates, the filter shape has been 
constructed incorrectly.
                    return consumer.test(0);
                }

{noformat}




https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144

Here we have index as a long. However if the increment (which is always 
positive) is subtracted we can use an int and compare to zero:

{noformat}

index -= inc;
index = index < 0 ? index + bits : index;

{noformat}

This does break a lot of unit tests. These tests have been created using 
SimpleHasher knowing what the hash modulus will be. The tests should be updated 
to use a FixedHasher in the test package:


{noformat}
    /**
     * To be used for testing only. The fixed indices must fit within the 
filter shape.
     */
    class FixedHasher implements Hasher {
        private final int[] indices;
        private int[] distinct;

        FixedHasher(int... indices) {
            this.indices = indices.clone();
        }

        @Override
        public IndexProducer indices(Shape shape) {
            checkIndices(shape);
            return IndexProducer.fromIndexArray(indices);
        }

        @Override
        public IndexProducer uniqueIndices(Shape shape) {
            checkIndices(shape);
            int[] d = distinct;
            if (d == null) {
                distinct = d = Arrays.stream(indices).distinct().toArray();
            }
            return IndexProducer.fromIndexArray(d);
        }

        private void checkIndices(Shape shape) {
            // Check the hasher is OK for the shape
            int bits = shape.getNumberOfBits();
            for (int i : indices) {
                Assertions.assertTrue(i < bits);
            }
        }
    }

{noformat}

This hasher will allow filters to be constructed and manipulated. Or change it 
to use % bits and drop the assertion in checkIndices.

The SimpleHasher should just be tested that it outputs random indices. I would 
suggest creating a few thousand randomly seeded hashers, outputting 5 indices 
from each and building a histogram of the counts of each index. This should be 
uniform when tested with a Chi-square test. This can be repeated for a few 
different shape sizes.

Updating the filter in this way will still implement a Krisch and Mitzenmacher 
hasher where:

{noformat}

g_i(x) = (h1(x) + i h2(x)) mod m
{noformat}


The only difference is the subtraction rather than addition.



The wikipedia article on [Double Hashing 
(variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that 
this type of hasher when used in Bloom filters will result in a collision rate 
that is twice as likely as intended.

An alternative is to provide enhanced double hashing which changes the 
increment each loop iteration. This would require some additional work on 
wrapping the modulus of the increment.

It may be better to change the name to KMHasher and more explicitly state that 
the hash functions are:


 


{noformat}
g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
{noformat}

 

  was:
The wikipedia article on [Double Hashing 
(variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that 
this type of hasher when used in Bloom filters will result in a collision rate 
that is twice as likely as intended.

An alternative is to provide enhanced double hashing which changes the 
increment each loop iteration. This would require some additional work on 
wrapping the modulus of the increment.

It may be better to change the name to KMHasher and more explicitly state that 
the hash functions are:


 


{noformat}
g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
{noformat}

 


> BloomFilter: change name of SimpleHasher
> ----------------------------------------
>
>                 Key: COLLECTIONS-824
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-824
>             Project: Commons Collections
>          Issue Type: Improvement
>          Components: Collection
>    Affects Versions: 4.5
>            Reporter: Claude Warren
>            Priority: Major
>              Labels: bloom-filter
>
> {noformat}
>             public boolean forEachIndex(IntPredicate consumer) {
>                 Objects.requireNonNull(consumer, "consumer");
>                 int bits = shape.getNumberOfBits();
>                 /*
>                  * Essentially this is computing a wrapped modulus from a 
> start point and an
>                  * increment. So actually you only need two modulus 
> operations before the loop.
>                  * This avoids any modulus operation inside the while loop. 
> It uses a long index
>                  * to avoid overflow.
>                  */
>                 long index = mod(initial, bits);
>                 int inc = mod(increment, bits);
>                 for (int functionalCount = 0; functionalCount < 
> shape.getNumberOfHashFunctions(); functionalCount++) {
>                     if (!consumer.test((int) index)) {
>                         return false;
>                     }
>                     index += inc;
>                     index = index >= bits ? index - bits : index;
>                 }
>                 return true;
>             }
> {noformat}
> This can be fixed to be non zero by using a smaller modulus:
> {noformat}
>                 // Ensure non-zero increment
>                 int inc = 1 + mod(increment, bits - 1);
> {noformat}
> This will prevent the 1/bits occurrence that the increment is zero and the 
> hash is poor (it will be a single index).
> If bits == 1 then this will throw an exception. However if bits==1 this is an 
> edge case that can be handled immediately after obtaining the number of bits 
> as:
> {noformat}
>                 if (bits == 1) {
>                     // No need to send k duplicates, the filter shape has 
> been constructed incorrectly.
>                     return consumer.test(0);
>                 }
> {noformat}
> https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144
> Here we have index as a long. However if the increment (which is always 
> positive) is subtracted we can use an int and compare to zero:
> {noformat}
> index -= inc;
> index = index < 0 ? index + bits : index;
> {noformat}
> This does break a lot of unit tests. These tests have been created using 
> SimpleHasher knowing what the hash modulus will be. The tests should be 
> updated to use a FixedHasher in the test package:
> {noformat}
>     /**
>      * To be used for testing only. The fixed indices must fit within the 
> filter shape.
>      */
>     class FixedHasher implements Hasher {
>         private final int[] indices;
>         private int[] distinct;
>         FixedHasher(int... indices) {
>             this.indices = indices.clone();
>         }
>         @Override
>         public IndexProducer indices(Shape shape) {
>             checkIndices(shape);
>             return IndexProducer.fromIndexArray(indices);
>         }
>         @Override
>         public IndexProducer uniqueIndices(Shape shape) {
>             checkIndices(shape);
>             int[] d = distinct;
>             if (d == null) {
>                 distinct = d = Arrays.stream(indices).distinct().toArray();
>             }
>             return IndexProducer.fromIndexArray(d);
>         }
>         private void checkIndices(Shape shape) {
>             // Check the hasher is OK for the shape
>             int bits = shape.getNumberOfBits();
>             for (int i : indices) {
>                 Assertions.assertTrue(i < bits);
>             }
>         }
>     }
> {noformat}
> This hasher will allow filters to be constructed and manipulated. Or change 
> it to use % bits and drop the assertion in checkIndices.
> The SimpleHasher should just be tested that it outputs random indices. I 
> would suggest creating a few thousand randomly seeded hashers, outputting 5 
> indices from each and building a histogram of the counts of each index. This 
> should be uniform when tested with a Chi-square test. This can be repeated 
> for a few different shape sizes.
> Updating the filter in this way will still implement a Krisch and 
> Mitzenmacher hasher where:
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m
> {noformat}
> The only difference is the subtraction rather than addition.
> The wikipedia article on [Double Hashing 
> (variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that 
> this type of hasher when used in Bloom filters will result in a collision 
> rate that is twice as likely as intended.
> An alternative is to provide enhanced double hashing which changes the 
> increment each loop iteration. This would require some additional work on 
> wrapping the modulus of the increment.
> It may be better to change the name to KMHasher and more explicitly state 
> that the hash functions are:
>  
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Updated] (COLLECTIONS-824) BloomFilter: change name of SimpleHasher

Reply via email to