[jira] [Commented] (COLLECTIONS-824) BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

Alex Herbert (Jira) Sat, 09 Jul 2022 06:43:06 -0700


    [ 
https://issues.apache.org/jira/browse/COLLECTIONS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564580#comment-17564580
 ]


Alex Herbert commented on COLLECTIONS-824:
------------------------------------------

The term is not a pure cubic addition. If you follow through the code logic the 
term is a [tetrahedral number|https://en.wikipedia.org/wiki/Tetrahedral_number]:
||i||hash||term||
|0|start|0|
|1|start - inc|0|
|2|start - inc - inc - 1|1|
|3|start - inc - inc - inc - 1 - 1 - 2|4|
|4|start - inc - inc - inc - inc - 1 - 1 - 1 - 2 - 2 - 3|10|
|5|start - inc - inc - inc - inc - inc - 1 - 1 - 1 - 1 - 2 - 2 - 2 - 3 - 3 - 
4|20|
|i|start - i * inc - (i*i*i - i) / 6|(i*i*i - i) / 6|
{quote}I have modified Shape so that it will not accept a number of functions 
greater than the number of bits.
{quote}
Are you sure you want to make the change here? This is not technically an 
invalid shape, just a poorly chosen one. A shape with no hash functions or no 
bits is invalid: this is the only exception currently checked. Other than that 
I think a Shape is OK. Note that even if the number of hash functions is above 
the number of bits, adding an item may not saturate the filter as the hash 
functions are random and can have many duplicates.

If you have updated Shape then the factory methods in Shape will have to be 
updated to state that an exception is thrown if the number of hash functions is 
larger than the number of bits for these methods: fromNMK and fromPMK.

Since this is an edge case that should not be used then updating Shape will not 
impact those who are using it correctly. But it does prevent the factory 
methods from exercising the Bloom filter equations to their limits. Note that 
in this case the probability of false positives will approach 1:
{noformat}
p = pow(1 - exp(-k / (m / n)), k)

when  k >> m    
then  exp(-k / (m / n)) -> 0
and   pow(1 - exp(-k / (m / n)), k) -> 1{noformat}

I think it better to leave Shape alone and have this limitation documented 
where it is relevant, i.e. in the hasher that is optimised to avoid modulus 
operations on the assumption that k <= m.


> BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change
> ----------------------------------------------------------------------------
>
>                 Key: COLLECTIONS-824
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-824
>             Project: Commons Collections
>          Issue Type: Improvement
>          Components: Collection
>    Affects Versions: 4.5
>            Reporter: Claude Warren
>            Assignee: Claude Warren
>            Priority: Major
>              Labels: bloom-filter
>
> {noformat}
>             public boolean forEachIndex(IntPredicate consumer) {
>                 Objects.requireNonNull(consumer, "consumer");
>                 int bits = shape.getNumberOfBits();
>                 /*
>                  * Essentially this is computing a wrapped modulus from a 
> start point and an
>                  * increment. So actually you only need two modulus 
> operations before the loop.
>                  * This avoids any modulus operation inside the while loop. 
> It uses a long index
>                  * to avoid overflow.
>                  */
>                 long index = mod(initial, bits);
>                 int inc = mod(increment, bits);
>                 for (int functionalCount = 0; functionalCount < 
> shape.getNumberOfHashFunctions(); functionalCount++) {
>                     if (!consumer.test((int) index)) {
>                         return false;
>                     }
>                     index += inc;
>                     index = index >= bits ? index - bits : index;
>                 }
>                 return true;
>             }
> {noformat}
> This can be fixed to be non zero by using a smaller modulus:
> {noformat}
>                 // Ensure non-zero increment
>                 int inc = 1 + mod(increment, bits - 1);
> {noformat}
> This will prevent the 1/bits occurrence that the increment is zero and the 
> hash is poor (it will be a single index).
> If bits == 1 then this will throw an exception. However if bits==1 this is an 
> edge case that can be handled immediately after obtaining the number of bits 
> as:
> {noformat}
>                 if (bits == 1) {
>                     // No need to send k duplicates, the filter shape has 
> been constructed incorrectly.
>                     return consumer.test(0);
>                 }
> {noformat}
> https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144
> Here we have index as a long. However if the increment (which is always 
> positive) is subtracted we can use an int and compare to zero:
> {noformat}
> index -= inc;
> index = index < 0 ? index + bits : index;
> {noformat}
> This does break a lot of unit tests. These tests have been created using 
> SimpleHasher knowing what the hash modulus will be. The tests should be 
> updated to use a FixedHasher in the test package:
> {noformat}
>     /**
>      * To be used for testing only. The fixed indices must fit within the 
> filter shape.
>      */
>     class FixedHasher implements Hasher {
>         private final int[] indices;
>         private int[] distinct;
>         FixedHasher(int... indices) {
>             this.indices = indices.clone();
>         }
>         @Override
>         public IndexProducer indices(Shape shape) {
>             checkIndices(shape);
>             return IndexProducer.fromIndexArray(indices);
>         }
>         @Override
>         public IndexProducer uniqueIndices(Shape shape) {
>             checkIndices(shape);
>             int[] d = distinct;
>             if (d == null) {
>                 distinct = d = Arrays.stream(indices).distinct().toArray();
>             }
>             return IndexProducer.fromIndexArray(d);
>         }
>         private void checkIndices(Shape shape) {
>             // Check the hasher is OK for the shape
>             int bits = shape.getNumberOfBits();
>             for (int i : indices) {
>                 Assertions.assertTrue(i < bits);
>             }
>         }
>     }
> {noformat}
> This hasher will allow filters to be constructed and manipulated. Or change 
> it to use % bits and drop the assertion in checkIndices.
> The SimpleHasher should just be tested that it outputs random indices. I 
> would suggest creating a few thousand randomly seeded hashers, outputting 5 
> indices from each and building a histogram of the counts of each index. This 
> should be uniform when tested with a Chi-square test. This can be repeated 
> for a few different shape sizes.
> Updating the filter in this way will still implement a Krisch and 
> Mitzenmacher hasher where:
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m
> {noformat}
> The only difference is the subtraction rather than addition.
> The wikipedia article on [Double Hashing 
> (variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that 
> this type of hasher when used in Bloom filters will result in a collision 
> rate that is twice as likely as intended.
> An alternative is to provide enhanced double hashing which changes the 
> increment each loop iteration. This would require some additional work on 
> wrapping the modulus of the increment.
> It may be better to change the name to KMHasher and more explicitly state 
> that the hash functions are:
>  
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (COLLECTIONS-824) BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

Reply via email to