[GitHub] [commons-collections] aherbert commented on a change in pull request #258: Simplify bloom filters

GitBox Wed, 09 Feb 2022 13:52:23 -0800


aherbert commented on a change in pull request #258:
URL: 
https://github.com/apache/commons-collections/pull/258#discussion_r803118772




##########
File path: 
src/main/java/org/apache/commons/collections4/bloomfilter/hasher/SimpleHasher.java
##########
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter.hasher;
+
+import java.util.Objects;
+import java.util.function.IntPredicate;
+
+import org.apache.commons.collections4.bloomfilter.IndexProducer;
+import org.apache.commons.collections4.bloomfilter.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.filter.Filter;
+
+/**
+ * A Hasher that implements combinatorial hashing as as described by
+ * <a 
href="https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf";>Krisch 
amd Mitzenmacher</a>.
+ * <p>
+ * Common use for this hasher is to generate a byte array as the output of a 
hashing
+ * or MessageDigest algorithm.</p>
+ *
+ * @since 4.5
+ */
+public final class SimpleHasher implements Hasher {
+    /**
+     * A default increment used when the requested increment is zero. This is 
the same
+     * default increment used in Java's SplittableRandom random number 
generator.  It is the
+     * fractional representation of the golden ratio (0.618...) with a base of 
2^64.
+     */
+    public static final long DEFAULT_INCREMENT = 0x9e3779b97f4a7c15L;
+
+    /**
+     * This mask is used to obtain the value of an int as if it were unsigned.
+     */
+    private static final long LONG_MASK = 0xffffffffL;
+
+    /**
+     * The initial hash value.
+     */
+    private final long initial;
+
+    /**
+     * The value to increment the hash value by.
+     */
+    private final long increment;
+
+    /**
+     * Convert bytes to long.
+     * @param byteArray the byte array to extract the values from.
+     * @param offset the offset to start extraction from.
+     * @param len the length of the extraction, may be longer than 8.
+     * @return
+     */
+    private static long toLong(byte[] byteArray, int offset, int len) {
+        long val = 0;
+        len = Math.min(len, Long.BYTES);
+        for (int i = 0; i < len; i++) {
+            val <<= 8;
+            val |= (byteArray[offset + i] & 0x00FF);
+        }
+        return val;
+    }
+
+    /**
+     * Constructs the SimpleHasher from a byte array.
+     * <p>The byte array is split in 2 and each half is interpreted as a long 
value.
+     * Excess bytes are ignored.  This simplifies the conversion from a Digest 
or hasher algorithm output
+     * to the two values used by the SimpleHasher.</p>
+     * <p><em>If the second long is zero the DEFAULT_INCREMENT is used 
instead.</em></p>
+     * @param buffer the buffer to extract the longs from.
+     * @throws IllegalArgumentException is buffer length is zero.
+     */
+    public SimpleHasher(byte[] buffer) {
+        if (buffer.length == 0) {
+            throw new IllegalArgumentException("buffer length must be greater 
than 0");
+        }
+        int segment = buffer.length / 2;
+        this.initial = toLong(buffer, 0, segment);
+        long possibleIncrement = toLong(buffer, segment, buffer.length - 
segment);
+        this.increment = possibleIncrement == 0 ? DEFAULT_INCREMENT : 
possibleIncrement;
+    }
+
+    /**
+     * Constructs the SimpleHasher from 2 longs.  The long values will be 
interpreted as unsigned values.
+     * <p><em>If the increment is zero the DEFAULT_INCREMENT is used 
instead.</em></p>
+     * @param initial The initial value for the hasher.
+     * @param increment The value to increment the hash by on each iteration.
+     */
+    public SimpleHasher(long initial, long increment) {
+        this.initial = initial;
+        this.increment = increment == 0 ? DEFAULT_INCREMENT : increment;
+    }
+
+    /**
+     * This method divides a long < 0 as an unsigned long and returns the 
remainder.
+     *
+     * @param dividend the number to divide.
+     * @param divisor the divisor
+     * @return the remainder
+     */
+    private static int remainder(long dividend, int divisor) {
+        long longDivisor = divisor & LONG_MASK;
+        long remainder;
+        long quotient;
+        if (divisor == 1) {
+            return 0;
+        }
+
+        // Approximate the quotient and remainder
+        quotient = (dividend >>> 1) / (longDivisor >>> 1);
+        remainder = dividend - quotient * longDivisor;
+
+        // Correct the approximation
+        while (remainder < 0) {
+            remainder += longDivisor;
+            quotient--;
+        }
+        while (remainder >= longDivisor) {
+            remainder -= longDivisor;
+            quotient++;
+        }
+        return (int) remainder;
+    }
+
+    /**
+     * Calculates the modulus of an unsigned long value and an integer divisor.
+     * @param dividend the unsigned long value to divide.
+     * @param divisor the divisor
+     * @return the remainder.
+     */
+    static int mod(long dividend, int divisor) {

Review comment:
       A bit more thought on this SimpleHasher. Essentially it is computing a 
wrapped modulus from a start point and an increment. So actually you only need 
two modulus operations before the loop. I added this to the SimpleHasher:
   
   ```Java
   public boolean forEachIndex(IntPredicate consumer) {
       Objects.requireNonNull(consumer, "consumer");
   
       // Put these in the same package. Make them package private?
       Filter filter = new Filter(shape, consumer);
   
       int bits = shape.getNumberOfBits();
   
       // *** New ***
       // Set up for the modulus. Use a long index to avoid overflow.
       long index2 = mod(next, bits);
       int inc = mod(increment, bits);
   
       for (int functionalCount = 0; functionalCount < 
shape.getNumberOfHashFunctions(); functionalCount++) {
           int index = mod(next, bits);
   
           // Check the fast modulus wrap works
           assert index == index2;
           index2 += inc;
           index2 = index2 >= bits ? index2 - bits : index2;
   
           if (!filter.test(index)) {
               // reset next
               next = SimpleHasher.this.initial;
               return false;
           }
           next += SimpleHasher.this.increment;
       }
       // reset next
       next = SimpleHasher.this.initial;
       return true;
   }
   ```
   
   The entire test suite passes. Thus the `index2` can be used and avoids any 
modulus operation inside the while loop. A conditional load to check for 
wrapping is a fast instruction and I would guess this would be faster than the 
long modulus on each loop iteration. So we can add it to the benchmark.
   
   An alternative to generating an integer in a range is to use a biased 
sampling algorithm for random numbers. This requires a set of random bits. Here 
is the explanation for 32-bits:
   ```Java
    static int makeIntInRange(int value, int n) {
       // n must be positive !!!
   
       // This computes the rounded fraction n * v / 2^32.
       // If v is an unsigned integer in the range 0 to 2^32 -1 then v / 2^32
       // is a uniform deviate in the range [0,1).
       // This is possible using unsigned integer arithmetic with long.
       // A division by 2^32 is a bit shift of 32.
       return (int) ((n * (value & INT_TO_UNSIGNED_BYTE_MASK)) >>> 32);
     }
   ```
   
   This would be extremely fast. The algorithm is biased as there is rounding 
error. However if all possible values are input as `value` the output histogram 
of counts for all possible samples will be either `j` or `j+1` counts for each 
value. You can work out the bias formally but I won't bother here. When the 
number of hash functions k is small any sampling bias will be irrelevant. The 
bias is exactly the same bias as if using `long % int`, however the output 
integers are a linear mapping and not a series wrapping (see below).
   
   This can be adapted to use 64-bit numbers but is only fast if supported with 
a hardware based 64-bit multiply to 128-bits as in JDK 11 
[Math.multiplyHigh](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Math.html#multiplyHigh(long,long)).
   
   The other issue is that the output integer in the range is a linear mapping 
of the unsigned input value. So if you input a range of values built with a 
small increment all the output integers could map to the same value. This would 
thus require a true random int value from the SimpleHasher and not just the 
simple [Weyl sequence](https://en.wikipedia.org/wiki/Weyl_sequence) output. 
This can be done using RNG methods and I added it to the benchmark.
   
   Here is the extra mod function you suggested in the benchmark (added as 
`mod5`). I added the wrapping modulus operation as `mod6`.  Note I generate a 
random long value and mix it with the index so that the method has the same 
baseline time. However this is not a totally fair test as the generator has a 
setup cost that is not included in the timing. A fairer test would be of a 
complete hasher being seeded with two random long values and outputting a 
configurable k indices. The alternative RNG method is `rng2`:
   ```
   (divisor)                (method)  Mode  Cnt   Score   Error  Units
       35677                baseline  avgt    5   2.429 ± 0.022  ns/op
       35677       remainderUnsigned  avgt    5  13.725 ± 0.069  ns/op
       35677  remainderUnsignedJdk11  avgt    5  54.432 ± 1.291  ns/op
       35677                   shift  avgt    5  12.332 ± 0.418  ns/op
       35677                    mod1  avgt    5  26.947 ± 2.220  ns/op
       35677                    mod2  avgt    5  27.527 ± 1.164  ns/op
       35677                    mod3  avgt    5  13.591 ± 0.192  ns/op
       35677                    mod4  avgt    5  13.284 ± 0.245  ns/op
       35677                    mod5  avgt    5  12.537 ± 0.077  ns/op
       35677                    mod6  avgt    5   3.045 ± 0.137  ns/op
       35677                     rng  avgt    5   4.796 ± 0.071  ns/op
       35677                    rng2  avgt    5   2.531 ± 0.050  ns/op
   ```
   
   So `mod5` is a tiny bit faster than the true unsignedRemainder method 
(`mod4`). However `mod6` is extremely fast. If the SimpleHasher is to use 
Krisch and Mitzenmacher's method then this is the implementation we should use. 
It would be a drop in replacement for what you have and the test suite can 
remain unchanged.
   
   `rng2` is faster again. It will have a truly unpredictable output if the 
SimpleHasher uses a fast RNG method to output the int values required (here I 
used SplittableRandom but there are faster generators with 128-bits of state we 
could use instead from Commons RNG). Using a configurable RNG for the hasher 
could maybe be added later with an optional runtime dependency on Commons RNG. 
This provides a factory to convert `byte[]` into a properly seeded RNG. So it 
could be added to support hashing of byte digests of different lengths by 
generating random numbers from the seed.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [commons-collections] aherbert commented on a change in pull request #258: Simplify bloom filters

Reply via email to