aherbert commented on a change in pull request #258:
URL: 
https://github.com/apache/commons-collections/pull/258#discussion_r805135039



##########
File path: 
src/test/java/org/apache/commons/collections4/bloomfilter/hasher/AbstractHasherTest.java
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter.hasher;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import org.apache.commons.collections4.bloomfilter.AbstractIndexProducerTest;
+import org.apache.commons.collections4.bloomfilter.IndexProducer;
+import org.apache.commons.collections4.bloomfilter.Shape;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractHasherTest extends AbstractIndexProducerTest {
+
+    protected abstract Hasher createHasher();
+
+    protected abstract Hasher createEmptyHasher();
+
+    /**
+     * The shape of the filters used by the Hashers under test.
+     * <ul>
+     *  <li>Hash functions (k) = 17
+     *  <li>Number of bits (m) = 72
+     * </ul>
+     * @return the testing shape.
+     */
+    protected final Shape getTestShape() {
+        return Shape.fromKM(17, 72);
+    }
+
+    @Override
+    protected IndexProducer createProducer() {
+        return createHasher().indices(getTestShape());
+    }
+
+    @Override
+    protected IndexProducer createEmptyProducer() {
+        return createEmptyHasher().indices(getTestShape());
+    }
+
+    @Test
+    public void testSize() {
+        assertEquals(1, createHasher().size());
+        assertEquals(0, createEmptyHasher().size());
+    }
+
+    @Test
+    public void testIsEmpty() {
+        assertFalse(createHasher().isEmpty());
+        assertTrue(createEmptyHasher().isEmpty());
+    }
+}

Review comment:
   This seems to more closely match the name `HasherSet` rather than `SingleItemHasherCollection`. You are aware that the Hasher will return up to `n * k` indices. However, the 'set' can be added to and removed from a filter atomically. Another name would be `UniqueIndicesHasherCollection`.
   
   My main issue with this is that the `size()` method is misleading. I would interpret it as the number of items added to my BloomFilter. Currently the method returns 1 even though the hasher will return up to `n * k` indices and populate a filter as if adding `n` items. A more accurate description of the size method is the maximum count that can be returned for an individual index when the hasher's `uniqueIndices` method is invoked. However, if I am not mistaken, the size method is used in other parts of the code to compute optimal storage for filters, i.e. it is used as if `size * k` is the maximum number of indices. So this is contradictory.
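   To make the contradiction concrete, here is a small numeric sketch (the numbers are illustrative, not taken from the actual code):

```Java
public class SizeContradictionSketch {
    // Storage bound computed from the reported size, as the filter code does.
    static int assumedMaxIndices(int reportedSize, int k) {
        return reportedSize * k;
    }

    // Indices a composite hasher of n items can actually emit.
    static int actualMaxIndices(int n, int k) {
        return n * k;
    }

    public static void main(String[] args) {
        // A SingleItemHasherCollection of 3 hashers reporting size() == 1,
        // with the k = 17 test shape above.
        System.out.println(assumedMaxIndices(1, 17)); // 17
        System.out.println(actualMaxIndices(3, 17));  // 51: three times the assumed bound
    }
}
```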
   
   If the hasher serves only to eliminate duplicate indices to allow atomic insertion into and removal from a counting Bloom filter, then performance should be the same as creating a SimpleBloomFilter/SparseBloomFilter and inserting that, because the code to create `uniqueIndices` will use either a `long[]` or a `TreeSet<Integer>` to eliminate duplicates.
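   For reference, duplicate elimination with a `TreeSet<Integer>` amounts to no more than the following (a sketch, not the actual IndexProducer implementation):

```Java
import java.util.Arrays;
import java.util.TreeSet;

public class UniqueIndicesSketch {
    // Collapses raw per-hasher indices (possibly duplicated across hashers)
    // into the ordered unique indices a sparse filter merge would see.
    static int[] uniqueIndices(int[] rawIndices) {
        TreeSet<Integer> unique = new TreeSet<>();
        for (int index : rawIndices) {
            unique.add(index);
        }
        return unique.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Two hashers both produced index 5; the set keeps it once.
        System.out.println(Arrays.toString(uniqueIndices(new int[] {5, 9, 5, 2})));
        // prints [2, 5, 9]
    }
}
```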
   
   So at the moment the `SingleItemHasherCollection` is reporting the size incorrectly (bad), or at least differently from other hashers, and providing no performance benefit over merging a Bloom filter for atomic addition/removal. I say reporting the size incorrectly is bad because an end user of the Hasher interface may wish to do some approximations based on the input number of items; passing a Hasher that reports itself incorrectly is misleading.
   
   I also note that a counting Bloom filter will still be constructed using the same number-of-bits computations as a regular Bloom filter. As such its FP rate will be identical to that of a standard Bloom filter, and that FP rate is based on the number of items `n` and the number of hash functions `k`. So the size reported by a hasher collection should be `n`. A second method, `indexCardinality`, could be added to the Hasher interface for the purpose served by the SingleItemHasherCollection:
   
   ```Java
       /**
        * Gets the maximum count returned for an index when the {@code 
IndexProducer}
        * is created using the {@link #uniqueIndices(Shape)} method.
        *
        * @return the index cardinality
        */
       int indexCardinality();
   ```
   
   This would allow a receiver of a Hasher to understand what will be generated by the unique indices method. This method may be useful anyway, as a HasherCollection can generate the same index multiple times, which is confusing without extra documentation. The size method is then used to report the number of items represented by the hasher and to compute the maximum number of indices as `size * k`. Since there seem to be two options here, the method could instead be a boolean:
   
   ```Java
       /**
        * Returns true if the {@code IndexProducer} created using the
        * {@link #uniqueIndices(Shape)} method will generate unique indices
        * per item. If false then the indices will be unique across the entire
        * set of items.
        *
        * @return true if each item generates unique indices
        */
       default boolean uniqueIndexPerItem() {
           return true;
       }
   ```
   
   However there is a lot of room here for implementations to get it wrong. As such, the current interface should be updated to document that `uniqueIndices` returns unique indices per item, as reported by the `size` method.
   
   
   If the intended purpose is to simplify code that wishes to atomically merge composite hashers, then the javadoc should make it very clear that this object exists for that purpose:
   
   --- cut ---
   
   - The `HasherCollection` is a composite Hasher that functions as a _bag_ of 
individual hashers. A bag represents a count for each index. If several hashers 
in the collection generate the same index then the `uniqueIndices` method will 
return the index multiple times.
   - The `SingleItemHasherCollection` is a composite Hasher that functions as a 
_set_ of individual hashers. A set represents an occurrence of each index. If 
several hashers in the collection generate the same index then the 
`uniqueIndices` method will return the index once. This is a convenience class 
that allows a composite hasher to be atomically inserted and removed from a 
counting Bloom filter.
   
   Note:
   
   A standard HasherCollection added to a counting Bloom filter is functionally 
identical to adding each Hasher individually. Any Hasher in the collection can 
be individually removed. A SingleItemHasherCollection added to a counting Bloom 
filter is functionally identical to creating a Bloom filter from the collection 
and adding it to a counting Bloom filter. Removal of an individual Hasher from 
the collection will corrupt the counts.
   
   --- cut ---
   
   However this level of specialisation for the sole purpose of use with 
counting Bloom filters seems unnecessary. I would drop the class.
   
   It may be useful to put some of this explanation somewhere in the javadoc, perhaps in the counting Bloom filter, to make clear that an "item" added to the counting Bloom filter, such as a Bloom filter constructed from several Hashers, must be removed as the same "item". Attempts to remove only part of an "item", such as a single Hasher used to create the composite Bloom filter, will result in corrupt counts.
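   The corruption can be demonstrated with a toy counting filter, here just an `int[]` of per-index counts (a sketch, not the CountingBloomFilter API):

```Java
import java.util.Arrays;

public class CountCorruptionSketch {
    // Adds (delta = +1) or removes (delta = -1) an "item" given its indices.
    static void merge(int[] counts, int[] indices, int delta) {
        for (int index : indices) {
            counts[index] += delta;
        }
    }

    public static void main(String[] args) {
        int[] counts = new int[8];
        int[] hasherA = {1, 3};        // indices from one hasher in the collection
        int[] composite = {1, 3, 5};   // deduplicated indices of the whole "item"

        merge(counts, composite, +1);  // insert the item atomically
        merge(counts, composite, -1);  // removing the same "item" restores all zeros
        System.out.println(Arrays.toString(counts));

        merge(counts, composite, +1);
        merge(counts, hasherA, -1);    // removing only part of the "item"...
        // ...drops index 3 to 0 even though the rest of the item still maps there.
        System.out.println(Arrays.toString(counts));
    }
}
```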
   
   Then add some more documentation to the Hasher interface to capture:
   
   - `size` represents the number of items; the maximum number of indices output for a shape is `n * k`.
   - `uniqueIndices` generates unique indices per item.
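   With that documentation in place, the relationship could be checked mechanically, e.g. (the names here are hypothetical, not part of the proposed API):

```Java
public class HasherContractSketch {
    // Per the proposed documentation: a hasher representing `size` items under
    // a shape with k hash functions emits at most size * k indices.
    static boolean withinContract(int size, int k, int emittedIndices) {
        return emittedIndices <= size * k;
    }

    public static void main(String[] args) {
        System.out.println(withinContract(3, 17, 51)); // true: 3 items, 17 hash functions
        System.out.println(withinContract(1, 17, 51)); // false: the size was misreported
    }
}
```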
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

