[GitHub] [commons-collections] Claude-at-Instaclustr commented on a change in pull request #258: Simplify bloom filters

GitBox Wed, 09 Feb 2022 06:58:53 -0800


Claude-at-Instaclustr commented on a change in pull request #258:
URL: 
https://github.com/apache/commons-collections/pull/258#discussion_r802752899




##########
File path: 
src/main/java/org/apache/commons/collections4/bloomfilter/hasher/HasherCollection.java
##########
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter.hasher;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import java.util.function.IntPredicate;
+
+import org.apache.commons.collections4.bloomfilter.IndexProducer;
+import org.apache.commons.collections4.bloomfilter.Shape;
+
+/**
+ * A collection of Hashers.  Useful when the generation of a Bloom filter depends upon
+ * multiple items.
+ * <p>
+ * Hashers for each item are added to the HasherCollection and then
+ * the collection is used wherever a Hasher can be used in the API.
+ * </p>
+ * @since 4.5
+ */
+public class HasherCollection implements Hasher {
+
+    /**
+     * The list of hashers to be used to generate the indices.
+     */
+    private final List<Hasher> hashers;
+
+    /**
+     * Constructs an empty HasherCollection.
+     */
+    public HasherCollection() {
+        this.hashers = new ArrayList<>();
+    }
+
+    /**
+     * Constructs a HasherCollection from a collection of Hasher objects.
+     *
+     * @param hashers A collections of Hashers to build the indices with.
+     */
+    public HasherCollection(final Collection<Hasher> hashers) {
+        Objects.requireNonNull(hashers, "hashers");
+        this.hashers = new ArrayList<>(hashers);
+    }
+
+    /**
+     * Constructor.
+     *
+     * @param hashers A list of Hashers to initialize the collection with.
+     */
+    public HasherCollection(Hasher... hashers) {
+        this(Arrays.asList(hashers));
+    }
+
+    /**
+     * Adds a hasher to the collection.
+     * @param hasher The hasher to add.
+     */
+    public void add(Hasher hasher) {
+        Objects.requireNonNull(hasher, "hasher");
+        hashers.add(hasher);
+    }
+
+    /**
+     * Add all the Hashers in a collection to this HasherCollection.
+     * @param hashers The hashers to add.
+     */
+    public void add(Collection<Hasher> hashers) {
+        Objects.requireNonNull(hashers, "hashers");
+        this.hashers.addAll(hashers);
+    }
+
+    @Override
+    public IndexProducer indices(final Shape shape) {
+        Objects.requireNonNull(shape, "shape");
+        return new IndexProducer() {
+            @Override
+            public boolean forEachIndex(IntPredicate consumer) {
+                for (Hasher hasher : hashers) {
+                    if (!hasher.indices(shape).forEachIndex(consumer)) {

Review comment:
       I think there is a misunderstanding of what the HasherCollection does.  
   
   ```
   HasherCollection hc = new HasherCollection( hasher1, hasher2 );
   bloomFilter.merge( hc );
   ```
   is equivalent to
   ```
   bloomFilter.merge( hasher1 );
   bloomFilter.merge( hasher2 );
   ```
   So it is expected to send duplicates.  Let's explore the idea of removing
   the requirement to eliminate duplicates.
   
   - If there are duplicates, there are no issues for the standard Bloom filters.
   - Specialized filters (like the counting Bloom filter) can remove duplicates
     from most hashers.
   - Collections of filters cause problems for specialized filters.
   - There are 2 types of collections:
       1. HasherCollection, which looks like a bunch of single hashers.  It
          actually reports the number of items as the number of hashers in the
          collection.  (Use case: creating a series of hashers that can then be
          merged into multiple filters of different shapes using the simple
          merge() or mergeInPlace() methods.)
       2. SingleItemHasherCollection, which is a HasherCollection that reports
          as a single item.  (Use case: collections where a group of simple
          hashers is to be considered a single item.  For example, I have some
          code that builds reference Bloom filters for geonames data.  Each
          filter comprises the name, country_code and feature_code.  That code
          produces a HasherCollection.  If I want to take multiple
          HasherCollections and feed them into a counting filter, I have to do
          it one at a time, or I can create a SingleItemHasherCollection for
          each geonames item, then place those into a HasherCollection and pass
          that to the counting filter.)
   - The goal is to have each single (non-collection) hasher appear to the
     counting filter as though it were a Bloom filter, without the overhead of
     creating the Bloom filter.  For single hashers this would mean adding the
     Filter.  For collections, a clean filter (clean meaning it has seen no
     values yet) needs to be applied to each of the enclosed hashers.
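To make the first bullets concrete, here is a minimal, self-contained sketch. It does not use the PR's classes: `DuplicateIndexSketch`, `mergeBits` and `mergeCounts` are hypothetical names, and each `int[]` stands in for the indices one hasher would produce for a given shape. It only illustrates why duplicates are harmless when setting bits but need per-item de-duplication when counting.

```java
import java.util.BitSet;

final class DuplicateIndexSketch {

    /** Standard Bloom filter merge: setting a bit twice changes nothing. */
    static BitSet mergeBits(int numberOfBits, int[]... hasherOutputs) {
        BitSet bits = new BitSet(numberOfBits);
        for (int[] indices : hasherOutputs) {
            for (int i : indices) {
                bits.set(i);
            }
        }
        return bits;
    }

    /**
     * Counting merge (assumed semantics): every hasher is one item, so an
     * index is counted at most once per hasher, however often it repeats.
     */
    static int[] mergeCounts(int numberOfBits, int[]... hasherOutputs) {
        int[] counts = new int[numberOfBits];
        for (int[] indices : hasherOutputs) {
            // A clean "seen" set for each item, i.e. it has seen no values yet.
            BitSet seen = new BitSet(numberOfBits);
            for (int i : indices) {
                if (!seen.get(i)) {
                    seen.set(i);
                    counts[i]++;
                }
            }
        }
        return counts;
    }
}
```

With hasher outputs `{1, 1, 2}` and `{2, 3}`, the bit set is the same as for `{1, 2, 3}`, while the counts at indices 1, 2 and 3 are 1, 2 and 1: the repeated 1 inside the first hasher is dropped, but the 2 shared by both hashers is counted twice because they are separate items.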
   
   Perhaps the solution is to add a method to Hasher, `indices( Shape, Filter )`,
   where Filter is
   
   ```
   abstract class Filter {
       abstract boolean test(int index);
       void reset() { /* no op by default */ }
   }
   ```
   This would allow the HasherCollection to push the filter down to the
   enclosed hashers and perform a reset after each use.
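A rough sketch of how that push-down might look, under stated assumptions: everything here is hypothetical and stands apart from the PR's code. The `Hasher` below is a simplified stand-in that omits `Shape` and takes the filter plus an `IntPredicate` directly, `DedupFilter` is one assumed implementation of the proposed `Filter`, and `simple`/`collection` are illustrative factories.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

final class FilterPushDownSketch {

    /** The proposed Filter: test an index, reset between uses. */
    abstract static class Filter {
        abstract boolean test(int index);
        void reset() { /* no op by default */ }
    }

    /** Assumed implementation: passes an index only the first time it is seen. */
    static final class DedupFilter extends Filter {
        private final BitSet seen = new BitSet();

        @Override
        boolean test(int index) {
            if (seen.get(index)) {
                return false;
            }
            seen.set(index);
            return true;
        }

        @Override
        void reset() {
            seen.clear();
        }
    }

    /** Simplified stand-in for Hasher; the Shape argument is left out. */
    interface Hasher {
        boolean forEachIndex(Filter filter, IntPredicate consumer);
    }

    /** A single hasher that emits fixed indices, applying the filter itself. */
    static Hasher simple(int... indices) {
        return (filter, consumer) -> {
            for (int i : indices) {
                if (filter.test(i) && !consumer.test(i)) {
                    return false;
                }
            }
            return true;
        };
    }

    /** A collection pushes the filter down and resets it after each use. */
    static Hasher collection(Hasher... hashers) {
        return (filter, consumer) -> {
            for (Hasher hasher : hashers) {
                boolean keepGoing = hasher.forEachIndex(filter, consumer);
                filter.reset(); // the next enclosed hasher sees a clean filter
                if (!keepGoing) {
                    return false;
                }
            }
            return true;
        };
    }
}
```

In this sketch, `collection(simple(1, 1, 2), simple(2, 3, 3))` with a `DedupFilter` passes 1, 2, 2, 3 to the consumer: duplicates inside one hasher are removed, while the index shared by both hashers is still delivered once per hasher. A single-item collection could presumably skip the reset between its enclosed hashers, but that is a guess rather than something stated above.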




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

