[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

via GitHub Tue, 01 Aug 2023 16:52:42 -0700


aherbert commented on code in PR #406:
URL: 
https://github.com/apache/commons-collections/pull/406#discussion_r1281192120



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The 
term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the 
equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be 
ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one 
matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not 
output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for 
the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count 
is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit 
flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The 
consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, 
{@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code 
false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all 
cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a 
cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the 
IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be 
created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx 
).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {
+                        return false;
+                    }
+                }
+                return true;
+            }
+
+            /**
+             * Class to track cell values in the TreeMap.
+             */
+            final class CounterCell implements Comparable<CounterCell> {
+                final int idx;
+                int count;
+
+                CounterCell(int idx, int count) {
+                    this.idx = idx;
+                    this.count = count;
+                }
+
+                @Override
+                public int compareTo(CounterCell other) {
+                    return Integer.compare( idx,  other.idx);
+                }
+            }
+        };
+    }
+
+    /**
+     * Represents an operation that accepts an {@code <index, cell>} pair.

Review Comment:
   Refers to `<index, cell>` pair. The rest of the interface is `<index, 
count>`.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The 
term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the 
equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be 
ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one 
matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not 
output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for 
the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count 
is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit 
flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The 
consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, 
{@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code 
false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all 
cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a 
cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the 
IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be 
created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx 
).toArray();

Review Comment:
   whitespace



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The 
term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the 
equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be 
ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one 
matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not 
output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for 
the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count 
is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit 
flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The 
consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, 
{@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code 
false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all 
cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a 
cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the 
IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be 
created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx 
).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {
+                        return false;
+                    }
+                }
+                return true;
+            }
+
+            /**
+             * Class to track cell values in the TreeMap.
+             */
+            final class CounterCell implements Comparable<CounterCell> {
+                final int idx;
+                int count;
+
+                CounterCell(int idx, int count) {
+                    this.idx = idx;
+                    this.count = count;
+                }
+
+                @Override
+                public int compareTo(CounterCell other) {
+                    return Integer.compare( idx,  other.idx);

Review Comment:
   Whitespace



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -32,6 +33,8 @@
 @FunctionalInterface
 public interface IndexProducer {
 
+    int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review Comment:
   This should not be in the interface. It makes it a public variable.
   
   I suggest moving the array sizing functionality to a package private class, 
e.g.:
   ```java
   final class IndexUtils {
       // Ensure the array can add an element at the specified index
       static int[] ensureCapacityForAdd(int[] array, int index) {
           // ...
       }
   }
   ```
   There is no other static helper class suitable, so just create an 
appropriately named class.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexFilter.java:
##########


Review Comment:
   All these changes are not related to this PR. Please revert.
   
   They can be considered separately. By chaining predicates using `and`, `or` 
and `negate` you are adding a layer of obfuscation where the method calls may 
make the filter less efficient. A performance benchmark would useful here.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, 
size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the 
guarantee that no
+     * duplicate values will be returned. The indices produced are equivalent 
to those returned
+     * from by a Bloom filter created from this hasher.</p>
+     *
+     * @return the iterator of integers
+     * @throws IndexOutOfBoundsException if any index is less than 0
+     */
+    default IndexProducer uniqueIndices() {
+        final BitSet bitSet = new BitSet();
         forEachIndex(i -> {
-            result.set(i);
+            bitSet.set(i);
             return true;
         });
-        return result.stream().toArray();
+
+        return new IndexProducer() {
+            @Override
+            public boolean forEachIndex(IntPredicate predicate) {
+                for (int idx = bitSet.nextSetBit(0); idx >= 0; idx = 
bitSet.nextSetBit(idx+1)) {

Review Comment:
   `idx + 1`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, 
size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the 
guarantee that no

Review Comment:
   Since you moved this from Hasher it refers to a non-existent method 
`indices(Shape)`. This paragraph also refers to `this hasher`
   



##########
src/main/java/org/apache/commons/collections4/bloomfilter/BloomFilter.java:
##########
@@ -309,4 +309,12 @@ default int estimateIntersection(final BloomFilter other) {
         }
         return estimate>Integer.MAX_VALUE?Integer.MAX_VALUE:(int) estimate;
     }
+
+    /**
+     * Most Bloom filter's create unique IndexProducers.

Review Comment:
   `filter's` to `filters`
   
   This is low quality javadoc.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, 
size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.

Review Comment:
   Creates an IndexProducer that returns the same indices as this instance, 
removing duplicates so that each index will be output only once.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########


Review Comment:
   This file has lost the git history from `AbstractBitCountProducerTest`. It 
is showing as a new file and the old test as deleted.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########


Review Comment:
   This file has lost the git history from `BitCountProducer`. It is showing as 
a new file and the old interface as deleted.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, 
size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the 
guarantee that no
+     * duplicate values will be returned. The indices produced are equivalent 
to those returned
+     * from by a Bloom filter created from this hasher.</p>
+     *
+     * @return the iterator of integers
+     * @throws IndexOutOfBoundsException if any index is less than 0

Review Comment:
   This IOOB exception is only relevant to the default implementation. As such 
I would remove it from the `@throws` and document the default implementation. 
"The default implementation will filter the indices from this instance and 
return them in ascending order. Any indices less than zero are invalid and will 
generate a runtime exception."



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends 
AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected 
indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should 
be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be 
true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected 
index length and value length do not match");
+        int[] idx = {0};
+        createProducer().forEachCell((i, j) -> {
+            assertEquals(expectedIdx[idx[0]], i, "bad index at "+idx[0]);

Review Comment:
   `" + idx`
   Same on next line



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int 
from11) {
+        BloomFilter bfFrom0 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(0, 1)));
+        assertEquals(0, bf.getMaxInsert(bfFrom0));
+        assertEquals(0, bf.getMaxInsert((BitMapProducer) bfFrom0));
+        assertEquals(0, bf.getMaxInsert((IndexProducer) bfFrom0));
+
+        assertEquals(from1, bf.getMaxInsert(TestingHashers.FROM1));
+        assertEquals(from1, bf.getMaxInsert(bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((BitMapProducer) bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((IndexProducer) bfFrom1));
+
+        assertEquals(from11, bf.getMaxInsert(TestingHashers.FROM11));
+        assertEquals(from11, bf.getMaxInsert(bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((BitMapProducer) bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((IndexProducer) bfFrom11));
+    }
+
+    @Test
+    public void testGetMaxInsert() {
+        CountingBloomFilter bf = createEmptyFilter(getTestShape());
+        verifyMaxInsert(bf, 0, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 2, 0);
+        bf.merge(TestingHashers.FROM11);
+        verifyMaxInsert(bf, 2, 1);
+        bf.remove(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 1);
+        // verify remove false positive works
+        // Incrementing hasher 5,1 spans the single count cells for both FROM1 
and FROM11
+        assertEquals(1, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+        bf.remove(new IncrementingHasher(5, 1));
+        verifyMaxInsert(bf, 0, 0);
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+    }
+
+    private void assertCell3(CountingBloomFilter bf, int value) {
+        bf.forEachCell((k, v) -> {
+            if (k == 3) {
+                assertEquals(value, v, "Mismatch at position (3) "+k);
+            } else {
+                assertEquals(0, v, "Mismatch at position "+k);

Review Comment:
   `" + k`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int 
from11) {
+        BloomFilter bfFrom0 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(0, 1)));
+        assertEquals(0, bf.getMaxInsert(bfFrom0));
+        assertEquals(0, bf.getMaxInsert((BitMapProducer) bfFrom0));
+        assertEquals(0, bf.getMaxInsert((IndexProducer) bfFrom0));
+
+        assertEquals(from1, bf.getMaxInsert(TestingHashers.FROM1));
+        assertEquals(from1, bf.getMaxInsert(bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((BitMapProducer) bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((IndexProducer) bfFrom1));
+
+        assertEquals(from11, bf.getMaxInsert(TestingHashers.FROM11));
+        assertEquals(from11, bf.getMaxInsert(bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((BitMapProducer) bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((IndexProducer) bfFrom11));
+    }
+
+    @Test
+    public void testGetMaxInsert() {
+        CountingBloomFilter bf = createEmptyFilter(getTestShape());
+        verifyMaxInsert(bf, 0, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 2, 0);
+        bf.merge(TestingHashers.FROM11);
+        verifyMaxInsert(bf, 2, 1);
+        bf.remove(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 1);
+        // verify remove false positive works
+        // Incrementing hasher 5,1 spans the single count cells for both FROM1 
and FROM11
+        assertEquals(1, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+        bf.remove(new IncrementingHasher(5, 1));
+        verifyMaxInsert(bf, 0, 0);
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+    }
+
+    private void assertCell3(CountingBloomFilter bf, int value) {
+        bf.forEachCell((k, v) -> {
+            if (k == 3) {
+                assertEquals(value, v, "Mismatch at position (3) "+k);
+            } else {
+                assertEquals(0, v, "Mismatch at position "+k);
+            }
+            return true;
+        });
+    }
+
+    @Test
+    public void mergeIncrementsAllCellsTest() {
+        CountingBloomFilter f1 = createEmptyFilter(Shape.fromKM(1, 10));
+        CountingBloomFilter f2 = f1.copy();
+        CountingBloomFilter f3 = f1.copy();
+        // index producer produces 3 two times.
+        IndexProducer ip = p -> {
+            p.test(3);
+            p.test(3);
+            return true;
+        };
+        // The merge should increment cell 3 by 1
+        f1.merge(ip);
+        assertCell3(f1, 1);
+
+        // The add should increment cells 3 by 2
+        f2.add(CellProducer.from(ip));
+        assertCell3(f2, 2);
+
+        // This merge will increment by 1 as the round-trip makes the indices 
unique

Review Comment:
   This test is redundant. The round trip does not make the indices unique. So 
the test is effectively:
   ```java
   f3.merge(ip)
   ```
   This is same as the test using f1.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/CellProducerFromDefaultIndexProducerTest.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class CellProducerFromDefaultIndexProducerTest extends 
AbstractCellProducerTest {
+
+    int[] data = { 0, 63, 1, 64, 128, 1, 127 };

Review Comment:
   whitespace



##########
src/test/java/org/apache/commons/collections4/bloomfilter/CellProducerFromArrayCountingBloomFilterTest.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class CellProducerFromArrayCountingBloomFilterTest extends 
AbstractCellProducerTest {
+
+    protected Shape shape = Shape.fromKM(17, 72);
+
+    @Override
+    protected CellProducer createProducer() {
+        final ArrayCountingBloomFilter filter = new 
ArrayCountingBloomFilter(shape);
+        filter.merge(new IncrementingHasher(0, 1));
+        filter.merge(new IncrementingHasher(5, 1));
+        return filter;
+    }
+
+    @Override
+    protected CellProducer createEmptyProducer() {
+        return new ArrayCountingBloomFilter(shape);
+    }
+
+    @Override
+    protected int[] getExpectedIndices() {
+        return new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
15, 16, 17, 18, 19, 20, 21};
+    }
+
+    @Override
+    protected int[] getExpectedValues() {
+        return new int[] { 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
1, 1, 1, 1, 1};

Review Comment:
   whitespace after {



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -119,4 +118,42 @@ public void testFromIndexArray() {
             assertArrayEquals(expected, ip.asIndexArray());
         }
     }
+
+    @Test
+    public void testMoreThan32Entries() {
+        int[] values = new int[33];
+        for (int i=0; i<33; i++) {

Review Comment:
   int[] values = IntStream.range(0, 33).toArray();



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The 
term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the 
equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be 
ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one 
matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not 
output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for 
the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count 
is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit 
flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The 
consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, 
{@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code 
false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all 
cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a 
cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the 
IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be 
created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx 
).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {

Review Comment:
   whitespace



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -119,4 +118,42 @@ public void testFromIndexArray() {
             assertArrayEquals(expected, ip.asIndexArray());
         }
     }
+
+    @Test
+    public void testMoreThan32Entries() {
+        int[] values = new int[33];
+        for (int i=0; i<33; i++) {
+            values[i] = i;
+        }
+        IndexProducer producer =  predicate -> {
+            Objects.requireNonNull(predicate);
+            for (final int i : values) {
+                if (!predicate.test(i)) {
+                    return false;
+                }
+            }
+            return true;
+        };
+        int[] other = producer.asIndexArray();
+        assertArrayEquals(values, other);
+    }
+
+    @Test
+    public void test32Entries() {

Review Comment:
   You can combine this with the test above as:
   ```java
   @ParameterizedTest
   @ValueSource(ints = {32, 33})
   void testEntries(int size) {
       int[] values = IntStream.range(0, size).toArray();
   ```
   



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends 
AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected 
indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should 
be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be 
true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected 
index length and value length do not match");

Review Comment:
   whitespace



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -103,40 +172,41 @@ default boolean merge(final BloomFilter other) {
     /**
      * Merges the specified Hasher into this Bloom filter.
      *
-     * <p>Specifically: all counts for the unique indexes identified by the 
{@code hasher} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the unique indexes identified by the 
{@code hasher} will be incremented by 1.</p>
      *
      * <p>This method will return {@code true} if the filter is valid after 
the operation.</p>
      *
      * @param hasher the hasher
      * @return {@code true} if the removal was successful and the state is 
valid
      * @see #isValid()
-     * @see #add(BitCountProducer)
+     * @see #add(CellProducer)
      */
     @Override
     default boolean merge(final Hasher hasher) {
         Objects.requireNonNull(hasher, "hasher");
-        return merge(hasher.uniqueIndices(getShape()));
+        return merge(hasher.indices(getShape()));
     }
 
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code 
indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all unique cells for the indices identified by the 
{@code indexProducer} will be incremented by 1.</p>
      *
      * <p>This method will return {@code true} if the filter is valid after 
the operation.</p>
      *
-     * <p>Note: Indices that are returned multiple times will be incremented 
multiple times.</p>
+     * <p>Note: If indices that are returned multiple times should be 
incremented multiple times convert the IndexProducer
+     * to a CellProducer and add that.</p>
      *
      * @param indexProducer the IndexProducer
      * @return {@code true} if the removal was successful and the state is 
valid
      * @see #isValid()
-     * @see #add(BitCountProducer)
+     * @see #add(CellProducer)
      */
     @Override
     default boolean merge(final IndexProducer indexProducer) {
         Objects.requireNonNull(indexProducer, "indexProducer");
         try {
-            return add(BitCountProducer.from(indexProducer));
+            return add(CellProducer.from(indexProducer.uniqueIndices()));
         } catch (final IndexOutOfBoundsException e) {

Review Comment:
   Note: Although not related to this PR I find it presumptuous that the code 
may throw an index out of bounds exception. This exception is possible from the 
`uniqueIndices` default implementation (see my comments on that elsewhere in 
this PR), and if the `indexProducer` has its own unique indices and is broken, 
it would be from the `add` method of the implementing counting Bloom filter. 
That is allowed to throw whatever runtime exception it desires. So why are we 
wrapping this one possible exception, and then re-throwing it as another RTE 
when neither are documented? This should be discussed somewhere (e.g. mailing 
list).



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends 
AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected 
indices entries.

Review Comment:
   `aligns`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends 
AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected 
indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should 
be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be 
true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected 
index length and value length do not match");
+        int[] idx = {0};
+        createProducer().forEachCell((i, j) -> {
+            assertEquals(expectedIdx[idx[0]], i, "bad index at "+idx[0]);
+            assertEquals(expectedValue[idx[0]], j, "bad value at "+idx[0]);
+            idx[0]++;
+            return true;
+        });
+    }
+
+    /**
+     * Test the behavior of {@link CellProducer#forEachCell(CellConsumer)} 
with respect
+     * to ordered and distinct indices. Currently the behavior is assumed to 
be the same as
+     * {@link IndexProducer#forEachIndex(java.util.function.IntPredicate)}.
+     */
+    @Test
+    public final void testBehaviourForEachCell() {
+        final IntList list = new IntList();
+        createProducer().forEachCell((i, j) -> list.add(i));
+        final int[] actual = list.toArray();
+        // check order
+        final int[] expected = Arrays.stream(actual).sorted().toArray();
+        assertArrayEquals(expected, actual);
+        // check distinct
+        final long count = Arrays.stream(actual).distinct().count();
+        assertEquals(count, actual.length);
+    }
+
+    @Test
+    public void testForEachCellEarlyExit() {
+        final int[] passes = new int[1];
+        assertTrue(createEmptyProducer().forEachCell((i, j) -> {
+            passes[0]++;
+            return false;
+        }));
+        assertEquals(0, passes[0]);
+
+        assertFalse(createProducer().forEachCell((i, j) -> {

Review Comment:
   This test heavily assumes that the `createProducer` will have more than 1 
cell populated so you can early exit.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {

Review Comment:
   Note: This method now avoids the BitSet. As such I think we can remove the 
javadoc warning that the default implementation is slow. The note about 
returning in unique values in order is wrong. The default implementation now 
returns an array using the same order and cardinality as the `forEachIndex` 
method.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int 
from11) {

Review Comment:
   `(Counting`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultCellProducerTest.java:
##########
@@ -16,21 +16,27 @@
  */
 package org.apache.commons.collections4.bloomfilter;
 
-public class DefaultBitCountProducerTest extends AbstractBitCountProducerTest {
+public class DefaultCellProducerTest extends AbstractCellProducerTest {
 
     /** Make forEachIndex unordered and contain duplicates. */
-    private final int[] values = {10, 1, 10, 1};
+    private final int[] indices = {1, 2, 3, 5};
+    private final int[] values = {1, 4, 9, 25};
 
     @Override
     protected int[] getExpectedIndices() {
+        return indices;
+    }
+
+    @Override
+    protected int[] getExpectedValues() {
         return values;
     }
 
     @Override
-    protected BitCountProducer createProducer() {
+    protected CellProducer createProducer() {
         return consumer -> {
-            for (final int i : values) {
-                if (!consumer.test(i, 1)) {
+            for (int i=0; i<indices.length; i++) {

Review Comment:
   `for (int i = 0; i < indices.length; i++) {`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes 
results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int 
from11) {
+        BloomFilter bfFrom0 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new 
DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(0, 1)));
+        assertEquals(0, bf.getMaxInsert(bfFrom0));
+        assertEquals(0, bf.getMaxInsert((BitMapProducer) bfFrom0));
+        assertEquals(0, bf.getMaxInsert((IndexProducer) bfFrom0));
+
+        assertEquals(from1, bf.getMaxInsert(TestingHashers.FROM1));
+        assertEquals(from1, bf.getMaxInsert(bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((BitMapProducer) bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((IndexProducer) bfFrom1));
+
+        assertEquals(from11, bf.getMaxInsert(TestingHashers.FROM11));
+        assertEquals(from11, bf.getMaxInsert(bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((BitMapProducer) bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((IndexProducer) bfFrom11));
+    }
+
+    @Test
+    public void testGetMaxInsert() {
+        CountingBloomFilter bf = createEmptyFilter(getTestShape());
+        verifyMaxInsert(bf, 0, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 2, 0);
+        bf.merge(TestingHashers.FROM11);
+        verifyMaxInsert(bf, 2, 1);
+        bf.remove(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 1);
+        // verify remove false positive works
+        // Incrementing hasher 5,1 spans the single count cells for both FROM1 
and FROM11
+        assertEquals(1, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+        bf.remove(new IncrementingHasher(5, 1));
+        verifyMaxInsert(bf, 0, 0);
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+    }
+
+    private void assertCell3(CountingBloomFilter bf, int value) {
+        bf.forEachCell((k, v) -> {
+            if (k == 3) {
+                assertEquals(value, v, "Mismatch at position (3) "+k);

Review Comment:
   Drop the `+k`, or drop the `(3)` as they are the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Reply via email to