aherbert commented on code in PR #335:
URL:
https://github.com/apache/commons-collections/pull/335#discussion_r1003576587
##########
src/main/java/org/apache/commons/collections4/bloomfilter/Hasher.java:
##########
@@ -49,7 +49,8 @@ public interface Hasher {
* Creates an IndexProducer of unique indices for this hasher based on the
Shape.
*
* <p>This is like the `indices(Shape)` method except that it adds the
guarantee that no
- * duplicate values will be returned</p>
+ * duplicate values will be returned. The indices produced are equivalent
those returned
Review Comment:
`equivalent to those`
##########
src/main/java/org/apache/commons/collections4/bloomfilter/HasherCollection.java:
##########
@@ -90,6 +90,20 @@ public IndexProducer indices(final Shape shape) {
return new HasherCollectionIndexProducer(shape);
}
+ /**
+ * Creates an IndexProducer of comprising the unique indices from each of
the contained
Review Comment:
delete `of` before comprising
##########
src/main/java/org/apache/commons/collections4/bloomfilter/HasherCollection.java:
##########
@@ -90,6 +90,20 @@ public IndexProducer indices(final Shape shape) {
return new HasherCollectionIndexProducer(shape);
}
+ /**
+ * Creates an IndexProducer of comprising the unique indices from each of
the contained
+ * hashers.
+ *
+ * <p>This method may return duplicates if the collection of unique values
from each of the contained
+ * hashers contain duplicates. This is equivalent to creating Bloom
filters for each contained hasher
+ * and returning concatenated IndexProducer from each.</p>
+ *
+ * <p>A BitCountProducer generated from this IndexProducer is equivalent
to a BitCountProducer from a
+ * counting Bloom filter that was constructed from the contained
hashers.<p>
Review Comment:
IIUC this is only true if the counting Bloom filter was constructed from
unique indices from the contained hashers. So perhaps change to: `constructed
from the contained hashers unique indices.`
##########
src/test/java/org/apache/commons/collections4/bloomfilter/BitCountProducerFromAbsoluteUniqueHasherCollectionTest.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+
+public class BitCountProducerFromAbsoluteUniqueHasherCollectionTest extends
AbstractBitCountProducerTest {
+
+ @Override
+ protected BitCountProducer createProducer() {
+ // hasher has collisions and wraps
+ return BitCountProducer.from(new HasherCollection(
+ new IncrementingHasher(1, 1),
Review Comment:
This tests a saturation of the number of bits. We should really have a test
for non saturation, no collisions just to make sure that works as well.
##########
src/main/java/org/apache/commons/collections4/bloomfilter/HasherCollection.java:
##########
@@ -141,29 +172,16 @@ public boolean forEachIndex(IntPredicate consumer) {
@Override
public int[] asIndexArray() {
- List<int[]> lst = new ArrayList<>();
- int[] count = new int[1];
- /*
- * This method needs to return duplicate indices
- */
- for (Hasher hasher : hashers) {
- int[] ary = hasher.indices(shape).asIndexArray();
- lst.add(ary);
- count[0] += ary.length;
- }
- if (lst.isEmpty()) {
- return new int[0];
- }
- if (lst.size() == 1) {
- return lst.get(0);
- }
- int[] result = new int[count[0]];
- int offset = 0;
- for (int[] ary : lst) {
- System.arraycopy(ary, 0, result, offset, ary.length);
- offset += ary.length;
- }
- return result;
+ int[] result = new
int[shape.getNumberOfHashFunctions()*hashers.size()];
Review Comment:
This is a much better implementation.
Can you add whitespace around the multiply ` * `.
##########
src/test/java/org/apache/commons/collections4/bloomfilter/BitCountProducerFromAbsoluteUniqueHasherCollectionTest.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+
+public class BitCountProducerFromAbsoluteUniqueHasherCollectionTest extends
AbstractBitCountProducerTest {
+
+ @Override
+ protected BitCountProducer createProducer() {
+ // hasher has collisions and wraps
+ return BitCountProducer.from(new HasherCollection(
+ new IncrementingHasher(1, 1),
+ new IncrementingHasher(2,
2)).absoluteUniqueIndices(Shape.fromKM(11, 10)));
+ }
+
+ @Override
+ protected BitCountProducer createEmptyProducer() {
+ return BitCountProducer.from(new
HasherCollection().uniqueIndices(Shape.fromKM(11, 10)));
+ }
+
+ @Override
+ protected int getBehaviour() {
+ // Hasher allows duplicates and may be unordered
+ return 0;
Review Comment:
Should this have flags for distinct?
##########
src/test/java/org/apache/commons/collections4/bloomfilter/BitCountProducerFromUniqueHasherCollectionTest.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+
+public class BitCountProducerFromUniqueHasherCollectionTest extends
AbstractBitCountProducerTest {
+
+ @Override
+ protected BitCountProducer createProducer() {
+ // hasher has collisions and wraps
+ return BitCountProducer.from(new HasherCollection(
+ new IncrementingHasher(1, 1),
+ new IncrementingHasher(2, 2)).uniqueIndices(Shape.fromKM(11,
10)));
+ }
+
+ @Override
+ protected BitCountProducer createEmptyProducer() {
+ return
BitCountProducer.from(NullHasher.INSTANCE.uniqueIndices(Shape.fromKM(11, 10)));
+ }
+
+ @Override
+ protected int getBehaviour() {
+ // Hasher allows duplicates and may be unordered
Review Comment:
`// HasherCollection allows ...`
##########
src/test/java/org/apache/commons/collections4/bloomfilter/EnhancedDoubleHasherTest.java:
##########
@@ -35,6 +36,12 @@ protected Hasher createEmptyHasher() {
return NullHasher.INSTANCE;
}
+
Review Comment:
Remove extra line
##########
src/main/java/org/apache/commons/collections4/bloomfilter/HasherCollection.java:
##########
@@ -90,6 +90,20 @@ public IndexProducer indices(final Shape shape) {
return new HasherCollectionIndexProducer(shape);
}
+ /**
+ * Creates an IndexProducer of comprising the unique indices from each of
the contained
+ * hashers.
+ *
+ * <p>This method may return duplicates if the collection of unique values
from each of the contained
+ * hashers contain duplicates. This is equivalent to creating Bloom
filters for each contained hasher
+ * and returning concatenated IndexProducer from each.</p>
Review Comment:
returning an IndexProducer with the concatenated output indices from each
filter.
##########
src/test/java/org/apache/commons/collections4/bloomfilter/IndexProducerFromAbsoluteUniqueHasherCollectionTest2.java:
##########
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class IndexProducerFromAbsoluteUniqueHasherCollectionTest2 extends
AbstractIndexProducerTest {
Review Comment:
Remove the 2 from the name suffix
##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractIndexProducerTest.java:
##########
@@ -93,8 +98,36 @@ int[] toArray() {
*/
protected abstract int getBehaviour();
+ /**
+ * Creates an array of expected indices.
+ * @return an array of expected indices.
+ */
+ protected abstract int[] getExpectedIndices();
+
+ @Test
+ public final void testAsIndexArrayValues() {
+ List<Integer> lst = new ArrayList<>();
Review Comment:
You have not removed the redundant part of this test. The distinctness is
tested in `testBehaviourAsIndexArray`.
##########
src/main/java/org/apache/commons/collections4/bloomfilter/HasherCollection.java:
##########
@@ -106,6 +120,23 @@ public boolean forEachIndex(IntPredicate consumer) {
};
}
+ /**
+ * Creates an IndexProducer of comprising the unique indices across all
the contained
+ * hashers.
+ *
+ * <p>This is equivalent to an IndexProducer created from a Bloom filter
that comprises all
+ * the contained hashers.</p>
+ *
+ * @param shape the shape of the desired Bloom filter.
+ * @return the iterator of integers
+ */
+ public IndexProducer absoluteUniqueIndices(final Shape shape) {
+ return consumer -> {
+ Objects.requireNonNull(consumer, "consumer");
+ return uniqueIndices(shape).forEachIndex(IndexFilter.create(shape,
consumer));
Review Comment:
IIUC this can exceed the capacity of the IndexFilter. The
uniqueIndices(shape) will output up to k indices for each hasher in the
collection (eliminating duplicates). Thus if we have two hashers with
shape(k=5, n=1000) we could output 10 indices from the combined unique indices.
Then we create an IndexFilter for shape which could be optimised only to check
up to k=5 indices.
This throws an index out of bounds exception from IndexFilter.ArrayTracker.
```Java
@Test
public void testAbsoluteUniqueIndices() {
int[] actual = new HasherCollection(
new IncrementingHasher(1, 1),
new IncrementingHasher(10, 1)
).absoluteUniqueIndices(Shape.fromKM(5, 1000)).asIndexArray();
int[] expected = IntStream.concat(
IntStream.range(1, 1 + 5),
IntStream.range(10, 10 + 5)
).toArray();
Assertions.assertArrayEquals(expected, actual);
}
```
When I update HasherCollection absoluteUniqueIndices to use a fake shape
that forces the IndexTracker.BitMapTracker implementation then the above tests
passes:
```Java
public IndexProducer absoluteUniqueIndices(final Shape shape) {
return consumer -> {
Objects.requireNonNull(consumer, "consumer");
return uniqueIndices(shape).forEachIndex(IndexFilter.create(
Shape.fromKM(shape.getNumberOfBits(),
shape.getNumberOfBits()), consumer));
};
}
```
I think that this can be easily solved by having a new package-private
factory constructor in IndexFilter to choose the implementation that can handle
exceeding the number of hash functions in the shape:
```Java
static IntPredicate create(int numberOfBits, IntPredicate consumer) {
// Choose the BitMapTracker implementation to handle any number of hash
functions
```
Or you could create a new Shape as:
```Java
public IndexProducer absoluteUniqueIndices(final Shape shape) {
return consumer -> {
Objects.requireNonNull(consumer, "consumer");
return uniqueIndices(shape).forEachIndex(IndexFilter.create(
Shape.fromKM(shape.getNumberOfHashFunctions() *
hashers.size(),
shape.getNumberOfBits()), consumer));
};
}
```
Perhaps the later is simpler.
##########
src/test/java/org/apache/commons/collections4/bloomfilter/UniqueIndexProducerFromHasherCollectionTest.java:
##########
@@ -36,4 +36,11 @@ protected int getBehaviour() {
// index from each hasher. The result is there may still be duplicates.
return 0;
}
+
+
Review Comment:
Remove extra line
##########
src/test/java/org/apache/commons/collections4/bloomfilter/IndexProducerFromAbsoluteUniqueHasherCollectionTest2.java:
##########
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class IndexProducerFromAbsoluteUniqueHasherCollectionTest2 extends
AbstractIndexProducerTest {
+
+ // selecting 11 items from the range [0,9] will cause a collision
+ private Shape shape = Shape.fromKM(11, 10);
+
+ @Override
+ protected IndexProducer createProducer() {
+ return new HasherCollection(new IncrementingHasher(1, 1), new
IncrementingHasher(2, 2)).absoluteUniqueIndices(shape);
+ }
+
+ @Override
+ protected IndexProducer createEmptyProducer() {
+ return new HasherCollection().absoluteUniqueIndices(shape);
+ }
+
+ @Override
+ protected int[] getExpectedIndices() {
+ return new int[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
+ }
+
+ @Override
+ protected int getBehaviour() {
+ // HasherCollection allows duplicates and may be unordered
Review Comment:
Update comment as duplicates are not allowed from absoluteUniqueIndices.
Q. Should forEach also be distinct?
##########
src/test/java/org/apache/commons/collections4/bloomfilter/BitCountProducerFromUniqueHasherTest.java:
##########
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class BitCountProducerFromUniqueHasherTest extends
AbstractBitCountProducerTest {
+
+ @Override
+ protected BitCountProducer createProducer() {
+ // hasher has collisions and wraps
+ return BitCountProducer.from(new IncrementingHasher(4,
8).uniqueIndices(Shape.fromKM(17, 72)));
+ }
+
+ @Override
+ protected BitCountProducer createEmptyProducer() {
+ return
BitCountProducer.from(NullHasher.INSTANCE.indices(Shape.fromKM(17, 72)));
+ }
+
+ @Override
+ protected int getBehaviour() {
+ // Hasher allows duplicates and may be unordered
+ return AS_ARRAY_DISTINCT;
Review Comment:
Update the comment as this does not allow duplicates
Q. Is forEach also distinct?
##########
src/test/java/org/apache/commons/collections4/bloomfilter/BitCountProducerFromUniqueHasherCollectionTest.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+
Review Comment:
Remove empty line
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]