[
https://issues.apache.org/jira/browse/COLLECTIONS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Claude Warren updated COLLECTIONS-824:
--------------------------------------
Description:
{noformat}
public boolean forEachIndex(IntPredicate consumer) {
Objects.requireNonNull(consumer, "consumer");
int bits = shape.getNumberOfBits();
/*
* Essentially this is computing a wrapped modulus from a start
point and an
* increment. So actually you only need two modulus operations
before the loop.
* This avoids any modulus operation inside the while loop. It
uses a long index
* to avoid overflow.
*/
long index = mod(initial, bits);
int inc = mod(increment, bits);
for (int functionalCount = 0; functionalCount <
shape.getNumberOfHashFunctions(); functionalCount++) {
if (!consumer.test((int) index)) {
return false;
}
index += inc;
index = index >= bits ? index - bits : index;
}
return true;
}
{noformat}
This can be fixed to be non zero by using a smaller modulus:
{noformat}
// Ensure non-zero increment
int inc = 1 + mod(increment, bits - 1);
{noformat}
This will prevent the 1/bits occurrence that the increment is zero and the hash
is poor (it will be a single index).
If bits == 1 then this will throw an exception. However if bits==1 this is an
edge case that can be handled immediately after obtaining the number of bits as:
{noformat}
if (bits == 1) {
// No need to send k duplicates, the filter shape has been
constructed incorrectly.
return consumer.test(0);
}
{noformat}
https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144
Here we have index as a long. However if the increment (which is always
positive) is subtracted we can use an int and compare to zero:
{noformat}
index -= inc;
index = index < 0 ? index + bits : index;
{noformat}
This does break a lot of unit tests. These tests have been created using
SimpleHasher knowing what the hash modulus will be. The tests should be updated
to use a FixedHasher in the test package:
{noformat}
/**
* To be used for testing only. The fixed indices must fit within the
filter shape.
*/
class FixedHasher implements Hasher {
private final int[] indices;
private int[] distinct;
FixedHasher(int... indices) {
this.indices = indices.clone();
}
@Override
public IndexProducer indices(Shape shape) {
checkIndices(shape);
return IndexProducer.fromIndexArray(indices);
}
@Override
public IndexProducer uniqueIndices(Shape shape) {
checkIndices(shape);
int[] d = distinct;
if (d == null) {
distinct = d = Arrays.stream(indices).distinct().toArray();
}
return IndexProducer.fromIndexArray(d);
}
private void checkIndices(Shape shape) {
// Check the hasher is OK for the shape
int bits = shape.getNumberOfBits();
for (int i : indices) {
Assertions.assertTrue(i < bits);
}
}
}
{noformat}
This hasher will allow filters to be constructed and manipulated. Or change it
to use % bits and drop the assertion in checkIndices.
The SimpleHasher should just be tested that it outputs random indices. I would
suggest creating a few thousand randomly seeded hashers, outputting 5 indices
from each and building a histogram of the counts of each index. This should be
uniform when tested with a Chi-square test. This can be repeated for a few
different shape sizes.
Updating the filter in this way will still implement a Krisch and Mitzenmacher
hasher where:
{noformat}
g_i(x) = (h1(x) + i h2(x)) mod m
{noformat}
The only difference is the subtraction rather than addition.
The wikipedia article on [Double Hashing
(variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that
this type of hasher when used in Bloom filters will result in a collision rate
that is twice as likely as intended.
An alternative is to provide enhanced double hashing which changes the
increment each loop iteration. This would require some additional work on
wrapping the modulus of the increment.
It may be better to change the name to KMHasher and more explicitly state that
the hash functions are:
{noformat}
g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
{noformat}
was:
The wikipedia article on [Double Hashing
(variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that
this type of hasher when used in Bloom filters will result in a collision rate
that is twice as likely as intended.
An alternative is to provide enhanced double hashing which changes the
increment each loop iteration. This would require some additional work on
wrapping the modulus of the increment.
It may be better to change the name to KMHasher and more explicitly state that
the hash functions are:
{noformat}
g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
{noformat}
> BloomFilter: change name of SimpleHasher
> ----------------------------------------
>
> Key: COLLECTIONS-824
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-824
> Project: Commons Collections
> Issue Type: Improvement
> Components: Collection
> Affects Versions: 4.5
> Reporter: Claude Warren
> Priority: Major
> Labels: bloom-filter
>
> {noformat}
> public boolean forEachIndex(IntPredicate consumer) {
> Objects.requireNonNull(consumer, "consumer");
> int bits = shape.getNumberOfBits();
> /*
> * Essentially this is computing a wrapped modulus from a
> start point and an
> * increment. So actually you only need two modulus
> operations before the loop.
> * This avoids any modulus operation inside the while loop.
> It uses a long index
> * to avoid overflow.
> */
> long index = mod(initial, bits);
> int inc = mod(increment, bits);
> for (int functionalCount = 0; functionalCount <
> shape.getNumberOfHashFunctions(); functionalCount++) {
> if (!consumer.test((int) index)) {
> return false;
> }
> index += inc;
> index = index >= bits ? index - bits : index;
> }
> return true;
> }
> {noformat}
> This can be fixed to be non zero by using a smaller modulus:
> {noformat}
> // Ensure non-zero increment
> int inc = 1 + mod(increment, bits - 1);
> {noformat}
> This will prevent the 1/bits occurrence that the increment is zero and the
> hash is poor (it will be a single index).
> If bits == 1 then this will throw an exception. However if bits==1 this is an
> edge case that can be handled immediately after obtaining the number of bits
> as:
> {noformat}
> if (bits == 1) {
> // No need to send k duplicates, the filter shape has
> been constructed incorrectly.
> return consumer.test(0);
> }
> {noformat}
> https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144
> Here we have index as a long. However if the increment (which is always
> positive) is subtracted we can use an int and compare to zero:
> {noformat}
> index -= inc;
> index = index < 0 ? index + bits : index;
> {noformat}
> This does break a lot of unit tests. These tests have been created using
> SimpleHasher knowing what the hash modulus will be. The tests should be
> updated to use a FixedHasher in the test package:
> {noformat}
> /**
> * To be used for testing only. The fixed indices must fit within the
> filter shape.
> */
> class FixedHasher implements Hasher {
> private final int[] indices;
> private int[] distinct;
> FixedHasher(int... indices) {
> this.indices = indices.clone();
> }
> @Override
> public IndexProducer indices(Shape shape) {
> checkIndices(shape);
> return IndexProducer.fromIndexArray(indices);
> }
> @Override
> public IndexProducer uniqueIndices(Shape shape) {
> checkIndices(shape);
> int[] d = distinct;
> if (d == null) {
> distinct = d = Arrays.stream(indices).distinct().toArray();
> }
> return IndexProducer.fromIndexArray(d);
> }
> private void checkIndices(Shape shape) {
> // Check the hasher is OK for the shape
> int bits = shape.getNumberOfBits();
> for (int i : indices) {
> Assertions.assertTrue(i < bits);
> }
> }
> }
> {noformat}
> This hasher will allow filters to be constructed and manipulated. Or change
> it to use % bits and drop the assertion in checkIndices.
> The SimpleHasher should just be tested that it outputs random indices. I
> would suggest creating a few thousand randomly seeded hashers, outputting 5
> indices from each and building a histogram of the counts of each index. This
> should be uniform when tested with a Chi-square test. This can be repeated
> for a few different shape sizes.
> Updating the filter in this way will still implement a Krisch and
> Mitzenmacher hasher where:
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m
> {noformat}
> The only difference is the subtraction rather than addition.
> The wikipedia article on [Double Hashing
> (variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that
> this type of hasher when used in Bloom filters will result in a collision
> rate that is twice as likely as intended.
> An alternative is to provide enhanced double hashing which changes the
> increment each loop iteration. This would require some additional work on
> wrapping the modulus of the increment.
> It may be better to change the name to KMHasher and more explicitly state
> that the hash functions are:
>
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
> {noformat}
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)