[
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960025#comment-16960025
]
Alex Herbert commented on COLLECTIONS-728:
------------------------------------------
Following on from a conversation had on GitHub for the initial BloomFilter code
contribution PR the example there was a Bloom filter created for 10,000 objects
with a false positive probability of 1/50000. This requires a Bloom filter of
225,200 bits with 16 bits set per object hash. Thus even though the Bloom
filter may require a lot of storage to represent the combination of thousands
of objects each object only requires a very small amount of storage. In
addition you only need to find a single bit index in the object hash that is
not in the Bloom filter to determine that there is no match.
Thus the common use case of checking if an object matches the filter is to
compute a relatively small set of indices and check each one is in the Bloom
filter. This would suggest the following:
{code:java}
public interface BloomFilter {
public interface BitIterator {
/**
* Get the next set bit, or -1 if no more bits are set.
*
* @return the bit index (or -1)
*/
int nextSetBit();
static BitIterator of(int... bits) {
return new BitIterator() {
int count;
@Override
public int nextSetBit() {
if (count < bits.length) {
return bits[count++];
}
return -1;
}
}
}
}
default boolean match(BitIterator bits) {
int index = bits.nextSetBit();
if (index == -1) {
// Cannot match no indices
return false;
}
while (index != -1) {
if (!get(index)) {
return false;
}
}
return true;
}
boolean get(int bitIndex);
}{code}
Thus the Bloom filter satisfies the following:
# A BloomFilter is a representation of the logical OR (combination) of many
hashes.
# A hash is a specification of indices that are set to 1.
# A BloomFilter can answer the question: is bit index _n_ set?
# A BloomFilter can answer the question: are all bit indices _N_ set?
# A "hasher" is able to convert an object to a set of indices.
# A comparison of each index in turn opens the possibility of dynamic
computation of each hash index with fail fast matching.
To use a BloomFilter would require an implementation of hashing that generates
a number of indices for your objects and a concrete representation of the
BloomFilter composite of many hashes.
This is just an idea and I can see that if constant time access to the method
{{get(int)}} is not available then the filter would be slow. It is an
alternative and can be combined with the current idea to build an entire
BloomFilter for each new object and perform a match of two BloomFilters:
{code:java}
public interface BloomFilter {
public interface BitIterator {
/**
* Get the next set bit, or -1 if no more bits are set.
*
* @return the bit index (or -1)
*/
int nextSetBit();
}
boolean match(BloomFilter other);
boolean match(BitIterator bits);
}
{code}
> BloomFilter contribution
> ------------------------
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
> Issue Type: Task
> Reporter: Claude Warren
> Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java,
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated
> collections.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)