[
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000048#comment-17000048
]
Claude Warren commented on COLLECTIONS-728:
-------------------------------------------
In all the work I have done with Bloom filters over the past 8 years, I have
come to believe that there are several pieces of information that are necessary
to determine if the hashing is the same for two filters.
1) they hashing algorithm must be the same. Generally the algorithms are
specified by name and this seems standard across the industry where hashing is
used.
2) The results of the hash have to be converted into integer type values and
the modulus taken. So the question becomes were the bits in the result buffer
assumed to be signed or unsigned, little endian or bit endian.
3) How was the hashing used. In construction the Bloom filter multiple hashes
are used. In traditional iterative methods this is done by calling the
selected hash function with a different seed for each hash required. The
second method described by Adam Kirsch and Micheal Mitzenmacher[1] has become
more common and is used in applications like Cassandra[2].
There are other considerations, like is the hashing algorithm implemented
correctly, but I believe they fall outside the normal bounds of equivalence
checking during normal operation.
I originally encoded these 3 properties into the name. While this made for
simple checks it posed other problems and was, rightly so, objected to.
Instead the 3 properties are now tracked in the HashFunctionIdentity.
The identity is separated from the hash function implementation because there
are times when the hash function implementation can not be preserved;
specifically when sending the BloomFilter across a the wire. All we really
want to send is the filter and the shape.
The shape preserves the standard configuration of the Bloom filter (number of
items, number of bits, number of hash functions) as well as the
HashFunctionIdentity. I believe this to be the minimal set necessary to
determine whether it is valid to compare two Bloom filters.
There are other cases where specific properties of the hash function identity
are required. For example, I have an implementation of Hasher that uses the
first 2 results returned from a CYCLIC hash function to allow it to build Bloom
filters with any number of items, number of bits, or number of hash functions.
This is similar to the StaticHasher but useful where different sized filters
are required to be built from the same object.
An equals method on the HashFunction would require the HashFunction object be
in some sense Serializable. That is, the object would have to be decomposable
so that its identity could be sent across the wire. I would not expect the
actual code for for the function to be sent. This would probably require that
the listener be a Java based application so that it could have the HashFunction
implementation. The HashFunctionIdentity object makes this much easier as the
component values are easily serialized.
[1] [https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf]
[2]
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java#L60]
> BloomFilter contribution
> ------------------------
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
> Issue Type: Task
> Reporter: Claude Warren
> Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java,
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated
> collections.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)