[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000048#comment-17000048
 ] 

Claude Warren commented on COLLECTIONS-728:
-------------------------------------------

In all the work I have done with Bloom filters over the past 8 years, I have 
come to believe that there are several pieces of information that are necessary 
to determine if the hashing is the same for two filters.

1) they hashing algorithm must be the same.  Generally the algorithms are 
specified by name and this seems standard across the industry where hashing is 
used.

2) The results of the hash have to be converted into integer type values and 
the modulus taken.  So the question becomes were the bits in the result buffer 
assumed to be signed or unsigned, little endian or bit endian.

3) How was the hashing used.  In construction the Bloom filter multiple hashes 
are used.  In traditional iterative methods this is done by calling the 
selected hash function with a different seed for each hash required.  The 
second method described by Adam Kirsch and Micheal Mitzenmacher[1] has become 
more common and is used in applications like Cassandra[2].

There are other considerations, like is the hashing algorithm implemented 
correctly, but I believe they fall outside the normal bounds of equivalence 
checking during normal operation.

I originally encoded these 3 properties into the name.  While this made for 
simple checks it posed other problems and was, rightly so, objected to.  
Instead the 3 properties are now tracked in the HashFunctionIdentity.

The identity is separated from the hash function implementation because there 
are times when the hash function implementation can not be preserved; 
specifically when sending the BloomFilter across a the wire.  All we really 
want to send is the filter and the shape.

The shape preserves the standard configuration of the Bloom filter (number of 
items, number of bits, number of hash functions) as well as the 
HashFunctionIdentity.  I believe this to be the minimal set necessary to 
determine whether it is valid to compare two Bloom filters.

There are other cases where specific properties of the hash function identity 
are required.  For example, I have an implementation of Hasher that uses the 
first 2 results returned from a CYCLIC hash function to allow it to build Bloom 
filters with any number of items, number of bits, or number of hash functions.  
This is similar to the StaticHasher but useful where different sized filters 
are required to be built from the same object.

An equals method on  the HashFunction would require the HashFunction object be 
in some sense Serializable.  That is, the object would have to be decomposable 
so that its identity could be sent across the wire.  I would not expect the 
actual code for for the function to be sent.  This would probably require that 
the listener be a Java based application so that it could have the HashFunction 
implementation.  The HashFunctionIdentity object makes this much easier as the 
component values are easily serialized.

 

[1] [https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf]

[2] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java#L60]

 

> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to