[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990840#comment-16990840
 ] 

Claude Warren commented on COLLECTIONS-728:
-------------------------------------------

The issue of broken hashes is a problem and I admit that the name does not 
solve that, it does stop an MD5 hash being used with a Murmur128 hash, which 
was its original purpose.

Stronger checks could be enforced, however when working with Bloom filters one 
is typically doing so in hopes of achieving a performance improvement (e.g. a 
quick check to filter out unnecessary longer lookups) in  these cases 
performing a hash on every check/insertion is extra overhead that may not be 
tolerable to the application.  As an indication of how sensitive this can be, I 
have been working on a multidimentional Bloom filter problem.  The solution is 
so time constrained that looking up items in a Trie adds enough overhead that 
in many cases the multidimentional Bloom filter is no more efficient than a 
linear search of the equivalent list of Bloom filters.  The overhead of an 
in-depth check (like the code necessary to navigate a Trie) will significantly 
impact the performance of the Bloom filter.

I like the idea.  Perhaps there is an additional utility class to be added to 
do such checks, but I don't think the check belongs in the mainline code. 

What we need is a quick, easily extensible, mechanism to determine if two hash 
functions are expected to produce the same result.  I proposed a string (name), 
but am open to other ideas.  I am  thinking of a more binary encoding 
comprising a byte of flags and a string hash name and a string for the provider 
name.  Only 2 flags defined (Signed/Unsigned and Cyclic/Iterative).  The 
verification check would ensure that the flags and name are equal.  A more in 
depth check could be implemented by anybody who wanted to restrict to a single 
implementation, etc.

Perhaps it would make more sense to follow the JCA pattern and have a factory 
that allowed users to register provider libraries and then request by name, 
flags, and optional provider.

The resulting implementation would therefore have to have methods to return the 
provider, and flag values.

Something like:
{noformat}
    public interface HashFunction extends ToLongBiFunction<byte[], Integer> {
        String getName();
        String getProvider();
        boolean isIterative();
        boolean isUnsigned();
    }
{noformat}

The factory would work like the JCA factories in that the order of the provider 
determines which implementation of a HashFunction is returned when multiple 
providers provide the "same function".





> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to