[
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990840#comment-16990840
]
Claude Warren commented on COLLECTIONS-728:
-------------------------------------------
The issue of broken hashes is a problem and I admit that the name does not
solve that, it does stop an MD5 hash being used with a Murmur128 hash, which
was its original purpose.
Stronger checks could be enforced, however when working with Bloom filters one
is typically doing so in hopes of achieving a performance improvement (e.g. a
quick check to filter out unnecessary longer lookups) in these cases
performing a hash on every check/insertion is extra overhead that may not be
tolerable to the application. As an indication of how sensitive this can be, I
have been working on a multidimentional Bloom filter problem. The solution is
so time constrained that looking up items in a Trie adds enough overhead that
in many cases the multidimentional Bloom filter is no more efficient than a
linear search of the equivalent list of Bloom filters. The overhead of an
in-depth check (like the code necessary to navigate a Trie) will significantly
impact the performance of the Bloom filter.
I like the idea. Perhaps there is an additional utility class to be added to
do such checks, but I don't think the check belongs in the mainline code.
What we need is a quick, easily extensible, mechanism to determine if two hash
functions are expected to produce the same result. I proposed a string (name),
but am open to other ideas. I am thinking of a more binary encoding
comprising a byte of flags and a string hash name and a string for the provider
name. Only 2 flags defined (Signed/Unsigned and Cyclic/Iterative). The
verification check would ensure that the flags and name are equal. A more in
depth check could be implemented by anybody who wanted to restrict to a single
implementation, etc.
Perhaps it would make more sense to follow the JCA pattern and have a factory
that allowed users to register provider libraries and then request by name,
flags, and optional provider.
The resulting implementation would therefore have to have methods to return the
provider, and flag values.
Something like:
{noformat}
public interface HashFunction extends ToLongBiFunction<byte[], Integer> {
String getName();
String getProvider();
boolean isIterative();
boolean isUnsigned();
}
{noformat}
The factory would work like the JCA factories in that the order of the provider
determines which implementation of a HashFunction is returned when multiple
providers provide the "same function".
> BloomFilter contribution
> ------------------------
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
> Issue Type: Task
> Reporter: Claude Warren
> Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java,
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated
> collections.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)