[
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964753#comment-16964753
]
Claude Warren commented on COLLECTIONS-728:
-------------------------------------------
=== From the conversation above ===
Why doesn't Shape keep a reference to Hasher? That would avoid having to pass
both to e.g. method contains in BloomFilter (and would make is unnecessary to
check their consistency).
(see the response to how are StaticHasher and DynamicHasher different for some
extra background).
The contains method with the Hasher was requested because one of the other
Apache projects was interested in checking the presence of the Bloom filter
without building a complete Bloom filter. They have the hasher and they know
the shape it should be. So the method actually needs the Hasher for the data.
Now, they can build the DynamicHasher and pass that, or they could build a
StaticHasher (which does know it's shape), or they could implement another
Hasher type.
Placing the Hasher in the Shape would mean that the shape will always return
the same data when what we are doing is trying to send different data with the
same shape.
Still it doesn't look right, API-wise; but again I'm lacking a clear view of
all the requirements to provide an alternative.
=== end of quote ===
The methods that have both the Hasher and the Shape are really passing
decomposed Bloom filters. The Hasher contains the hash values and the Shape
identifies shape of the decomposed Bloom filter. True the Hasher could be
limited to a StaticHasher and the Shape would not be needed as the Static
Hahser contains the Shape it was built with, but in the general case a Hasher
does not have a Shape.
The reason for the separation is the Hasher applies a hash function to data and
returns int values.
The shape indicates to the Hasher how many times it should apply the hashing
algorithm (k in the maths) to each data item being hashes (item = the X in
DynamicHasher.with(X)). The Shape also give the upper limit for the values the
hasher returns on the particular getBits() call. The upper limit is "m" in the
maths.
So Hasher + Shape = an iteration of integers that are the bits turned on the
Bloom filter.
A single hasher can generated Bloom filters of an almost infinite number of
Shapes.
--- snip --
=== start of quote ===
>From KISS and DRY POVs, I disagree with both statements: The factory could be
>immutable and readily provide all the useful algorithms (see e.g.
>RandomSource).
=== end of quote ===
There are two issues with the idea that we can provide all the usefull
algorithms.
1) We will always be behind when it comes to implementing new algorithms using
new hash functions. It is better to allow the users in the field to add
algorithms as they see fit. I remember an instructor one said the the only
valid limits in computer science are zero, one and many. Providing an
implementation with some but not allowing for addition of others is very
limiting.
2) There are known cases (Cassandra comes to mind) where an implementation of a
Hashing algorithm was incorrectly executed and released into the wild. If such
an application wanted to use this library they would be unable to add their
"broken" hashing to the factory would be an inhibitor.
So why not provide a factory that " provide all the useful algorithms" and has
the convenience methods to add hash functions?
Once we add hash functions it seems a short step to resetting the factory to
default. At a minimum this would be needed for testing.
> BloomFilter contribution
> ------------------------
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
> Issue Type: Task
> Reporter: Claude Warren
> Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java,
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated
> collections.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)