Hi. Le ven. 18 oct. 2019 à 17:55, Claude Warren <cla...@xenei.com> a écrit : > > I think the other discussion is getting a bit long so I thought we could > start this discussion here and see if we can close out the other discussion > with agreement on the remaining topics. > > The “Shape” of a bloom filter (excluding the hash algo) is defined > mathematically by > > Number of Items (AKA: n) n = ceil(m / (-k / log(1 - exp(log(p) / k)))) > > Probability of Collision (AKA: p) p = (1 – exp(-kn/m))^k > > Number of Bits (AKA: m) m = ceil((n * log(p)) / log(1 / pow(2, log(2)))) > > Number of Functions (AKA: k) k = round((m / n) * log(2)) > > Since the probability (p) is actually calculated directly from the other 3 > the shape is actually concretely defined by n, m, and k. > > This is all in the BloomFilterConfiguration class. > > The BloomFilterConfigurationClass has 4 constrcutors (n,p), (p,m,k), (n,m), > and (n,m,k). in all cases where is an "p" it is assumed to be a desired > value and values of the other missing properties are calculated. The actual > value for p is then recalculated from n, m and k. > > Now it makes sense that the hash algo should also be specified I am happy > to assume an enum here for the moment and also assume a mechanism for > resolving the enum to a concrete hash implementation. > > I think that you will find that BloomFilterConfiguration fits the bill but > perhaps should be renamed to BloomFilterShape. > > > > There is one other component that we have not addressed lately: The > ProtoBloomFilter. > > The ProtoBloomFilter addresses an issue that arises when there are filters > with different “shape” but same Hashing Algo. > > Since hashing is the expensive operation, we want to try to cache the hash > values and reuse them rather than recalculate them. > > The ProtoBloomFilter contains the results of the hashing in a format that > is easily used to create filters of the specified shape. > > The current implementation uses a circular method whereby a murmur128 hash > is generated for a parameter to a with() call (e.g. with(“foo”) generates a > murmur128 hash of “foo”) > the hash is split into 2 longs. > > Now when the shape has a “k” of: > 1 we return the first long. > 2 we add the second long to the first long and return that. > 3 we add the second long the the result above and return that. > > This has been shown to be as randomly distributed as calling > murmur128( “foo”, (seed1) ) > murmur128( “foo”, (seed2) ) > murmur128( “foo”, (seed3) ) > > and much faster. > > We can do this with any implementation that produces 128 bit hashes. > > Once we have the long value for the specific k we take the mod() of that > and the number of bits in the Bloom filter (m). For Bloom filters that have > small enough (m) we could use smaller integers and it would not have a > significant impact on the distribution of the bits. > > There is an issue of whether the long value is considered signed or > unsigned as this significantly impacts the mod(). This is why I suggested a > string like Murmur128-UC or Murmur128-SC (unsigned or signed). > > If instead the ProtoBloomFilter called the hash function directly then it > would be what I called an incremented call. So calling > > murmur128( “foo”, (seed1) ) > murmur128( “foo”, (seed2) ) > murmur128( “foo”, (seed3) ) > > inside the ProtoBloomFilter would yield a Murmur128-UI or Murmur128-SI hash > depending on whether the values were assumed to be signed or unsigned for > the mod() call. > > Anyway, we can rename the ProtoBloomFilter to something else, but it > effectively generates the required number of functions (k) for each with(x) > parameter. > > So you take the BloomFilterConfiguration (Shape) and the ProtoBloomFilter > (k provider) and you can build the BloomFilter. > > There are some problems here. Almost none of the nomenclature is common. So > we can define as we go. I think there is a need for a proper hashing naming > were we can determine how the hash was created from the name (like the with > encryption algo names and similar to the RandomSource code you pointed me > to), but I have found no standards here.
At the issue page https://issues.apache.org/jira/browse/COLLECTIONS-728 I've attached a design suggestion from what I gathered from this discussion. [It's incomplete and not tested, some names were changed (TBD), and the commented out part ("HashSource") will not compile until more code is added.] Prominent feature is the explicit namespace that groups in a single file all the boiler-plate functionality needed by implementations of the "BloomFilter" interface. [This is also in line with the layout of the top-level package of [Collections] that contains various interfaces, each with their implementations located in a specific sub-package.] Next step would be to create a ---CUT--- public class BitSetBloomFilter implements BloomFilter { private final BitSet state; // ... } ---CUT--- WDYT? Regards, Gilles > > Claude > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org