First I think that the hasher.getBits( Shape ) should be renamed to iterator( Shape ). It was poorly named at the start.
By definition a Hasher knows how many items it is going to insert. The Shape tells the hasher how many hash functions to apply to each item. The Shape number of items is how many items are expected to be in the final Bloom filter, it is more the expected value not a hard limit. Keeping in mind the possibility of hash collisions, I don't see a way to check that the Hasher has respected the number of functions. The static hasher for example will not return duplicates so it might appear that it has not respected the number of functions. In addition there is no indication from the hasher how many items it contains.. The inputs to the hash.builder are byte buffers that are fed to the hash algorithm. They are inputs to that algorithm. So primitive types would simply convert from the primitive type to its byte buffer representation. Is that what you meant? The hasher contract is that it will generate integers in the proper range and use the proper number of hash functions for each item that was added to the builder and that repeated calls to getBits(Shape) will return the same values. Did I misunderstand something? Claude On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <alex.d.herb...@gmail.com> wrote: > > On 16/03/2020 07:57, Claude Warren wrote: > > I made a quick pass at changing getHasher() to iterator(). > > A look at the feasibility or have you started work on this? If so then > I'll not start work on it as well. > > I changed master to return a boolean for the merge operations in > BloomFilter. So the outstanding changes are to drop getHasher() from the > BloomFilter interface in favour of an iterator, spliterator and a > forEachBit method. > > > I think we can get rid of HasherBloomFilter as its purpose was really to > > create a Bloom filter for temporary usage and it doesn't seem to be > > required if we have a hasher that can be created from a Shape and a > > function that creates an Iterator. > > I agree. > > One change that could be made is to clarify the contract between a > Hasher and a BloomFilter. At present the Hasher can operate without a > defined contract in this method: > > PrimitiveIterator.OfInt getBits(Shape shape) > > It should validate that it can generate indexes for the shape. But it > doesn't have to. It could return unlimited indexes and they could be > outside the number of bits of the BloomFilter. > > There does not appear to be any control anywhere on the number of hash > functions generated by the Hasher. I would expect this test in the > AbstractBloomFilterTest to pass: > > @Test > public void hasherMergeTest() { > int n = 1; > int m = 10; > HashFunctionIdentity h = new > HashFunctionIdentityImpl("provider", "name", > Signedness.SIGNED, ProcessType.CYCLIC, 0L); > Hasher hasher = new Hasher() { > @Override > public boolean isEmpty() { > return false; > } > @Override > public HashFunctionIdentity getHashFunctionIdentity() { > return h; > } > @Override > public OfInt getBits(Shape shape) { > // Do not respect the shape number of hash functions > but do respect > // the number of bits > return IntStream.range(0, m).iterator(); > } > }; > for (int k = 1; k < 5; k++) { > Shape shape = new Shape(h, n, m, k); > BloomFilter bf = createEmptyFilter(shape); > bf.merge(hasher); > assertEquals("incorrect cardinality", k, bf.cardinality()); > } > } > > It currently does not as all the BloomFilters will not respect the Shape > with which they were created, i.e. they disregard the number of hash > functions in the Shape. So does the Hasher. > > I think some of the control should be returned to the BloomFilter. The > Hasher would be reduced to a simple generator of data for the > BloomFilter, for example: > > PrimitiveIterator.OfInt getBits(int m); > PrimitiveIterator.OfInt getBits(int k, int m); > PrimitiveIterator.OfLong getBits(); > > The BloomFilter then accept responsibility for converting the primitives > to a suitable index and creating the correct number of hash functions > (i.e. indexes). > > A merge operation with a BloomFilter then becomes: > > - check the Hasher is using the correct hash function identity > - ask the Hasher for an iterator > - set k bits in the filter using the iterator, mapping each to the range > [0, m) > > The BloomFilter has then encapsulated its state and respects the Shape. > > The HashFuntion will convert byte[] to a long. > > The Hasher exists to convert anything to a byte[] format. > > This perhaps needs the Hasher.Builder to be revised to include more > methods that accept all the primitive data types. These are all > converted to a single byte[] representation. Thus the Hasher.Builder is > effectively a specification for a ByteBuffer. Once an object is > decomposed into the byte[] it can be fed through the HashFunction with > different seeds or using the cyclic method to create the iterator. The > BloomFilter consumes the raw long output from the stream produced by the > Hasher and sets k bits within the range m. > > Alex > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren