On 16/03/2020 07:57, Claude Warren wrote:
I made a quick pass at changing getHasher() to iterator().

A look at the feasibility or have you started work on this? If so then I'll not start work on it as well.

I changed master to return a boolean for the merge operations in BloomFilter. So the outstanding changes are to drop getHasher() from the BloomFilter interface in favour of an iterator, spliterator and a forEachBit method.

I think we can get rid of HasherBloomFilter as its purpose was really to
create a Bloom filter for temporary usage and it doesn't seem to be
required if we have a hasher that can be created from a Shape and a
function that creates an Iterator.

I agree.

One change that could be made is to clarify the contract between a Hasher and a BloomFilter. At present the Hasher can operate without a defined contract in this method:

PrimitiveIterator.OfInt getBits(Shape shape)

It should validate that it can generate indexes for the shape. But it doesn't have to. It could return unlimited indexes and they could be outside the number of bits of the BloomFilter.

There does not appear to be any control anywhere on the number of hash functions generated by the Hasher. I would expect this test in the AbstractBloomFilterTest to pass:

    @Test
    public void hasherMergeTest() {
        int n = 1;
        int m = 10;
        HashFunctionIdentity h = new HashFunctionIdentityImpl("provider", "name",
            Signedness.SIGNED, ProcessType.CYCLIC, 0L);
        Hasher hasher = new Hasher() {
            @Override
            public boolean isEmpty() {
                return false;
            }
            @Override
            public HashFunctionIdentity getHashFunctionIdentity() {
                return h;
            }
            @Override
            public OfInt getBits(Shape shape) {
                // Do not respect the shape number of hash functions but do respect
                // the number of bits
                return IntStream.range(0, m).iterator();
            }
        };
        for (int k = 1; k < 5; k++) {
            Shape shape = new Shape(h, n, m, k);
            BloomFilter bf = createEmptyFilter(shape);
            bf.merge(hasher);
            assertEquals("incorrect cardinality", k, bf.cardinality());
        }
    }

It currently does not as all the BloomFilters will not respect the Shape with which they were created, i.e. they disregard the number of hash functions in the Shape. So does the Hasher.

I think some of the control should be returned to the BloomFilter. The Hasher would be reduced to a simple generator of data for the BloomFilter, for example:

    PrimitiveIterator.OfInt getBits(int m);
    PrimitiveIterator.OfInt getBits(int k, int m);
    PrimitiveIterator.OfLong getBits();

The BloomFilter then accept responsibility for converting the primitives to a suitable index and creating the correct number of hash functions (i.e. indexes).

A merge operation with a BloomFilter then becomes:

- check the Hasher is using the correct hash function identity
- ask the Hasher for an iterator
- set k bits in the filter using the iterator, mapping each to the range [0, m)

The BloomFilter has then encapsulated its state and respects the Shape.

The HashFuntion will convert byte[] to a long.

The Hasher exists to convert anything to a byte[] format.

This perhaps needs the Hasher.Builder to be revised to include more methods that accept all the primitive data types. These are all converted to a single byte[] representation. Thus the Hasher.Builder is effectively a specification for a ByteBuffer. Once an object is decomposed into the byte[] it can be fed through the HashFunction with different seeds or using the cyclic method to create the iterator. The BloomFilter consumes the raw long output from the stream produced by the Hasher and sets k bits within the range m.

Alex



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to