> On 17 Mar 2020, at 11:08, Claude Warren <cla...@xenei.com> wrote:
> 
> On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <alex.d.herb...@gmail.com 
> <mailto:alex.d.herb...@gmail.com>>
> wrote:
> 
>> 
>>> The Shape tells the hasher how many hash functions to apply to each item.
>> 
>> OK. This is my misunderstanding. There is a contract that the Hasher is
>> expected to fulfil but it is just not recorded in the javadoc. I can update
>> the docs to indicate that:
>> 
>> "A Hasher represents items of arbitrary byte size as a byte representation
>> of fixed size (a hash). The hash for each item is created using a hash
>> function; use of different seeds allows generation of different hashes for
>> the same item. The hashes can be dynamically converted into the bit index
>> representation used by a Bloom filter. The shape of the Bloom filter
>> defines the number of indexes per item and the range of the indexes. The
>> hasher's function is to generate the correct number of indexes in the range
>> required by the Bloom filter for each item it represents.
>> 
>> Note that the process of generating hashes and mapping them to a Bloom
>> filter shape may create duplicate indexes. The hasher may generate fewer
>> than the required number of hash functions per item if duplicates have been
>> removed."
>> 
>> Looks Good

Added to master.

> 
>>> The Shape number of items is how many items are expected to be in the
>> final
>>> Bloom filter; it is more an expected value than a hard limit.
>> 
>> Yes. As discussed before this is not actually required for a Bloom filter
>> to function, it is required to maintain the intended purpose of the filter
>> when it was constructed.
>> 
>> 
>> 
>>> 
>>> The static hasher for example will not return duplicates so it might
>> appear
>>> that it has not respected the number of functions.  In addition there is
>> no
>>> indication from the hasher how many items it contains..
>> 
>> Yes. So we state that the hasher represents one or more items.
>> 
>>> 
>>> The inputs to the hash.builder are byte buffers that are fed to the hash
>>> algorithm.  They are inputs to that algorithm.  So primitive types would
>>> simply convert from the primitive type to its byte buffer representation.
>>> Is that what you meant?
>> 
>> I was unclear on the purpose of the Hasher.Builder. It seemed incomplete.
>> If the builder is to add items then it seems strange to have:
>> 
>> with(byte property)
>> with(String property)
>> 
>> It also seems strange to throw 'IllegalStateException if the Hasher is
>> locked’ without explaining what this means. Is the builder intended to be
>> concurrent? What is ‘locked’? Etc.
>> 
> 
> Good catch. Documentation error.  Originally the Hasher.Builder was locked
> once it generated a Hasher; that behaviour was removed but the documentation
> was not cleaned up.

I’ve removed this from the Hasher.Builder interface.

> 
> 
>> The byte could not possibly represent many meaningful objects. The string
>> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher).
>> Both these methods could be added to the interface as default methods or
>> preferably dropped as they are so trivial.
>> 
>> I changed the documentation to remove the encoding as UTF-8 requirement
>> from the with(String) method. It seems like an implementation detail and a
>> Hasher.Builder implementation can decide how to convert the String. It is
>> faster to use UTF-16 bytes for instance. I understand UTF-8 is the
>> cross-platform standard. But mandating that it has to be done is too
>> restrictive IMO. It would be better as:
>> 
>> with(CharSequence, Charset)
>> withUnencoded(CharSequence)
>> 
>> 
> The Hasher.Builder is patterned after the cryptography MessageDigest
> pattern where a number of update() calls are made before a final digest()
> produces the hash.  Originally it was update() and build() but an earlier
> suggestion was to change the update() to with().
> 
> Adding with( byte[] buffer, int offset, int len ) makes sense.
> 
> with(String) is a convenience method; it was documented as UTF-8 so that
> when trying to build equivalent filters in a different language there
> was no need to look deep in the hasher code to figure out what to do.
> 
> with( byte ) is also a convenience method.
> 
> They are documented in Hasher.Builder to be common across all builders.
> Obviously different Builder implementations can add other functionality.
> 
> One could build a decorator that wraps a builder and adds the with(<T>)
> functionality.
> 
> The Builder interface is designed to be a balance between minimal
> implementation need to be functional and an implementation that can handle
> all arbitrary types.

OK. What I think is different from the MessageDigest pattern is that a 
MessageDigest is designed to receive byte data that it composes into a single 
digest representing the combined bytes.

Here the Hasher.Builder is designed to create a Hasher that represents one or 
more items. Each item is a set of byte data. So each method to add data is 
effectively functioning as a complete addition of a single item. This is how 
the implementation of the DynamicHasher works. Each call to add data will add a 
byte[] to the List<byte[]>. When you call iterator(Shape) on the resulting 
Hasher you will get an iterator that generates k indexes per byte[] buffer that 
was added.

Thus my point that adding a single byte is confusing. Not many entire objects 
would have a single byte representation!

It may be simpler to drop the methods with(String) and with(byte). There is the 
option to add:

with(byte[], int offset, int length)

However I am reluctant to do this as it would just delay the inevitable 
copy to a duplicate array of the correct length, given that the 
HashFunction.apply(byte[], int) method does not support a range. So you would 
have to extract the entire byte[] anyway.

To aid the process of adding byte[] to the Hasher.Builder we can provide a 
ByteBuilder that would function similarly to a StringBuilder, with append 
methods for all sorts of data. However I am wary that the JDK did not make 
ByteBuffer expandable as it impacts performance. A ByteBuilder would 
effectively be a duplicate of ByteBuffer but expandable. So perhaps we instead 
add this method:

with(ByteBuffer)

The method will extract the current contents of the ByteBuffer between the 
current position and the limit into a new byte[] and add it to the builder:

default Builder with(ByteBuffer byteBuffer) {
    final int remaining = byteBuffer.remaining();
    final byte[] byteArray = new byte[remaining];
    byteBuffer.get(byteArray);
    return with(byteArray);
}

Thus this is a bridge between hashing byte[] and the ability to hash anything, 
e.g.

            Hasher.Builder builder = ...;
            ByteBuffer bb = ByteBuffer.allocate(100);
            bb.putInt(42);
            bb.putDouble(123.45);
            bb.flip();
            builder.with(bb);
            bb.clear();
            bb.putInt(3);
            bb.putLong(78493577979L);
            bb.flip();
            Hasher hasher = builder.with(bb).build();

Given the design is fixed on having byte[] to pass entirely to a HashFunction 
then we cannot remove the need to create intermediate byte[] storage.

An alternative is to formalise the convertor idea with a typed function:

<T> Builder with(T item, Function<T, byte[]> convertor);

Or provide a Builder decorator (which you discuss later):

    class BuilderDecorator<T> implements Hasher.Builder {
        Function<T, byte[]> convertor;
        Hasher.Builder builder;

        BuilderDecorator(Function<T, byte[]> convertor, Hasher.Builder builder) {
            this.convertor = convertor;
            this.builder = builder;
        }

        @Override
        public Hasher build() {
            return builder.build();
        }

        @Override
        public BuilderDecorator<T> with(byte[] property) {
            builder.with(property);
            return this;
        }

        public BuilderDecorator<T> with(T item) {
            return with(convertor.apply(item));
        }
    }

This is for the future perhaps.

> 
> 
>> I was interpreting the Hasher.Builder as a builder of a single byte[] for
>> hashing where you would pass different primitive values or byte[] for the
>> same Object you want to convert. This is essentially a ByteBuffer. But if
>> it is to receive an entire object for each call then (a) it should be
>> documented as such; (b) it should be simplified to just the byte[] method
>> with perhaps another one/two:
>> 
>> with(byte[])
>> with(byte[], int length)
>> with(T)
>> with(ByteBuffer)
>> 
> 
>> Adding the T method would make the interface typed as Hasher.Builder<T>.
>> It would require a function to convert items T to a byte[]:
>> 
>> 
> I think the T and ByteBuffer implementations are better left to a decorator
> class.  But that is just my take.  Since the Bloom filter only supports a
> logical equivalent of equals putting numerical values into a Bloom filter
> is not that common.  My experience is that most values are Strings.

So we drop all methods from the Builder except with(byte[]) and change 
‘property’ to ‘item’ to clarify that the byte[] represents an entire item, 
not a property of a larger item.
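As a concrete sketch of that trimmed-down contract (the interface names and the toy ListBuilder below are assumptions for illustration, not the final commons-collections API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the minimal Builder contract discussed above: one with(byte[])
// call per complete item. Names and the toy ListBuilder are illustrative.
public class BuilderSketch {

    /** Stand-in for the real Hasher: only reports how many items it holds. */
    interface Hasher {
        int itemCount();
    }

    interface Builder {
        /** Adds the complete byte[] representation of a single item. */
        Builder with(byte[] item);

        /** Builds a Hasher over every item added so far. */
        Hasher build();
    }

    /** Toy implementation: one byte[] per item, as in the DynamicHasher. */
    static class ListBuilder implements Builder {
        private final List<byte[]> items = new ArrayList<>();

        @Override
        public Builder with(byte[] item) {
            items.add(item.clone());
            return this;
        }

        @Override
        public Hasher build() {
            final int count = items.size();
            return () -> count;
        }
    }

    public static void main(String[] args) {
        Builder builder = new ListBuilder();
        Hasher hasher = builder.with("foo".getBytes())
                               .with("bar".getBytes())
                               .build();
        System.out.println(hasher.itemCount()); // prints 2
    }
}
```

Each with(byte[]) call is then unambiguously a whole item, which in the real DynamicHasher becomes one byte[] entry generating k indexes per shape.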

> 
>> Collection<T> items = ...
>> BloomFilter bf = …
>> Function<T, byte[]> converter = …
>> HashFunction hf = ...
>> 
>> for (T item : items) {
>>    bf.merge(new DynamicHasher.Builder<>(hf,
>> converter).with(item).build());
>> }
>> 
>> Or:
>> 
>> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf,
>> converter);
>> for (T item : items) {
>>    builder.with(item);
>> }
>> bf.merge(builder.build());
>> 
>> I think the Hasher.Builder interface can be removed. It does not really
>> add anything to the API without a factory somewhere to be able to create
>> Hasher.Builder instances since each instance has no methods for reset:
>> 
> At one time Builder.build() performed a reset but that was removed by
> request. I would have no issues putting it back, or adding a reset method.

Add:

Builder reset();
Hasher buildAndReset();

Or force build() to do a reset as well? It is simplest to make build() mandate 
a reset(). Are there any situations where you would want to do incremental 
builds by adding more items to a builder and getting a bigger Hasher as a 
super-set of the previous Hasher?
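For illustration, a toy builder (not the real API; build() here returns an item count rather than a Hasher, purely to keep the sketch small) showing the implicit-reset semantics:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not the real API) of build() performing an implicit reset:
// the second build() sees only items added after the first build().
public class ResettingBuilder {
    private final List<byte[]> items = new ArrayList<>();

    public ResettingBuilder with(byte[] item) {
        items.add(item.clone());
        return this;
    }

    // Returns the number of items captured; the real build() would
    // return a Hasher over these byte[] buffers.
    public int build() {
        final int count = items.size();
        items.clear(); // implicit reset, as proposed
        return count;
    }

    public static void main(String[] args) {
        ResettingBuilder b = new ResettingBuilder();
        b.with(new byte[] {1}).with(new byte[] {2});
        System.out.println(b.build()); // 2
        b.with(new byte[] {3});
        System.out.println(b.build()); // 1, not 3: build() reset the builder
    }
}
```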

> 
> The Hasher.Builder interface defines a common interface across all
> builders.  So I can write my code to build Bloom filters and swap out the
> hasher implementation easily as long as I stick to the Hasher.Builder
> methods.
> 
>> Hasher h = factory.create().with(x).with(y).with(z).build();
>> 
>> If you do not have either a factory to create a Hasher.Builder or the
>> ability to reset a Builder then why have a Hasher.Builder interface?
>> Passing around just a single instance of the builder has limited use. I
>> would drop the interface and leave it to Hasher implementations to define
>> how they want to be constructed.
>> 
>>> 
>>> The hasher contract is that it will generate integers in the proper range
>>> and use the proper number of hash functions for each item that was added
>> to
>>> the builder and that repeated calls to getBits(Shape) will return the
>> same
>>> values.
>>> 
>>> Did I misunderstand something?
>> 
>> No, I did. Hence the need to clarify all the javadoc.
>> 
>> What I think we are missing with the Hasher is the simplicity of the Guava
>> implementation. What you ideally would like to do is:
>> 
>> Collection<T> items = ...
>> BloomFilter bf = …
>> 
>> for (T item : items) {
>>    bf.merge(item);
>> }
>> 
> This change significantly reduces the applicability of this implementation
> of Bloom filter across applications.  I am currently working on several
> applications where Hashers are sent across a network to query repositories.
> 
> By separating the Hasher from the Filter (separation of concerns) then I
> don't have to send large objects across the network to do the query.  Nor
> do I have to expose the properties I am querying for.  In addition the
> endpoints I am querying are free to resize the bloom filters they store as
> they see fit based on size of collection and other Shape based concerns.
> With the Guava implementation this is not possible.
> 
> Note: "properties" in the above paragraph are items placed into a single
> builder.

So as above, change ‘property’ to ‘item’.

> 
>> Currently you have to do something like:
>> 
>> Collection<T> items = ...
>> BloomFilter bf = …
>> Function<T, Hasher> itemToHasher = …
>> 
>> for (T item : items) {
>>    bf.merge(itemToHasher.apply(item));
>> }
>> 
>> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as
>> above.
>> 
> 
> Yes, and simple to construct from the components in the bloomfilter
> packages.  In addition, I think you have made the assumption that T
> contains a single value to be added, in which case a Function<T,byte[]>
> would suffice and, assuming bf is not a CountingBloom filter, the above can
> be written as:
> 
> Hasher.Builder builder = new ....
> Function<T,byte[]> f = new ...
> items.iterator().forEachRemaining( t -> builder.with( f.apply(t) ) );
> bf.merge( builder.build() );
> 
> Additionally if T is an object with multiple elements that are to be added
> to the filter then
> 
> Hasher.Builder nativeBuilder = new ....
> DecoratorT builder = new DecoratorT( nativeBuilder ) // adds the with(T) method
> items.iterator().forEachRemaining( t -> builder.with( t ) );
> bf.merge( builder.build() );
> 
> there is no difference in result between
> -
> loop {
> bf.merge( new Builder().with(x).build() )
> }
> 
> -- and --
> 
> new Builder()
> loop {
> builder.with( x )
> }
> bf.merge( builder.build() );
> 
> 
>> It would also be possible for the itemToHasher to recycle byte[] space by
>> returning the same Hasher object that has been reset and filled with the
>> next item. This would not be thread-safe but would save on intermediate
>> storage.
>> 
>> All of this still fixes on having a byte[] representation to feed to a
>> HashFunction. Moving away from the current design would change HashFunction
>> to specify methods to accept blocks of data and have a final method to get
>> the hash. So making the HashFunction an online hash. This would then match
>> Guava by having some object accept items T and require a function to
>> map the item T to blocks of data.
>> 
>> However I note that the current implementation that accepts a byte[] for
>> each call to get a hash value with a different seed can either use the
>> byte[] or not (if cyclic). If the HashFunction was online then the choice
>> of cyclic or iterative would not easily be possible. The Guava
>> implementation creates a single hash and then the BloomFilter always uses
>> this with a cyclic method. So the move away from the current design would
>> be less flexible to allow different implementations of hashing.
>> 
>> So we keep the byte[] interface to HashFunction for now. A performance
>> test can be used to determine if there is an advantage to an advanced
>> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be
>> added to the HashFunction to indicate that the same bytes passed with the
>> same seed should create the same output. The same bytes with a different
>> seed should create different output with very high probability. A seed of
>> zero is used as a reset signal, telling implementations that cache
>> computation results that the byte[] input is different from the previous
>> call.
>> 
>> 
>> The last thing is that Hasher.isEmpty() is not used anywhere except
>> the unit tests. It seems strange to have it. Can we just assume a Hasher
>> is not empty? An empty hasher would return an iterator that does nothing.
>> 
> Yes, I guess we don't need isEmpty(). I think it was originally used in
> BloomFilter.merge() in a guard statement to test if the merge actually
> needed to attempt to do anything.

I have removed it.
> 
> 
>> In summary:
>> 
>> 1. change Hasher getBits to iterator
>> 
> agree

Done.

> 
>> 2. improve documentation of Hasher and the contract that it should fulfil
>> with respect to items and a Shape
>> 
> absolutely

I’ve updated the Javadoc as above.

> 
>> 3. potentially drop Hasher.Builder unless there is a way to reset the
>> Builder or create more
>> 
> add the reset or have build() implicitly do a reset.

I think we mandate that the Builder be reusable and that build() forces a 
reset. I also think dropping all the methods except with(byte[]) makes it very 
clear that the builder is to accept complete items on each call.

> 
> 4. Or make Hasher.Builder typed to an object <T> so it is clear the with(…)
>> methods are to accept a full representation of an item and add it to the
>> in-progress Hasher currently being built
>> 
> disagree.

Fine. We shall have a raw API for now and possibly decorators for objects 
hashing later.

> 
> 5. Improve HashFunction javadoc on the use of the seed as a reset signal
> 
> agree

Here there are more options. If we document that apply(byte[], int) is called 
with a seed of zero to initialise and then a non-zero seed to create new hashes 
for the same byte[], why not split the interface into two methods:

long hash(byte[]);
long increment(int seed);

We already have the ProcessType property for the HashFunction that allows the 
function to specify if the entire byte[] is used each time or if the initial 
hash is recycled when creating new values from a non-zero seed. From what I can 
see in the implementations the cyclic functions would just ignore the seed in 
the increment(int) method. The iterative functions use the seed.
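As a sketch of how the split might look (the interface names are taken from the discussion above, but both stand-in implementations are illustrative only; the real cyclic and iterative functions use proper hash algorithms such as Murmur):

```java
import java.util.Arrays;

// Sketch of the proposed two-method HashFunction. The mixing used in both
// implementations is a stand-in for illustration, not the real algorithms.
public class SplitHashSketch {

    interface HashFunction {
        /** Called once per item with the full byte[]. */
        long hash(byte[] bytes);

        /** Called k-1 times with seeds 1..k-1 to generate further hashes. */
        long increment(int seed);
    }

    /** CYCLIC style: remixes the previous value; the seed is ignored. */
    static class Cyclic implements HashFunction {
        private long last;

        @Override
        public long hash(byte[] bytes) {
            // Stand-in initial hash, taken unsigned as discussed above.
            last = Arrays.hashCode(bytes) & 0xFFFFFFFFL;
            return last;
        }

        @Override
        public long increment(int seed) {
            // Remix the previous value; the seed is not used.
            last = Long.rotateLeft(last, 17) + 0x9E3779B97F4A7C15L;
            return last;
        }
    }

    /** ITERATIVE style: rehashes the cached bytes mixed with each seed. */
    static class Iterative implements HashFunction {
        private byte[] bytes;

        @Override
        public long hash(byte[] bytes) {
            this.bytes = bytes; // cache for subsequent increment() calls
            return Arrays.hashCode(bytes) & 0xFFFFFFFFL;
        }

        @Override
        public long increment(int seed) {
            // Re-apply the stand-in hash mixed with the seed.
            return (Arrays.hashCode(bytes) * 31L + seed) & 0xFFFFFFFFL;
        }
    }
}
```

The two-method form makes the calling sequence explicit: hash(byte[]) once per item, then increment(seed) for the remaining k-1 values, so implementations no longer need the zero-seed reset convention.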

I tweaked the ObjectsHashIterative HashFunction. But it seems to me that it 
could be further improved if we can assume that apply(byte[], int) is called 
with the same byte[] when the seed is non-zero. If this is true then the 
Arrays.hashCode(byte[]) result can be cached for reuse. The function should 
also be made UNSIGNED as the 32-bit result can be converted to an unsigned long.

Splitting the HashFunction interface into two methods makes it clear that an 
implementation will be called multiple times with the same byte[]. This is part 
of the design of the Hasher and so should be formalised into the design of the 
HashFunction.

WDYT?

> 
>> 6. Drop Hasher.isEmpty()
>> 
> ambivalent.

Dropped.

> 
>> 
>> That should clarify the current functionality.
>> 
>> Thoughts on this?
>> 
>> Alex
>> 
>> 
>>> 
>>> Claude
>>> 
>>> 
>>> On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <alex.d.herb...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> On 16/03/2020 07:57, Claude Warren wrote:
>>>>> I made a quick pass at changing getHasher() to iterator().
>>>> 
>>>> Was that a look at the feasibility, or have you started work on this? If
>>>> so then I'll not start work on it as well.
>>>> 
>>>> I changed master to return a boolean for the merge operations in
>>>> BloomFilter. So the outstanding changes are to drop getHasher() from the
>>>> BloomFilter interface in favour of an iterator, spliterator and a
>>>> forEachBit method.
>>>> 
>>>>> I think we can get rid of HasherBloomFilter as its purpose was really
>> to
>>>>> create a Bloom filter for temporary usage and it doesn't seem to be
>>>>> required if we have a hasher that can be created from a Shape and a
>>>>> function that creates an Iterator.
>>>> 
>>>> I agree.
>>>> 
>>>> One change that could be made is to clarify the contract between a
>>>> Hasher and a BloomFilter. At present the Hasher can operate without a
>>>> defined contract in this method:
>>>> 
>>>> PrimitiveIterator.OfInt getBits(Shape shape)
>>>> 
>>>> It should validate that it can generate indexes for the shape. But it
>>>> doesn't have to. It could return unlimited indexes and they could be
>>>> outside the number of bits of the BloomFilter.
>>>> 
>>>> There does not appear to be any control anywhere on the number of hash
>>>> functions generated by the Hasher. I would expect this test in the
>>>> AbstractBloomFilterTest to pass:
>>>> 
>>>>    @Test
>>>>    public void hasherMergeTest() {
>>>>        int n = 1;
>>>>        int m = 10;
>>>>        HashFunctionIdentity h = new
>>>> HashFunctionIdentityImpl("provider", "name",
>>>>            Signedness.SIGNED, ProcessType.CYCLIC, 0L);
>>>>        Hasher hasher = new Hasher() {
>>>>            @Override
>>>>            public boolean isEmpty() {
>>>>                return false;
>>>>            }
>>>>            @Override
>>>>            public HashFunctionIdentity getHashFunctionIdentity() {
>>>>                return h;
>>>>            }
>>>>            @Override
>>>>            public OfInt getBits(Shape shape) {
>>>>                // Do not respect the shape number of hash functions
>>>>                // but do respect the number of bits
>>>>                return IntStream.range(0, m).iterator();
>>>>            }
>>>>        };
>>>>        for (int k = 1; k < 5; k++) {
>>>>            Shape shape = new Shape(h, n, m, k);
>>>>            BloomFilter bf = createEmptyFilter(shape);
>>>>            bf.merge(hasher);
>>>>            assertEquals("incorrect cardinality", k, bf.cardinality());
>>>>        }
>>>>    }
>>>> 
>>>> It currently does not as all the BloomFilters will not respect the Shape
>>>> with which they were created, i.e. they disregard the number of hash
>>>> functions in the Shape. So does the Hasher.
>>>> 
>>>> I think some of the control should be returned to the BloomFilter. The
>>>> Hasher would be reduced to a simple generator of data for the
>>>> BloomFilter, for example:
>>>> 
>>>>    PrimitiveIterator.OfInt getBits(int m);
>>>>    PrimitiveIterator.OfInt getBits(int k, int m);
>>>>    PrimitiveIterator.OfLong getBits();
>>>> 
>>>> The BloomFilter then accept responsibility for converting the primitives
>>>> to a suitable index and creating the correct number of hash functions
>>>> (i.e. indexes).
>>>> 
>>>> A merge operation with a BloomFilter then becomes:
>>>> 
>>>> - check the Hasher is using the correct hash function identity
>>>> - ask the Hasher for an iterator
>>>> - set k bits in the filter using the iterator, mapping each to the range
>>>> [0, m)
>>>> 
>>>> The BloomFilter has then encapsulated its state and respects the Shape.
>>>> 
>>>> The HashFunction will convert byte[] to a long.
>>>> 
>>>> The Hasher exists to convert anything to a byte[] format.
>>>> 
>>>> This perhaps needs the Hasher.Builder to be revised to include more
>>>> methods that accept all the primitive data types. These are all
>>>> converted to a single byte[] representation. Thus the Hasher.Builder is
>>>> effectively a specification for a ByteBuffer. Once an object is
>>>> decomposed into the byte[] it can be fed through the HashFunction with
>>>> different seeds or using the cyclic method to create the iterator. The
>>>> BloomFilter consumes the raw long output from the stream produced by the
>>>> Hasher and sets k bits within the range m.
>>>> 
>>>> Alex
>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>> 
>>>> 
>>> 
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>> 
>> 
>> 
>> 
> 
