Re: [BloomFilters] changes to BloomFilter

Alex Herbert Wed, 22 Apr 2020 09:23:07 -0700

On Wed, 22 Apr 2020 at 13:56, Gary Gregory <garydgreg...@gmail.com> wrote:

> Hi All,
>
> I'd like to pat ourselves on the back for a smooth ride so far in bringing
> in new code for a new non-trivial feature :-)
>
> IMO, let's not get hung up on generics for the first cut of the Bloom
> Filter code. Why? Because generics are erased by the compiler and we can
> always add them later without break binary compatibility. Now, if adding
> generics later would require other changes or other changes would be
> desired with said generics which would end up breaking BC, then we should
> pause and consider why that is. Any thoughts on that?
>

Currently you add a Hasher to a BloomFilter. The BloomFilter is a set of n
bits. The Hasher provides indexes into the set given its size. The
BloomFilter can enable the bits based on a Hasher (merge) or state if
the bits are enabled for the Hasher (query). You can also merge and query
using a BloomFilter of the same shape.

A generic version would have the BloomFilter allow merge and query with
item T. This requires moving the control for conversion of an item into a
Hasher into the BloomFilter. This would require a restructure of the entire
BloomFilter Hasher paradigm so cannot be added later. An alternative is to
provide a higher level wrapper class that sits on top of a BloomFilter and
contains a mechanism to convert item T into a Hasher. This idea can be
built on top of the existing code.

So in summary I don't think we can convert what we currently have now to
generics. We can build on top of it a generics layer.

I've not checked the recent thread topics but I believe the present points
under discussion for change are:

- Switch to long indexes: Allows storing 64-times larger filters with a
basic array of long[]. Future future support could be much larger with
specialised structures.
- Switch the mechanism for providing the enabled bits of a BloomFilter.
This is to support serialisation and generic merge of two filters which may
maintain a different internal structure. Currently it uses a method to
return a long[] and also a frozen version of a Hasher. This could be
changed to use iterators allowing unlimited size. This is relevant if we
switch to long indexes and want to support more than Integer.MAX_VALUE * 64
bits (i.e. more that a long[] can possibly hold). Using a forEach type
pattern was also discussed. This was added to the CountingBloomFilter but
not the (super) BloomFilter.
- Revisit how a Hasher communicates to a BloomFilter that it is using the
underlying mechanism for generating indexes that all the other Hashers
added to the filter have used. Currently this uses a HashFunctionIdentity
which is based on a set of properties and also applying the HashFunction to
a string created from those properties. The system is fallible.
- Revisit how to convert any object to a Hasher. Currently this requires
conversion to a byte[] which is then used with a HashFunction to create
long output using a generator pattern. This is a topic where generics could
be used and how a generic framework can be created that allows a user to
specify the parts of object T that are important to add to the byte[] that
is used by the HashFunction.
- Possible support for dynamic HashFunctions that do not require a byte[]
and can do hashing on the fly.

Some of these topics require API changes. Thus my concern that the API is
still under development.

>
> WRT to 'fitting in' Commons Collection, I would like someone else to reply
> (please ;-)
> - Does the BF code implement some Commons Collection interfaces?
>
No

> - Does the BF code extend  some Commons Collection classes?
>
No

> - Does the BF code reuse Commons Collection utilities?
>
No

It is totally separate. It uses a hash function from Commons Codec (which
Claude kindly fixed as the MurmurHash3 function did not compute the correct
hash).

>
> Tangent: There were UML tools/Maven plugins in the past that could
> visualize these kinds of relationships, has anyone used any in the recent
> past? Anything FOSS we could use?
>

No need.

git grep commons.collections4
src/main/java/org/apache/commons/collections4/bloomfilter/

The only items that are imported are those below the bloomfilter package.

Answers to Claude's point below:

> On Wed, Apr 22, 2020 at 7:00 AM Claude Warren <cla...@xenei.com> wrote:
>
> > Bloom filters should not use generics.  That has been my stated opinion.
> > They are not like other collections in that you don't get out what you
> put
> > in.  They are collections of hashes so the idea that generics should be
> > used to somehow define what goes in is misleading.
>

That is fine. If we set out that is the architectural decision in the
package docs then it is clear what the package is for.

Currently we have a very tight coupling between the BloomFilter, Hasher,
Shape and HashFunctionIdentity classes. This leads me to think there should
be level above this that allows construction of a class containing all the
required components to function as a collection:

ObjectBloomFilter<T> filter = BloomFilters.create(...)

You would pass in specifications for the component parts providing a simple
use case API. Advanced usage can work with the low level components.

> >
> > If commons-collections is supposed to be all about putting things into a
> > collection and getting them back out then perhaps bloom filters do not
> > belong here.
>

Yet. Since we are in development and perhaps have a bit further to go to
get to a collection of items.

> >
> > The only major point of contention is should the package end up looking
> > like the Guice bloom filter package.  In my opinion the Guice package is
> > very restrictive.  It does not allows for different hashing
> > implementations, it forces a one to one correspondence between an Object
> > and the filter, this makes putting properties of an Object into the
> filter
> > as separate items difficult if not impossible, and it makes creating
> > partial filters for matching the same still more difficult.
>

Note you can always work around some issues with properties of objects by
having the BloomFilter typed to a control Object or super Object of things
you want to add. The conversion of the Object to indices can respond
appropriately based on the object type. But given we have not yet included
partial filters I don't know exactly how it would work.

> >
> > Any more complex usage of Bloom filters (e.g. in genomics) will be much
> > harder if not impossible.
> >
> > Going the Guice route also begs the question: Why not just use Guice?
>

I agree that Guava's implementation does one use case very well, but does
not allow configuring the hash function or other filter functionality. We
should not underestimate the number of people who want to do the same but
have a bit more control over the hash function. This is a use case we
should try to achieve in the end product.

> >
> > The intention of this contribution was a framework that allows the
> > developer to build Bloom filters that
> > a) met the requirements of the application.
> > b) were easy to share between applications.
> > c) could implement most strategies for Bloom filters.
> >
> > The Guice implementation is easy enough to construct with the framework
> as
> > defined in the current Commons Collections Bloom filter code.  And I have
> > no objection to providing a Simple Bloom filter implementation that does
> > that.  But doing so should not modify the framework in such a way as to
> > make other usage more difficult.
>

Thus we build on top of it. But we should maintain that the entire API is
fluid until we have that functionality in place too. If we expect and allow
changes to the API then the development tag is still relevant.

> >
> > There have been lots of good conversations and lots of improvements since
> > the code was contributed.  I have several open source projects that
> > utilized the original code and have been able to easily modify them to
> use
> > the Commons versions as development and improvements progressed.  I would
> > hope to continue to be able to do that as the code moves to a releasable
> > state.
>

It is helpful to know that the code has improved.

> >
> > As I said above, it may be that commons collections is not the place for
> > Bloom filters.  Perhaps they belong in codec or in its own project.  I
> > leave that to others to decide.
>

This is the crux. Currently we have a structure that collects bits and
provides merge and query based on bits. It uses nothing from the rest of
commons collections. But a second layer can be added that provides a
collections type API on top of the current code.

Alex

Re: [BloomFilters] changes to BloomFilter

Reply via email to