Re: [collections] BloomFilter package architecture discussion

Gilles Sadowski Mon, 14 Oct 2019 17:46:59 -0700

Hello.

Le lun. 14 oct. 2019 à 14:12, Claude Warren <cla...@xenei.com> a écrit :
>
> Greetings,
>
>
> I feel like I have been beating my head against a wall trying to get this
> contribution accepted;


In my experience, a "wall" would be a plain refusal (or no answer
and nobody applying the patch).  We are far from that.
I'm trying to get to a point where I'd feel confident to apply it.  [And
hopefully, seasoned contributors to [Collections] should chime in
before that.]

> this is not a complaint about process or
> personalities, just a statement of how I feel. I realize that I have made
> progress but I also realize that there is a long way to go.

Maybe, maybe not.  But IMHO, it takes longer because there is
too much code submitted at once; and it's not easy to review
because I don't know the concept or its applications.

> Furthermore,
> the other potential users and supporters have not responded to any
> communication about this proposal so I am floundering on that front too.

Who are they?

> What I want to do is take a step back. Let’s remove the collection packages
> from consideration and work out the base package first.

+1

> With that in mind I
> would like to present reasoning behind the classes as they now stand and
> suggest a way forward to get the core classes in before adding the
> collections.

For sure, I appreciate your willingness to provide more explanations. :-)

> As discussed earlier Bloom filters are a concept. In most peoples mind’s
> they are bit vectors with specific named functions (e.g. match => A and B =
> A, hammingValue => cardinality(this)). This idea of the filter as a bit
> vector led me to using a BitSet. Originally the BloomFilter was
> serializable but valid arguments against that and toward making the
> internal representation visible enought to serialize that lead to making
> the BitSet public.

Maybe I'm missing your point, but I don't see a direct connection
between persistence of a "BloomFillter" instance and the visibility
of "BitSet" (if that is used to represent its internal state).

> In addition to the normal Bloom filter methods
> (hammingValue, match, etc.) there are some methods (log, cosine distance,
> jaccard distance, etc.) that are used by researchers and other attempting
> to do some work with indexing and such for large collections of filters.

IIUC, you indeed distinguish between core functionality (i.e. "normal
Bloom filter methods") and additional support for (?)...
If so, "core" should indeed be implemented first, with maximum
encapsulation (even if it would seem to prevent "research" usage).

> Finally there is the question of “normal” usage. In his paper on
> “Compressed Bloom Filters”[1] (an implementation we do not have yet)
> Michael Mitzenmacher states “the Bloom filter plays a dual role. It is both
> a data structure being used at the proxies, and a message being passed
> between them”. He is discussing the issue of using bloom filters in a
> distributed cache where the proxies distributed their bloom filter so that
> all the proxies know which proxy might have the document. So the Bloom
> filter needs both an internal representation and a mechanism to pass the
> data between systems that use it. It seems obvious that the BitSet works as
> the mechanism to pass the data between systems, though an immutable version
> would be preferred.

So it seems we can have a Bloom filter state represented by a
"private" and "final" instance of "BitSet".  Is there any reason that
the implementation in [Collections] would ever need to use several
representations of its state?
If not, I'd think that "BloomFilter" can be a concrete class (with, for
now, a "BitSet" as its state).

> So the Bloom filter needs an external reference implementation of its
> state, so that it can communicate said state with other Bloom filters. The
> quesiton of what is necessary for transmission of that state is still open.
> I think we can agree a mechanism to identify which bits are on is requried
> as part of the state. I have seen discussions that indicate the hashing
> algorithm and other configuration information should also be included, but
> I believe this is dependent upon where the filter is being sent. All Bloom
> filters within a system, by necessity, have to use the same hash and
> configuration parameters.

The last statement is not obvious to me, but it surely looks like
a simplifying assumption.
What are examples of "system" in that sentence?

> Only when sending outside the system does the
> other data need to be made available. In my mind this means those data
> structures have to be availabe to be transmitted along with the Bloom
> filter or filters.

Even if so, this is about serialization/persistence; thus, not a "core"
issue.
If the above "system" is the application level, all "BloomFilter" instances
should be able to pass messages between them using instance methods.

> So now to the code in the current contribution. The argument that
> BloomFilter should be a class not an interface is a good one. However, as
> the internal representation of the bit structure can vary wildly,

Why should it?

> I think
> it should be an abstract class. My reasoning is as follows: Assuming we
> have an external reference implementation of state many/all of the methods
> could be implemented against that definition.

I don't follow: OO is about behaviour(s), "state" being an implementation
detail.

> This is what the current
> implementation attempts to do, but not successfully. Any implementation of
> the abstract class could develop an internal representation of state that
> is more efficient for whatever the use case is (e.g. counting Bloom
> filters, compressed Bloom filters, attenuated Bloom filters, etc)

Are these variants impossible to represent with a "BitSet"?

> and
> override the methods that are more efficiently answered within its internal
> structure.

Unless I'm missing the point, this then seems to mandate an
interface (rather than an abstract or concrete class), contrary
to what was said previously...

> In addition, the abstract Bloom filter might implement the “data
> read” methods from BitSet directly. This would allow more efficient
> implementation of several methods that currently depend upon the creation
> of a BitSet and the manipulation of same.

Sure; if any and all variants of a "BloomFilter" implementation can
be created from a "BitSet", then such a factory method could be
part of the "Builder" interface:

public interface BloomFilter {
    // Core methods...

    public interface Builder {
        public Builder from(BitSet s); // Factory method.
        public BloomFilter build();
        // Other builder methods.
    }
}

>
>
> I think this change yields the following:
>
>
> 1. Creation of an external reference implementation of state.
>
> 2. Conversion of BloomFilter for interface to AbstractBloomFilter class
>
> 3. Merge BloomFilterFunctions into AbstractBloomFilter
>
> 4. Rename StandardBloomFilter to BloomFilter (should we do this?)
>
> 5. Modify (Standard)BloomFilter and CountingBloomFilter to extends the
> AbstractBloomFilter class.

It seems that somehow we diverge on the emphasis given to the
"state".  I think that the questions which I raised above should be
answered before jumping to those conclusions.

> On the topic of an external reference implementation of state, could this
> be an interface that AbstractBloomFilter implements? Might it comprise the
> “read data” methods from the BitSet class?

IIUC, that will trivially be the case if "BitSet" can be the exchange
format. And how to actually perform serialization is a user's issue
(i.e. whether to rely on "BitSet" being "Serializable", or to implement
a proxy...).

Regards,
Gilles

>
> [1] https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [collections] BloomFilter package architecture discussion

Reply via email to