Re: [collections] BloomFilter package architecture discussion

Gilles Sadowski Sat, 19 Oct 2019 08:48:27 -0700

Hello.

2019-10-19 16:20 UTC+02:00, Claude Warren <cla...@xenei.com>:
> On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <gillese...@gmail.com>
> wrote:
>
>> Hi.
>>
>> >>> [...]
>> > >
>> > > Maybe I was not clear enough: I'm not saying that we should prefer
>> > > some representation (of the state) over another; only that the how
>> > > the state is represented externally need not be the same as the
>> internal
>> > > representation.
>> > >
>> > > But if the state is fully encapsulated by a "BitSet", then internal
>> > > and
>> > > external can be the same, making easier both for developers and for
>> > > users.  So let's have indeed a "getBitSet()" that returns a copy of
>> > > the
>> > > (private final) instance field (of type "BitSet") used to represent
>> > > the
>> > > state of a "BloomFilter".
>> > >
>> > >
>> > I agree with most of this statement. However, I think the use of BitSet
>> for
>> > the internal representation is questionable.
>>
>> Hence, that settles the issue of "class" versus "interface":
>> "BloomFilter"
>> must be an interface.
>>
>> *I am still not convinced. How do you then provide consistent
>> implementations of the distance functions? We could use “default”
>> implementations but that seems like a misuse of the intent for “default”,
>> though perhaps I misunderstand that.*


What's the issue with having "distance(BloomFilter other)"
in the interface?

>>
>>
>> [snip]
>
>> > In addition, there are researchers that do not use
>> > BitSet because it is too slow or too bloated.
>>
>> Important point.
>> [I thought that "BitSet" was evoked because it perfectly fitted the
>> bill.]
>>
>
> BitSet fits the bill in terms of required data read functionality.

Yes; but not in terms of performance for certain use cases.
Hence why it cannot be a common denominator for all
implementations.

>
> [snip]
>
>> > We can use the
>> > Implementation in our classes and use the interface where possible but
>> > it
>> > leaves open the possibility of switching in a version of RoaringBitmap
>> > or
>> > similar without changing the external API. It is this interface that is
>> the
>> > “external reference implementation of state”. I think I would like to
>> call
>> > this interface BitSetI and will refer to it this way later. I think it
>> > should contain the following methods as defined in BitSet: and(),
>> andNot(),
>> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
>> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(),
>> xor().
>>
>> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
>> we must discuss separately whether [Collections] wants to host such an
>> API.
>>
>
> I think a core question needs to be answered: Is a Bloom filter an
> extension of functionality on a BitSet (isA) or does it simply use the
> functionality of a BitSet (hasA).

AFAICT, neither.

> If find myself of two minds on this
> quesiton at the start, but seem to come down on the isA side after analysis
> of how the Bloom filter is used. To that end BloomFilter should provide the
> data read methods of the BitSet functionality. What I called BitSetI
> before.

"data read" and "BitSetI" are quite different.
As said before, the former is implementation-dependent and a
user's business (as long as the developers provide a suitable
accessor of course).

>>
>> A priori, how the classes that implements the "BloomFilter" interface
>> represents the "set of bits" is an implementation detail.  Even that
>> there
>> is a "set of bits" is a detail to *users* of the interface.
>> Of course, a practical question is whether the representation should be
>> "pluggable".  If not, we could change to using one or another from those
>> classes that you mentioned (and change it from version to version).
>> In any case, I think that this would be a feature to be left for later
>> (and
>> that could be added in a compatible way, if nothing of the "state" is
>> leaked
>> into the public API).
>>
>
> To my mind the “set of bits” interface is what I called BitSetI. Again,
> this defines how the Bloom filter presents itself as a “set of bits” not
> how it implements it.

IMO, the "BitSetI"/"set of bits" is an implementation *detail*.
IOW: a "BloomFilter" is *not* a "set of bits" but *can* be represented
as one.

Under the hood, it of course needs some kind of "set of bits" to perform
its operations (i.e. those methods that defines the public API), but that
certainly does not mean that an application developer would be well
advised to use a "BloomFilter" in order to represent, say, a number
(just because a number could also be viewed as a "set of bits")...

>
> [snip]
>
>> > The implementation may not want to use the BitSet class because the
>> filter
>> > itself can be large. For example in DNA matching it is not uncommon for
>> the
>> > Bloom filter to be 100s of MB in size. This also means we don’t want to
>> > make copies if we don’t have to. (perhaps I should add DNA matching to
>> the
>> > Use cases to capture the large number of bits).
>>
>> So this use case would entail that the "BloomFilter" interface must *not*
>> mandate a "toBitSet()" method.
>>
>
> Agreed.
>
>
>> > > [...]
>> > >
>> >
>> > I agree with most of this statement. I still don’t like forcing a
>> standard
>> > BitSet implementation onto all instances of Bloom filter
>>
>> +1
>>
>> > when it is the
>> > logical construct (BitSetI) that we are looking for.
>>
>> -1
>> [IIUC; see my comments above.]
>>
>
> Somehow I think we are in violent agreement here, but perhaps are talking
> past each other some how.

Yes, and no.  We are indeed converging, but I more and more
convinced that the concept of "BloomFilter" should not be based
on how we think serialization should work or how to practically
implement the API.

We must start from the "outside view" of the API, i.e. define what
services the "BloomFilter" must provide.

>> > >
>> > > > There are several other examples such as layered Bloom
>> > > > filters (
>> > > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
>> > > > and attenuated Bloom filters (
>> > > > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
>> > > where
>> > > > the internal structure deviates from the BitSet model but that
>> > > > still
>> need
>> > > > to present themselves as AbstractBloomFilters do. I think that
>> Abstract
>> > > > class is the correct pattern here.
>> > >
>> > > Not sure: "abstract" is adequate when several subclasses need to
>> > > override behaviours; whereas, here the behaviour is the same but
>> > > the state is expanded.
>> > >
>> > >
>> > I think abstract because the methods to perform the calculations that
>> > are
>> > currently in BloomFilterFunctions would be implementable against
>> > BitSetI.
>> > Those are the functions that go into the abstract class (along with
>> BitSetI
>> > methods).
>>
>> Ditto: these are details from the POV of "BloomFilter" users; of course,
>> we need code that will do the work on the whatever represents the "sets
>> of bits", but at first sight I don't think that we need to advertise a
>> "BitSetI"
>> (even a "protected" one) in an abstract class, because it's unlikely that
>> any and all existing external implementations of "set of bits" will
>> comply
>> with an interface defined here.
>>
> Reading through various implementations of Bloom filters as well as
> papers
> describeing new types of Bloom filters and ways to index them and ways to
> extract data from them there are a set of low level bit operations (xor,
> or, cardinality, and, cardinality of the results of xor, or and and) that
> if supported make all that work possible. I will create a separate document
> that outlines what those are so that we can discuss them in detail.

You are certainly more aware than I am of all the "services" that
could be necessary in order to care for all the interesting use cases.
But how about focusing on the *minimal* API first?
What I suggest is to start from what you initially presented as the
"BloomFilter" interface ("match", "merge", etc.), and make that
work based on the layout ideas contained the preliminary Java code
posted on JIRA (issue COLLECTIONS-728).

>
>  [snip]
>
>>
>> > I don’t know of any such usage, however, nesting Bloom filters is done
>> > in
>> > Attenuated filters and Layered Bloom filters use a number of Bloom
>> > filter
>> > implementations internally.
>>
>> What do those use cases require from the "BloomFilter" interface?
>>
>
> What those implementations require are the standard match() and merge() as
> well as some love level bit operations (see aboce). I will add this
> analysis to the separate document I committed to above.

As per my preceding comment, I think that we should make the
basics as simple as simple, and show that it delivers the intended
functionality.
Then, more advanced use cases might well require addition to the
API, but it will be clear for everyone why the are needed (based on
said use-cases).
In effect, maybe some of the additional requirements could be
defined in an *extension* of the "BloomFilter" interface; a
"MutableBloomFilter" interface naturally comes to mind.
>From what you said above, there might also be
---CUT---
interface BloomFilterBitSetOperations extends BloomFilter {
    BloomFilterBitSetOperations and(BloomFilterBitSetOperations other);
    BloomFilterBitSetOperations xor(BloomFilterBitSetOperations other);
}
---CUT---

Some other functionalities (mostly the recurrent issue of serialization)
might be readily solved at the level of specific implementations; e.g.
the "BitSetBloomFilter" which I referred to before will certainly
provide "getBitSet()" and "fromBitSet(BitSet)" methods that are
sufficient for moving instances around.

>> > > > [...]
>> > >
>> > Number of bits is not implicitly conveyed by getBitSet() as the BitSet
>> will
>> > only grow to the size needed to capture the highest bit currently on.
>> > So
>> > the number of bits may be 72 but the highest on bit might be 50.
>>
>> OK; doesn't matter now since a common "BitSet" state has been
>> ruled out.
>> I get that "numberOfBits()" is one of the methods defined by
>> the "BloomFilter" interface.
>>
>> >
>> > Hash, number of functions (k), number of items (n), and number of bits
>> (m)
>> > define the system in which the filter is expressed, call this the
>> “Shape”.
>> > It is not always necessary to convey the Shape when conveying the bloom
>> > filter.
>>
>> Well, if it is *ever* necessary, then "BloomFilter" must declare a
>> "shape()"
>> method.  Hence, it could be worth defining a "Shape" subinterface (?).
>>
>>
> See other discussion thread on shape.

I've used it to arrive to the "BloomFilter.java" proposal attached
on JIRA.

>
> [snip]
>
>> >
>> > > I'm lost here; I don't know what is "the" reference state, since a
>> > > single structure cannot for example represent both a standard
>> > > filter and a counting filter...
>> > >
>> > >
>> > Think BitSetI.
>>
>> Still: No.
>> IMHO. ;-)
>>
> I meant that the bit set functions defined by BitSetI define the state of
> a standard Bloom filter. Since all Bloom filter types are and
> implementation of the concept of Bloom filter, all of them must be able to
> provide the data requested from the BitSetI functionality.

I get the rationale, but I'm not able to see that in the textual
description of use cases.

Please note that I wish to avoid "preemptive" bloating of the API,
which, after the fact, is nearly impossible to fix in "Commons"
codes.
[Among other problems, this ever increasing size of the codebase,
despite the identified issues, is what led to "Commons Math"
unfortunate fate: specific case in point in the matrix functionality
there, whose bloated interface discouraged potential contributions
targeting at 3D geometry.]

>> IIUC, for a Bloom filter,
>>     "state" = "shape" + "set of bits"
>>
>> *See separate shape discussion.*
>
>
>> > > >
>> > > >> > In addition, the abstract Bloom filter might implement the “data
>> > > >> > read” methods from BitSet directly. This would allow more
>> efficient
>> > > >> > implementation of several methods that currently depend upon the
>> > > >> > creation
>> > > >> > of a BitSet and the manipulation of same.
>> > > >>
>> > > >> Sure; if any and all variants of a "BloomFilter" implementation
>> > > >> can
>> > > >> be created from a "BitSet", then such a factory method could be
>> > > >> part of the "Builder" interface:
>> > > >>
>> > > >> public interface BloomFilter {
>> > > >>     // Core methods...
>> > > >>
>> > > >>     public interface Builder {
>> > > >>         public Builder from(BitSet s); // Factory method.
>>
>> As above, for large filters, this probably is not a good requirement.
>> Should it be
>>    public Builder from(BooleanStream s);
>> ?
>> ["BooleanStream" is not provided by the JDK...]
>>
>
> BooleanStream is not but IntStream is, in this case we define the integer
> stream to be the indexes of the bits that are on in the filter.  I have
> seen this in some implementations.

Maybe not the most efficient way to convey the state if more
than 1/32 of the bits are set.

>> > > >>         public BloomFilter build();
>> > > >>         // Other builder methods.
>> > > >>     }
>> > > >> }
>> > > >>
>>
>
> [snip]
>
>
>> > One final issue. There was recently a post on the pull request asking
>> > why
>> > the Bloom filter is an immutable object. We should probably explore why
>> or
>> > why not.
>>
>> Part of the answer is universal: because it is trivial to ensure a
>> consistent
>> state (even in the face of concurrent usage).
>> The other part of is the justification, which you gave on GitHub, based
>> on
>> expected usage.
>>
>> > The biggest argument against is that the merge() method would requires
>> that
>> > the data be duplicated and it makes it hard to do in place updates. For
>> > example when updating a “bucket” filter (as defined in the Use cases).
>>
>> Now that "BloomFilter" is going to be a interface, some of its
>> implementations
>> could be mutable (e.g. if it would be a requirement for its target
>> usage).
>>
>> However, arguments about performance should be backed with benchmarks...
>>
>> In general, I think that code will be more robust if low-level classes
>> are
>> immutable; and higher-level is responsible for managing instances (i.e.
>> replacing them if outdated).
>>
>> > But if it is not immutable then the hashCode value changes making it
>> > unsuitable for Hash sets. I think that there are classes in Commons
>> > that
>> > have similar restrictions, and I think that documenting the issue in
>> > the
>> > class javadoc as well as the hashCode() and equals() (assuming we
>> actually
>> > override them) would be acceptable.
>>
>> Are there use cases where Bloom filters would be stored in a "HashSet"?
>>
>>
> I had a use case in an index implementation but it can be worked around.
> Having the filter be mutable seem to be a common implementation. I am
> leaning in this direction now because immutability makes some usage more
> compex without a strong benefit.

I'd tend to disagree, especially if the simple/basic implementations
have some uses in multi-threaded scenarios.
In particular, the copying of small objects will probably be more efficient
than mutating them.

> But I can go either way.

With "BloomFilter" as an interface, we can go both ways: just add a
new implementation when a use-cases justifies it (through benchmarks).

Best regards,
Gilles

>
> All the best,
> Claude
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [collections] BloomFilter package architecture discussion

Reply via email to