Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Claude Warren Wed, 16 Oct 2024 06:53:48 -0700

search is a matching operation against contained filters ... I don't think
we want to call it match as that is then confused with the standard Bloom
filter match.
locate is an exact match -- byte for byte the same.  Technically search
results can be filtered for locate but in some implementations locate can
be faster than filtering search.


On Sat, Oct 12, 2024 at 1:21 PM Gary Gregory <garydgreg...@gmail.com> wrote:

> Thank you for explaining multidimensional filters here. It's not clear to
> me yet if we need to introduce the new interface. I'll have to reread the
> great docs you wrote 😀
>
> The search() vs. locate() names are a bit confusing. Are these BF naming
> conventions? If not, and if the difference is the exactitude of the match,
> then using "exact" in the method name might be better. For example, Java
> uses this naming in
>
> https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/Math.html#addExact(long,long)
>
> An alternative would be to define a MatchType enum with 2 values EXACT and
> ... whatever the other name should be. I find enums easier to read than a
> boolean but that's also a possibility. Then you'd have a single search
> method with a MatchType argument (or a boolean). Just a thought.
>
> TY!
> Gary
>
> On Sat, Oct 12, 2024, 5:07 AM Claude Warren <cla...@xenei.com> wrote:
>
> > > ...
> > > We can delay MultidimensionalBloomFilter unless introducing it later
> > would
> > > break binary compatibility. How beneficial would introducing this
> > interface
> > > for users?
> > >
> > > Gary
> > >
> > >
> > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren <cla...@xenei.com> wrote:
> > >
> > > > This is starting tondelve into the realm of multidimensional bloom
> > > filters.
> > > ...
> > > > We could define an interface MultidimensionalBloomFilter that could
> be
> > > > implemented by any Bloom filter that comprises multiple filters and
> > givr
> > > it
> > > > the method flatten().  But this is a can of worms I would like to
> keep
> > > > sealed for a bit longer.
> > > >
> > > > Claude
> > >
> >
> > I have been meaning to answer this question for a week now but time
> always
> > seems to get away and it is a rather long discussion.
> >
> > Multidimensional Bloom filters are basically any collection of Bloom
> > filters. However they have very different capabilities and usages. I
> wrote
> > an introduction to Multidimensional Bloom filters (
> >
> >
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> > )
> > but I can’t seem to find it online. Perhaps it does not get published to
> > the site, though the site descriptions are out of date.
> >
> > I identify two categories of Bloom filters: Gatekeeper and
> > Representational.
> >
> > Gatekeeper is the standard use where the key for an item is hashed and
> > placed in a Bloom filter and when an expensive lookup is required the
> Bloom
> > filter is consulted first to see if the lookup should be performed.
> >
> > Representational filters encode properties of objects and are then used
> to
> > find objects with the same properties. This use case is found in
> genomics,
> > encrypted database searching, and some proposals for performing pattern
> > matching in the presence of wildcards. In the representational case the
> > Bloom filter is associated with a data record of some sort.
> >
> > Multidimensional Bloom filters can contain either category.
> >
> > As a couple of examples:
> >
> > Example 1: Tracking ageing records on disk
> >
> > A streaming data system tracks active records by storing them on disk for
> > an hour. Every 5 minutes it flushes any record that is older than 1
> hour. A
> > layered Bloom filter (a type of Multidimensional Bloom filter) is
> > constructed and a new layer added whenever one becomes full (the max
> number
> > of items have been added) or 5 minutes has elapsed (therefore they must
> > have timestamps associated with them). Bloom filters are removed whenever
> > their timestamp expires.
> >
> > When a record is to be added the layered Bloom filter is checked to see
> if
> > the object is on the disk. If not, it is added. If so, the disk is
> searched
> > for the record. Note, that if each Bloom filter in the layered Bloom
> filter
> > is associated with a separate file then identifying the layers that match
> > the data identifies which files to search further reducing the search
> time.
> > In this case the layers are gatekeeper filters.
> >
> > Example 2: locating potential pattern matches in the presence of
> wildcards
> >
> > A system of patterns is stored where each pattern may have a wildcard
> > matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> > patterns are restricted to upper case Latin characters A-Z, space, ‘*’,
> and
> > digits 0-9. When a pattern is added to the system every pair of
> characters
> > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> > the pattern. The Bloom filter is then placed in a multidimensional Bloom
> > filter.
> >
> > Queries are then textual strings for which potential matches are located.
> > So the query is hashed in the same way and each Bloom filter in the
> > multidimensional Bloom filter is checked to see if it matches. If it does
> > then the associated pattern is checked to see if it matches.
> >
> > So the string “how now brown cow” would yield a Bloom filter that was the
> > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> > “co” thus it would match the Bloom filter for the phrase. This is similar
> > to how some genomic searching works.
> >
> > Example 3: Locating objects with specific properties
> >
> > In this example assume a number of objects with common properties names.
> We
> > construct a Bloom filter for each object. The Bloom filter comprises the
> > hashes of the values of the properties (potentially with different seeds
> to
> > differentiate between the property names). We write the objects into a
> data
> > storage and associate the storage with their Bloom filters. The Bloom
> > filter is then placed into a multidimensional Bloom filter.
> >
> > If we want to find all the objects where X=five and Y=blue and
> Z=yesterday
> > we simply generate the hashes and create the Bloom filter and then search
> > the multidimensional Bloom filter for matches. This is the solution found
> > in searching encrypted data, where the clear text values are hashed and
> the
> > encrypted data written to storage.
> >
> >
> > So we have multidimensional Bloom filters that store gatekeeper type
> > filters (example 1) and others that store representational (example 3).
> > Example 2 seems to fall somewhere on the continuum between the two.
> >
> > In almost all cases you need to be able to search the filter for matching
> > filters. In some cases you need to flatten the representation down to a
> > Gatekeeper style. The multidimensional Bloom filter documentation noted
> > above discusses several implementations of multidimensional Bloom
> filters.
> >
> > In defining a MultidimensionalBloomFilter interface we would have to
> > consider the various uses. It seems that either we assume the filters
> being
> > added to the multidimensional filter are WrappedBloomFilters so that they
> > contain the additional data, or when they are added the user may specify
> an
> > associated Object. Perhaps we define a RepresentationalBloomFilter<T>
> that
> > implements BloomFilter and stores a <T> object. Then this can be added to
> > the multidimensional filter.
> >
> > In any case the multidimensional filter needs to be able to search for
> > matching filters that are stored within and return some sort of index.
> >
> > I have seen implementations of multidimensional Bloom filters that act
> like
> > a database into which a Bloom filter is stored.  When a filter is stored
> an
> > integer “index” is returned. The developer then uses that to store the
> > object in a database or similar repository. When searching, zero or more
> > matching indexes are returned. Deletion requires that the index be
> passed.
> >
> > So we perhaps could implement a multidimensional filter that looks like
> > this:
> >
> > interface MultidimensionalBloomFilter {
> >
> > int add(BloomFilter filter);
> >
> > int[] search(BloomFilter filter);
> >
> > int[] locate(BloomFilter filter);
> >
> > boolean delete(int idx); // return true if there was a change
> >
> > BloomFilter flatten();
> >
> > }
> >
> > The difference between the search() and locate() methods is that search()
> > does the Bloom filter matching while locate only returns exact matches.
> I
> > don't know if this is necessary but it may make some housekeeping much
> > easier.
> >
> > Once the filters enter the multidimensional Bloom filter they are no
> longer
> > visible to the outside, that is this interface does not provide a
> mechanism
> > to return the Bloom filter for an index – I don’t know if that is
> > necessary.
> >
> > I am reluctant to add data to the Bloom filters within the
> > MultidimensionalBloomFilter because many of the implementations will
> > probably have to store the filter representation on disk and serializing
> an
> > associated arbitrary data object seems a bit much and outside the scope.
> >
> > I hope this clarifies the problem space and will bring clarity to any
> > follow on discussions.
> >
> >
> > Claude
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Reply via email to