search is a matching operation against contained filters ... I don't think we want to call it match as that is then confused with the standard Bloom filter match. locate is an exact match -- byte for byte the same. Technically search results can be filtered for locate but in some implementations locate can be faster than filtering search.
On Sat, Oct 12, 2024 at 1:21 PM Gary Gregory <garydgreg...@gmail.com> wrote: > Thank you for explaining multidimensional filters here. It's not clear to > me yet if we need to introduce the new interface. I'll have to reread the > great docs you wrote 😀 > > The search() vs. locate() names are a bit confusing. Are these BF naming > conventions? If not, and if the difference is the exactitude of the match, > then using "exact" in the method name might be better. For example, Java > uses this naming in > > https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/Math.html#addExact(long,long) > > An alternative would be to define a MatchType enum with 2 values EXACT and > ... whatever the other name should be. I find enums easier to read than a > boolean but that's also a possibility. Then you'd have a single search > method with a MatchType argument (or a boolean). Just a thought. > > TY! > Gary > > On Sat, Oct 12, 2024, 5:07 AM Claude Warren <cla...@xenei.com> wrote: > > > > ... > > > We can delay MultidimensionalBloomFilter unless introducing it later > > would > > > break binary compatibility. How beneficial would introducing this > > interface > > > for users? > > > > > > Gary > > > > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren <cla...@xenei.com> wrote: > > > > > > > This is starting tondelve into the realm of multidimensional bloom > > > filters. > > > ... > > > > We could define an interface MultidimensionalBloomFilter that could > be > > > > implemented by any Bloom filter that comprises multiple filters and > > givr > > > it > > > > the method flatten(). But this is a can of worms I would like to > keep > > > > sealed for a bit longer. > > > > > > > > Claude > > > > > > > I have been meaning to answer this question for a week now but time > always > > seems to get away and it is a rather long discussion. > > > > Multidimensional Bloom filters are basically any collection of Bloom > > filters. However they have very different capabilities and usages. I > wrote > > an introduction to Multidimensional Bloom filters ( > > > > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > > ) > > but I can’t seem to find it online. Perhaps it does not get published to > > the site, though the site descriptions are out of date. > > > > I identify two categories of Bloom filters: Gatekeeper and > > Representational. > > > > Gatekeeper is the standard use where the key for an item is hashed and > > placed in a Bloom filter and when an expensive lookup is required the > Bloom > > filter is consulted first to see if the lookup should be performed. > > > > Representational filters encode properties of objects and are then used > to > > find objects with the same properties. This use case is found in > genomics, > > encrypted database searching, and some proposals for performing pattern > > matching in the presence of wildcards. In the representational case the > > Bloom filter is associated with a data record of some sort. > > > > Multidimensional Bloom filters can contain either category. > > > > As a couple of examples: > > > > Example 1: Tracking ageing records on disk > > > > A streaming data system tracks active records by storing them on disk for > > an hour. Every 5 minutes it flushes any record that is older than 1 > hour. A > > layered Bloom filter (a type of Multidimensional Bloom filter) is > > constructed and a new layer added whenever one becomes full (the max > number > > of items have been added) or 5 minutes has elapsed (therefore they must > > have timestamps associated with them). Bloom filters are removed whenever > > their timestamp expires. > > > > When a record is to be added the layered Bloom filter is checked to see > if > > the object is on the disk. If not, it is added. If so, the disk is > searched > > for the record. Note, that if each Bloom filter in the layered Bloom > filter > > is associated with a separate file then identifying the layers that match > > the data identifies which files to search further reducing the search > time. > > In this case the layers are gatekeeper filters. > > > > Example 2: locating potential pattern matches in the presence of > wildcards > > > > A system of patterns is stored where each pattern may have a wildcard > > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, > and > > digits 0-9. When a pattern is added to the system every pair of > characters > > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > > the pattern. The Bloom filter is then placed in a multidimensional Bloom > > filter. > > > > Queries are then textual strings for which potential matches are located. > > So the query is hashed in the same way and each Bloom filter in the > > multidimensional Bloom filter is checked to see if it matches. If it does > > then the associated pattern is checked to see if it matches. > > > > So the string “how now brown cow” would yield a Bloom filter that was the > > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > > “co” thus it would match the Bloom filter for the phrase. This is similar > > to how some genomic searching works. > > > > Example 3: Locating objects with specific properties > > > > In this example assume a number of objects with common properties names. > We > > construct a Bloom filter for each object. The Bloom filter comprises the > > hashes of the values of the properties (potentially with different seeds > to > > differentiate between the property names). We write the objects into a > data > > storage and associate the storage with their Bloom filters. The Bloom > > filter is then placed into a multidimensional Bloom filter. > > > > If we want to find all the objects where X=five and Y=blue and > Z=yesterday > > we simply generate the hashes and create the Bloom filter and then search > > the multidimensional Bloom filter for matches. This is the solution found > > in searching encrypted data, where the clear text values are hashed and > the > > encrypted data written to storage. > > > > > > So we have multidimensional Bloom filters that store gatekeeper type > > filters (example 1) and others that store representational (example 3). > > Example 2 seems to fall somewhere on the continuum between the two. > > > > In almost all cases you need to be able to search the filter for matching > > filters. In some cases you need to flatten the representation down to a > > Gatekeeper style. The multidimensional Bloom filter documentation noted > > above discusses several implementations of multidimensional Bloom > filters. > > > > In defining a MultidimensionalBloomFilter interface we would have to > > consider the various uses. It seems that either we assume the filters > being > > added to the multidimensional filter are WrappedBloomFilters so that they > > contain the additional data, or when they are added the user may specify > an > > associated Object. Perhaps we define a RepresentationalBloomFilter<T> > that > > implements BloomFilter and stores a <T> object. Then this can be added to > > the multidimensional filter. > > > > In any case the multidimensional filter needs to be able to search for > > matching filters that are stored within and return some sort of index. > > > > I have seen implementations of multidimensional Bloom filters that act > like > > a database into which a Bloom filter is stored. When a filter is stored > an > > integer “index” is returned. The developer then uses that to store the > > object in a database or similar repository. When searching, zero or more > > matching indexes are returned. Deletion requires that the index be > passed. > > > > So we perhaps could implement a multidimensional filter that looks like > > this: > > > > interface MultidimensionalBloomFilter { > > > > int add(BloomFilter filter); > > > > int[] search(BloomFilter filter); > > > > int[] locate(BloomFilter filter); > > > > boolean delete(int idx); // return true if there was a change > > > > BloomFilter flatten(); > > > > } > > > > The difference between the search() and locate() methods is that search() > > does the Bloom filter matching while locate only returns exact matches. > I > > don't know if this is necessary but it may make some housekeeping much > > easier. > > > > Once the filters enter the multidimensional Bloom filter they are no > longer > > visible to the outside, that is this interface does not provide a > mechanism > > to return the Bloom filter for an index – I don’t know if that is > > necessary. > > > > I am reluctant to add data to the Bloom filters within the > > MultidimensionalBloomFilter because many of the implementations will > > probably have to store the filter representation on disk and serializing > an > > associated arbitrary data object seems a bit much and outside the scope. > > > > I hope this clarifies the problem space and will bring clarity to any > > follow on discussions. > > > > > > Claude > > > -- LinkedIn: http://www.linkedin.com/in/claudewarren