> On 23 Mar 2020, at 10:13, Alex Herbert <alex.d.herb...@gmail.com> wrote:
> 
> 
> 
>> On 22 Mar 2020, at 17:44, Claude Warren <cla...@xenei.com> wrote:
>> 
>> Sorry it has taken me so long to respond, but I spent a long time thinking
>> about this and started a response several times.
>> 
>> I see no reason not to go to long bit indexes.  This would be reflected in
>> Shape.getNumberOfBits returning a long as well as other places.
>> 
>> I would like to revisit some definitions and use cases before delving into
>> the suggestions.
>> Definitions
>> 
>> - Bloom Filter
>>   - A set of bits and functions to query bit patterns.
>>   - Responsible for:
>>     - Management of storage of the set of bits.
>>     - Producing a standard format external representation of the set of bits.
>>     - Importing a standard format external representation of the set of bits.
>>     - Performing specific bit queries on the set of bits:
>>       - Cardinality (number of bits enabled) of the filter.
>>       - Cardinality of a logical “and” with another filter.
>>       - Cardinality of a logical “or” with another filter.
>>       - Cardinality of a logical “xor” with another filter.
>>       - Determining if this filter “contains” another filter (this “and” that == that).
> 
> Initially it is not apparent from the javadoc why you need all the 
> cardinality tests. We should add the primary purpose of these:
> 
> - AND cardinality can be used to determine if one filter is within another 
> filter when combined with the cardinality of the query filter. Is there 
> another purpose to it? 
> - OR cardinality will tell you how many bits would be taken after a merge 
> (union) of filters.
> - XOR cardinality will tell you how different the filters are.
> 
> Are any of these operations essential to have in BloomFilter? They can all be 
> implemented just using the long[] getBits(). Note that when the filter's 
> number of bits is changed from int to long, getBits() will have to become a 
> primitive iterator (OfLong) that returns the filter's bits 64 at a time, 
> starting with bits [0, 63].
> 
> Removing the combined cardinality tests reduces clutter in the main interface 
> and removes the requirement for a filter to implement them. It just 
> implements an efficient iterator of bits. These operations get moved to the 
> SetOperations class.
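> 
> For example, these could be computed in SetOperations from just the packed 
> 64-bit words of each filter (a sketch only; both filters are assumed to have 
> the same shape and the method signatures are illustrative):
> 
>     static int andCardinality(long[] a, long[] b) {
>         int count = 0;
>         int limit = Math.min(a.length, b.length);
>         for (int i = 0; i < limit; i++) {
>             count += Long.bitCount(a[i] & b[i]);
>         }
>         return count;
>     }
> 
>     static int orCardinality(long[] a, long[] b) {
>         // Words beyond the shorter array still contribute (x | 0 == x);
>         // XOR cardinality is the same pattern with '^' in place of '|'.
>         long[] small = a.length <= b.length ? a : b;
>         long[] big = a.length <= b.length ? b : a;
>         int count = 0;
>         for (int i = 0; i < small.length; i++) {
>             count += Long.bitCount(small[i] | big[i]);
>         }
>         for (int i = small.length; i < big.length; i++) {
>             count += Long.bitCount(big[i]);
>         }
>         return count;
>     }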
> 
>> 
>>     - Merging another filter into this one. This is not technically
>>       required but was a design decision. The alternative is for the filter
>>       to be immutable and to create a new filter from the merging of two
>>       filters.
>> 
>> - Shape
>>   - The definition of a Bloom filter. It includes:
>>     - The maximum number of bits in the filter. (aka: m)
>>     - The expected number of items encoded in the filter. (aka: n)
>>     - The number of hashes generated for each item in the filter. (aka: t)
>>   - Responsible for providing the information necessary to identify when
>>     two filters are built with the same strategy.
> 
> And as previously discussed, the number of items is redundant other than to 
> provide the original intention when the shape was constructed.
> 
>> 
>> - Hasher
>>   - Converts one or more items into a Bloom filter based upon a Shape.
>>   - Is repeatable (i.e. calling it multiple times with the same shape
>>     produces the same output).
>>   - Responsible for:
>>     - Tracking the number of items being hashed.
> 
> Which currently is not in the API.
> 
>>     - Generating the number of hash values specified by Shape for each item.
> 
> Which breaks BloomFilter encapsulation.
> 
>> 
>> 
>> Types of usage
>> 
>> Representation of a single simple object
>> 
>> This is probably the most common usage pattern if measured by number of
>> constructor calls. In this use case a single object is converted to a
>> single item in a Bloom filter. A common use case would be to determine if a
>> URL has been seen. The filter is constructed by hashing the URL “t” times,
>> each hash value converted to the range [0, m) and the associated bits
>> enabled in the Bloom filter.
> 
> OK.
> 
>> 
>> The single object filter is commonly used to test the presence of the
>> object in a collection of filters or to register the presence of the object
>> within the collection.
> 
> 
>> Representation of a single complex object
>> 
>> In this usage the Bloom filter represents a single object with more than
>> one property, as contrasted to the simple object where a single property is
>> represented. In this case the number of items in the Shape actually refers to
>> the number of properties for a single complex object times the number of
>> complex objects. So a collection of 1000 complex objects with 3 properties
>> each yields 3000 items in the Shape.
>> 
>> There are several hashing strategies used to build complex objects. The
>> simplest is to hash each property as though it were the property in a
>> single object. This works well if the individual properties being indexed
>> do not have any collisions across the representations. A more complex
>> solution is to seed the hash function with different values based upon the
>> property being indexed.
> 
> I cannot understand why you would do this. If I have a person with a first 
> name and last name field and I want to store 50,000 person objects you are 
> saying that I construct a Shape to store 100,000 items as I will generate k 
> indexes for first name and last name separately. This makes no sense to me. I 
> want to store 50,000 items. An item is something that can be hashed. So 
> combine the first and last name to a single set of bytes to hash and only 
> produce k indexes, not 2k indexes.
> 
> This adds a whole layer of complexity to constructing the Shape that you have 
> not put in your current API. I.e. when I construct a Shape I need not only 
> the number of items I am to store but the number of properties per item.
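> 
> For example, something along these lines (the Person type and its getters 
> are purely illustrative):
> 
>     // Combine the properties of one item into a single byte[] so that each
>     // Person is hashed as ONE item producing k indexes, not k per property.
>     // The separator avoids collisions such as "AB" + "C" vs "A" + "BC".
>     Function<Person, byte[]> personToBytes = person ->
>             (person.getFirstName() + '\u0000' + person.getLastName())
>                     .getBytes(StandardCharsets.UTF_8);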
> 
> 
>> Representation of a single complete complex object
>> 
>> This use case is similar to the single complex object in that multiple
>> properties are represented as individual items in the resulting Bloom
>> filter. The difference is that the number of expected items, as defined by
>> the shape, is the number of properties in the object. Thus each object
>> produces a fully populated Bloom filter. This strategy is used when a
>> compact searchable representation of the object is required. This type of
>> filter is generally added to a Multidimensional filter (see below).
> 
> So if using BloomFilter<T> then T would be a property in a single object. 
> That is fine. Effectively you have BloomFilter<Object> representing a fast 
> look-up for all the properties in an item. Here the design I suggested is 
> less suitable as the IndexGenerator<Object> would have to handle all the 
> object types for all the properties. I think in the end though the amount of 
> code you have to write to work with it would be the same. Since you are not 
> storing items in a filter but all the properties of one item you construct 
> one by passing all the different properties to the filter merge(T) method. 
> You then query it using a property you are interested in. The code to 
> populate and query it would be similar with either the current Hasher design 
> or a typed and encapsulated filter, as you are mapping properties of many 
> different types to a single form to be put into the filter (probably a 
> byte[]).
> 
> 
>> Representation of a partial complex object
>> 
>> In this case the representation is built using the same method as the
>> single complex object but not all properties are specified. This is
>> commonly used to locate objects in a collection of complex object Bloom
>> filters.
> 
> This is not a case. It is just a sub-set of the other case.
> 
>> Representation of multiple objects
>> 
>> In this case multiple single-object Bloom filters are merged together to
>> create one Bloom filter representing all the objects. This is probably the
>> most common usage for filters that are persisted across application runs.
>> This implementation is simply the merged bit patterns of all the Bloom filters
>> being added. It is commonly used in applications like Hadoop to determine
>> which nodes might contain a specific file.
> 
> The basic use case. A filter contains many items.
> 
>> Multidimensional representation Bloom filters
>> 
>> This is a use case that extends the ability to store excess items (>”n”)
>> into a Bloom filter without significantly modifying the probability of
>> false positives. To do this efficiently the Multidimensional filter has to
>> know the Shape of the filters being inserted. In the most efficient cases
>> it requires the hasher so that it can build filters of different but
>> associated shapes. In addition, this type of filter can return IDs
>> associated with the various filters, so that it in effect becomes an index
>> of objects associated with Bloom filters.
>> 
>> This type of filter is used in systems that have many (>1000) locations to
>> check for a Bloom filter match as they can perform the match much faster
>> than the linear search alternative.
> 
> IIUC this does not fit the current design either. A MultiDimensional filter 
> has more than one Shape?
> 
> Can you send a code example or a link to an article on this? Basically more 
> explanation on how this works. I found this:
> 
> https://github.com/lemire/bloofi
> 
> Bloofi = Bloom Filter Index
> 
> Which creates a special data structure that stores/collates filters. Each 
> filter is of type BloomFilter<T>. These all appear to have the same shape and 
> store the same things.
> 
> So this indexes many BloomFilters of the same shape.
> 
> What are you proposing that is beyond this? Are you suggesting that each 
> BloomFilter<T> is storing properties of one or more items, and you would like 
> to create a query for property x and return a list of all the BloomFilters 
> that may contain an item that has that property? 
> 
> Can you post an interface API for a MultiDimensionalBloomFilter?
> 
> 
> 
>> Suggestions
>> 
>> Move the hashing into the Bloom filter
>> 
>> Originally the Hasher was called a ProtoBloomFilter. There was a request
>> for a name change early on. Once a Hasher is created it can be used to
>> create any concrete Bloom filter. Currently this is done with the
>> getBits(Shape) method. Perhaps what is needed is a rework in this section.
>> 
>> If the Hasher had a method to return a LongSupplier for each item then the
>> Bloom filter constructor could be in control of the number of hash
>> functions that were called for each item.
>> 
>> So assume Hasher has 2 new methods:
>> 
>>     int items();                           // the number of items
>>     LongSupplier supplier(int itemNumber); // hash values for one item
>> 
>> Then the BloomFilter(Hasher, Shape) constructor could call:
>> 
>>     for (int i = 0; i < hasher.items(); i++) {
>>         LongSupplier sup = hasher.supplier(i);
>>         for (int t = 0; t < shape.getNumberOfHashFunctions(); t++) {
>>             // map each raw hash value into [0, m) and enable that bit
>>             enableBit(Math.floorMod(sup.getAsLong(), shape.getNumberOfBits()));
>>         }
>>     }
> 
> In the common use case of searching for 1 item this extra use of a loop over 
> 1 item seems a waste of time.
> 
> 
>> Remove the requirement for a full length byte[]
>> 
>> The original Hash.Factory was a simple implementation to allow multiple
>> items in the Hash.
>> 
>> We could create a DataSink interface along the lines you are talking about.
>> Then have Hash.Factory extend that so that all those base data types can be
>> used as items in their own right, and add a method with(DataSink) that would
>> take all the items in the DataSink and combine them into a single item for
>> the Hasher.
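>> 
>> For example, something like this (a sketch only; the method names are not
>> final):
>> 
>>     public interface DataSink {
>>         DataSink with(byte[] property);
>>         DataSink with(CharSequence property, Charset charset);
>>         DataSink with(long property);
>>     }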
> 
> The Hasher has to be the DataSink implementation. Have a look at Guava 
> BloomFilter<E> and see how it works. You specify what data from your item is 
> important by passing properties to the DataSink. This is actually a Hasher 
> that creates the hash code.
> 
>> Create an add(T)/contains(T) method pair or Make a simple BloomFilter<T>
>> 
>> First, I think this would be a merge(T)/contains(T) pair. But I think it is a
>> bad idea to confuse the general BloomFilter responsibility with the Hasher
>> responsibility. I suggest leaving the general structure as defined above in
>> place and create a TypedBloomFilter<T> class that works much as you
>> suggest. We have to make a number of assumptions. For example we assume
>> that BitSetBloomFilter is the appropriate implementation and the
>> DynamicHasher the appropriate Hasher.
>> 
>>     // Maps an item of type T to the data to be hashed, then delegates to
>>     // the underlying BitSetBloomFilter.
>>     public class TypedBloomFilter<T> extends BitSetBloomFilter {
>> 
>>         private final Function<T, DataSink> fn;
>> 
>>         public TypedBloomFilter(Function<T, DataSink> fn, Shape shape) {
>>             super(shape);
>>             this.fn = fn;
>>         }
>> 
>>         public void merge(T t) {
>>             merge(new DynamicHasher.Builder().with(fn.apply(t)));
>>         }
>> 
>>         public boolean contains(T t) {
>>             return contains(new DynamicHasher.Builder().with(fn.apply(t)));
>>         }
>>     }
> 
> IMO the general BloomFilter responsibility is storing indexes. I am 
> suggesting that the indexes always come from somewhere: an item T. So change 
> BloomFilter to be typed to <T> and require that the filter is provided with 
> an implementation that converts T to indexes.
> 
> If you want to store properties from an item in a filter then you can still 
> do this. You just have a more advanced implementation to convert T (the base 
> property type) to indexes.
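> 
> Roughly (a sketch only; IndexGenerator is the name I used earlier and nothing 
> here is final):
> 
>     public interface IndexGenerator<T> {
>         // The k bit indexes for a single item, each in the range [0, m).
>         long[] indexes(T item, Shape shape);
>     }
> 
>     public interface BloomFilter<T> {
>         void merge(T item);
>         boolean contains(T item);
>         // bit-level operations (e.g. an iterator over 64-bit words) stay here
>     }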
> 
> 
>> Thread Safety
>> 
>> I think we should note that Bloom filters are not generally safe for multithreaded
>> use without external synchronization.
> 
> OK. But as I said the search for items in a filter can be thread-safe since 
> no structural modification of the filter is performed. I think this should be 
> considered. It is the same for all other collections in the Java SDK that are 
> not explicitly concurrent. Query of them is thread safe as long as no 
> structural modification occurs during the query.
> 
>> Rework the HashFunction and HashFunctionIdentifier issues
>> 
>> We recognize that there are issues with HashFunctions and the identifier as
>> preserved in the Shape. I think that it is overly complex and prone to
>> error. Perhaps the proper solution is to remove the HasherIdentity from
>> Shape and create a method String getID() on Hasher. In most cases the
>> string would return the fully qualified class name; in cases where
>> different hashing algorithms can be defined the string should return the
>> class name prepended to the algorithm name or something similar. We can then
>> remove the HashFunction interface and all the stuff that derives from it.
>> Bloom filter equivalence will then depend upon Shape and Hasher equivalence.
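>> 
>> For example (illustrative only):
>> 
>>     public interface Hasher {
>>         // ... existing methods ...
>> 
>>         /**
>>          * Identifies the hashing strategy, e.g. the fully qualified class
>>          * name, optionally with the algorithm name appended.
>>          */
>>         String getID();
>>     }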
> 
> Dropping the entire HashFunctionIdentity from the public API is what I 
> suggest. Test equivalence of filters using their strategy to convert items to 
> indexes. But if a BloomFilter is to state equivalence using the Hasher then 
> it would need to store its own hasher. This is not currently done. If you do 
> this then you are moving one step closer to my suggestion of storing the 
> IndexGenerator<T> in the filter. This is effectively a dynamic hasher of 
> objects. I avoided the name Hasher for this purpose but it could be called 
> that.
> 
> 
> In summary I do not currently see a good reason why BloomFilter cannot be 
> typed. If there is a lot more code that you have to contribute it seems to 
> me that we should put together all the code and what you intend it to do. 
> It is the extra cases that you have that conflict with what I have 
> suggested. Since I do not know everything you currently do with 
> BloomFilters it may be better to see it all before deciding the design for 
> commons collections. 
> 
> It is important to consider what commons-collections is trying to achieve. It 
> is a stable API for code structures that are used to collectively group 
> items. Almost all of it is generic collections typed to items. Any new code 
> should be compatible with this scope. Since we maintain a stable API it is 
> essential that the API is correct on first release. This means we should 
> consider all future functionality that we are to support. It is made easier 
> by restricting functionality to coherent blocks that should not require a 
> change.
> 
> What do you think to either:
> 
> 1. moving all this to a bloom-filters development branch and working on it 
> there while the API is sorted to include the remaining code
> 2. limiting the scope of the contribution to commons-collections to basic 
> BloomFilter functionality
> 
> 1. This option may be close to just taking over maintenance of your current 
> code base. If the code is generally applicable then fine. From what has so 
> far been contributed I feel that we have made a lot of changes to clarify 
> things. If this continues then further code will require a fair bit more work 
> and a development branch seems a suitable action.
> 
> 2. This should provide a structure for BloomFilters similar to others I can 
> find for Java (e.g. Guava). I suggest we offer something more than other 
> implementations by allowing configuration of the hashing of items. It would 
> allow advanced usage to optimise item hashing. This is in line with what the 
> current code allows. But it may mean stripping the API to a minimal 
> first release.
> 

Attempting to re-awaken this thread.

IMO the bloomfilter package is currently under development. The original 
contributor was working through submitting Bloom filter functionality in parts. 
My involvement was to ensure the current code that was initially merged was 
documented (in javadoc) and functioned as described. This led to changes of the 
current functionality and ideas for changes. There were many ideas in this 
thread history that were discussed, some agreed and not yet implemented, some 
still open for discussion.

Development has stalled and the vision for the final package has not been 
completed. At present the bloomfilter added to collections does not use 
generics and it has very little to do with java.util.Collection. So is it even 
in the remit of commons-collections?

I suggested moving to a development branch while the package functionality is 
stabilised if any other contributors wish to work on this. I have left 
development alone as it seems redundant to progress without a defined scope for 
the ultimate functionality. 

Opinions on how to proceed?

Alex
