> On 17 Mar 2020, at 22:34, Claude Warren <cla...@xenei.com> wrote:
> 
> Builder discussion:
> 
> Let's go with
> 
>>> Builder with(CharSequence, Charset);
>>> Builder withUnencoded(CharSequence);

Added to master.

I note already that not mandating UTF-8 is annoying. I had to include 
StandardCharsets in a lot of places in the test code. So perhaps we add:

Builder withUtf8(CharSequence)
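
A sketch of what that could look like as a default method, assuming the 
Builder interface keeps with(byte[]) as its primitive operation (it mirrors 
the with(CharSequence, Charset) default already on master, with the charset 
fixed):

```java
import java.nio.charset.StandardCharsets;

interface Builder {
    Builder with(byte[] item);

    /**
     * Adds a character sequence item to the hasher encoded as UTF-8.
     *
     * @param item the item to add
     * @return a reference to this object
     */
    default Builder withUtf8(CharSequence item) {
        return with(item.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```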

> 
> Shape Discussion:
> 
> as for getNumberOfBytes() it should return the maximum number of bytes
> returned by a getBits() call to a filter with this shape.  So yes, if there
> is a compressed internal representation, no it won't be that.  It is a
> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
> 8.0 )
> 
> Basically, if you want to create an array that will fit all the bits
> returned by BloomFilter.iterator() you need an array of
> Shape.getNumberOfBytes().  And that is actually what I use it for.

Then you are also mapping the index to a byte index and a bit within the byte. 
If you are doing both of these mappings then this is something you should 
control yourself.
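
For reference, a sketch of the two mappings: size the array as ceil(bits / 8) 
and pack each index to a byte index and a bit within it (helper names here 
are illustrative, not part of the API):

```java
class BitPacking {
    /** Creates a byte[] large enough to hold the given number of bits. */
    static byte[] newBitArray(int numberOfBits) {
        // Equivalent to (int) Math.ceil(numberOfBits / 8.0)
        return new byte[(numberOfBits + 7) / 8];
    }

    /** Sets the bit at the given index: byte index = index / 8, bit = index % 8. */
    static void setBit(byte[] bytes, int index) {
        bytes[index >>> 3] |= 1 << (index & 7);
    }

    /** Reads the bit at the given index. */
    static boolean getBit(byte[] bytes, int index) {
        return (bytes[index >>> 3] & (1 << (index & 7))) != 0;
    }
}
```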

> 
> Bloom Filter Discussion:
> 
> I have not used probability() on a single filter, just on the Shape.
> However, I can see the value of it.  It seems to me that the difference
> between Shape.probability() and BloomFilter.probability() would give you an
> indication of how many collisions have already occurred.
> 
> In the SetOperations class there is an "estimateSize()" method.  It is the
> only single Bloom filter argument method in the class and I think it should
> be moved into BloomFilter so we have 2 new methods:
> 
> probability()
> estimateSize()
> 
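If estimateSize() is the usual Bloom filter cardinality estimate then it is 
n ≈ -(m/k) ln(1 - c/m) for c set bits; a sketch (I am assuming this matches 
what SetOperations computes):

```java
class SizeEstimate {
    /**
     * Estimated number of items in a Bloom filter with cardinality (set
     * bits) c, m bits and k hash functions per item:
     * n ~= -(m / k) * ln(1 - c / m).
     */
    static double estimateSize(int cardinality, int m, int k) {
        return -((double) m / k) * Math.log1p(-(double) cardinality / m);
    }
}
```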
> Counting Filter Discussion:
> 
> As for counting filters we could implement several
> int
> short
> byte

Yes. Each supports adding a maximum number of the same item. Since we do not 
know the use case, leaving it at int only for now is easiest.

The alternative is duplicating the logic for each backing storage, removing the 
public constructor, making the class abstract and providing a factory 
constructor:

ArrayCountingBloomFilter bf = ArrayCountingBloomFilter.create(shape, 
maximumDuplicateItems);

The actual instance returned is then based on the capacity required to store 
the duplicates.
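
A sketch of that factory (all names here are hypothetical; the byte/short/int 
backed subclasses do not exist), choosing the narrowest storage that can hold 
the requested maximum:

```java
abstract class CountingFilterSketch {
    // Hypothetical subclasses, one per backing storage width.
    static final class ByteBacked extends CountingFilterSketch {}
    static final class ShortBacked extends CountingFilterSketch {}
    static final class IntBacked extends CountingFilterSketch {}

    /**
     * Returns an instance whose backing storage can hold the expected
     * maximum count at any index.
     */
    static CountingFilterSketch create(int maximumDuplicateItems) {
        if (maximumDuplicateItems <= Byte.MAX_VALUE) {
            return new ByteBacked();
        }
        if (maximumDuplicateItems <= Short.MAX_VALUE) {
            return new ShortBacked();
        }
        return new IntBacked();
    }
}
```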

However the counts at each index are random and may exceed the number of 
duplicates by chance. I don’t want to go down the route of requiring 
probability computations in the factory constructor for the likelihood of 
exceeding the capacity.

A simple approach would be to have a byte[] version for when you do not expect 
to add duplicates. This filter would simply function to allow removing items 
you previously added. The probabilities I previously listed show that reaching 
a count of 127 by random chance is < 1e-100 if the filter is reasonably big. 
We should at least provide a link to this computation. It requires a binomial 
distribution and collections does not currently depend on commons-math.
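
The computation itself is small enough to inline without a commons-math 
dependency; a sketch of the binomial mass function in log space, which 
reproduces the table below for nk trials with success probability 1/m:

```java
class BinomialSketch {
    /**
     * Probability mass function of Binomial(trials, p) at k, computed in
     * log space to avoid overflow of the binomial coefficient.
     */
    static double pmf(int k, int trials, double p) {
        double logBinomialCoefficient = 0;
        for (int i = 1; i <= k; i++) {
            logBinomialCoefficient += Math.log((double) (trials - k + i) / i);
        }
        return Math.exp(logBinomialCoefficient
                + k * Math.log(p)
                + (trials - k) * Math.log1p(-p));
    }
}
```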

The int[] version should be used when you expect to be able to add duplicates 
and want to use the contains(Hasher, count) function.

> 
> Currently they would all have to return int counts but realistically that
> is what would be used in code anyway.  Also, once primitives can be used in
> generics this will be easier.
> 
> As for contains( Hasher, int ), I think we would need to add contains(
> BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
> would test that a BloomFilter has been added X times or rather that there
> are enough counts in the right places to make it appear that BloomFilter
> has been added X times.  When used with a Hasher, remove the duplicates,
> and perform the same test.
> 
> I see no reason not to add them.

OK. 

> 
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> 
>> 
>> 
>>> On 17 Mar 2020, at 17:06, Claude Warren <cla...@xenei.com> wrote:
>>> 
>>> On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <alex.d.herb...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>>> On 17 Mar 2020, at 15:41, Claude Warren <cla...@xenei.com> wrote:
>>>>> 
>>>>> I agree with the HashFunction changes.
>>>> 
>>>> OK, but which ones?
>>>> 
>>> 
>>> DOH! this one...
>>> 
>>>> 
>>>> Changing HashFunction to have two methods:
>>>> 
>>>> long hash(byte[])
>>>> long increment(int seed)
>> 
>> OK. I’ll update.
>> 
>>>> 
>>>>> I think Builder should have
>>>>> with(byte[])
>>>>> with(byte[], int offset, int len )
>>>> 
>>>> Not convinced here. The HashFunction requires a byte[] and cannot
>> operate
>>>> on a range. This change should be made in conjunction with a similar
>> change
>>>> to HashFunction. So should we update HashFunction to:
>>>> 
>>>> 
>>> Given the depth of the change let's just leave the with( byte[] )
>>> 
>>> 
>>>>> with(String)
>>>>> 
>>>>> I find that I use with(String) more than any other with() method.
>>>> 
>>>> That may be so but String.getBytes(Charset) is trivial to call for the
>>>> user. Then they get to decide on the encoding and not leave it to the
>>>> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice
>> as a
>>>> cross-language standard. Leave it out of the API for now, or add both:
>>>> 
>>>> Builder with(CharSequence, Charset);
>>>> Builder withUnencoded(CharSequence);
>>>> 
>>> 
>>> CharSequence has no easy method to convert to a byte[]. While it could be
>>> done, it looks to be more of a streaming interface.  Let's leave that
>> out.
>> 
>> I was thinking:
>> 
>>        /**
>>         * Adds a character sequence item to the hasher using the
>> specified encoding.
>>         *
>>         * @param item the item to add
>>         * @param charset the character set
>>         * @return a reference to this object
>>         */
>>        default Builder with(CharSequence item, Charset charset) {
>>            return with(item.toString().getBytes(charset));
>>        }
>> 
>>        /**
>>         * Adds a character sequence item to the hasher. Each 16-bit
>> character is
>>         * converted to 2 bytes using little-endian order.
>>         *
>>         * @param item the item to add
>>         * @return a reference to this object
>>         */
>>        default Builder withUnencoded(CharSequence item) {
>>            final int length = item.length();
>>            final byte[] bytes = new byte[length * 2];
>>            for (int i = 0; i < length; i++) {
>>                final char ch = item.charAt(i);
>>                bytes[i * 2] = (byte) ch;
>>                bytes[i * 2 + 1] = (byte) (ch >>> 8);
>>            }
>>            return with(bytes);
>>        }
>> 
>>> 
>>> 
>>>> I would argue that you may use BloomFilters for Strings but if we see a
>>>> BloomFilter as a collection then we should really support all Objects
>> (with
>>>> a decorator or by typing the Builder) or not support Objects. Currently
>> we
>>>> are not supporting any Object so for now would drop this and the
>>>> Hasher.Builder then becomes a very simple API that specifies that you
>> put
>>>> in items represented as a byte[] and call build to create a Hasher
>>>> containing those items and reset for further use.
>>>> 
>>> 
>>> I have code example in several places where I hash GeoCode entities.
>> Since
>>> they are comprised of strings, for the most part, building a hasher for
>>> them simply requires hashing the Strings.  Many web services use JSON and
>>> most JSON is string based.  I disagree with removing with(String) because
>>> it is so convenient in so many cases.  It also makes the code
>>> cleaner/easier to read.  But if you feel strongly about removing it then
>> OK.
>> 
>> So with(String) is replaced by a default with(CharSequence, Charset)
>> method.
>> 
>>> 
>>> The only other thing that is really bothersome is the lack of
>>> Shape.getNumberOfBytes().  Yes it is easy to call Math.ceil(
>>> Shape.getNumberOfBits() / 8.0 ).  But getNumberOfBytes() is much more
>>> readable in the code.
>> 
>> I removed this as the number of bytes is implementation dependent, for
>> example if using a compressed BloomFilter.
>> 
>> If the number of bytes is intended to be for the uncompressed
>> representation then this does not match up with the canonical format of the
>> BloomFilter using a long[].
>> 
>> Q. Why do you need the number of bytes?
>> 
>> 
>> One thing we do not have is the computation for a false probability using
>> the current cardinality (this is in the Guava code):
>> 
>> p(false positive) = (cardinality() / m) ^ k
>> 
>> Which is just a probability that you will choose all k indexes as ones
>> that are already set. This is not on the wikipedia page. Is this something
>> you have used?
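
For reference, a sketch of that computation (class and method names 
illustrative):

```java
class FilterProbability {
    /**
     * False positive probability given the current cardinality (number of
     * set bits) of a filter of m bits using k hash functions: the chance
     * that all k indexes of a new item land on already-set bits.
     */
    static double probability(int cardinality, int m, int k) {
        return Math.pow((double) cardinality / m, k);
    }
}
```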
>> 
>> 
>> Also on the wikipedia page is a section on CountingBloomFilters where each
>> bucket is stated as typically having 3-4 bits. Using 4 bits would allow
>> counting to 16 for each index. It would be more space efficient than the
>> current use of 32-bit integers. Changing to 8-bit bytes is a simple
>> compromise. There is also a main page on counting Bloom filters [1]:
>> 
>> For a typical shape with p = 1e-6 and n items we have:
>> 
>> n = 1000: m = 28756, k = 20
>> n = 10000: m = 287552, k = 20
>> 
>> Worst case scenario is each item added sets the same index and count could
>> be equal to n. But if the hash is uniform then the count for each bit will
>> be much lower. It is taken from the binomial distribution using number of
>> trials = nk, and probability of success = 1/m. Here are the expected counts
>> (l) for an index when the filter is filled with n items (p is just used to
>> set optimal k and m using n):
>> 
>> n       k     m        p           l    binom(l, nk, 1/m)
>> 1000    20    28756    1.00E-06    0    0.498815437
>> 1000    20    28756    1.00E-06    1    0.346941705
>> 1000    20    28756    1.00E-06    2    0.12064836
>> 1000    20    28756    1.00E-06    4    0.004862559
>> 1000    20    28756    1.00E-06    8    6.76619E-07
>> 1000    20    28756    1.00E-06    16   7.10853E-17
>> 1000    20    28756    1.00E-06    32   1.66389E-41
>> 1000    20    28756    1.00E-06    64   2.8772E-100
>> 1000    20    28756    1.00E-06    127  1.0449E-234
>> 10000   20    287552   1.00E-06    0    0.498811214
>> 10000   20    287552   1.00E-06    1    0.346937562
>> 10000   20    287552   1.00E-06    2    0.120651928
>> 10000   20    287552   1.00E-06    4    0.004863763
>> 10000   20    287552   1.00E-06    8    6.77448E-07
>> 10000   20    287552   1.00E-06    16   7.14657E-17
>> 10000   20    287552   1.00E-06    32   1.70126E-41
>> 10000   20    287552   1.00E-06    64   3.15E-100
>> 10000   20    287552   1.00E-06    127  1.4984E-234
>> 
>> So being able to store counts up to the int max value is total overkill. We
>> could use 4 bits per counter. But this is harder to do with the current design
>> that uses the sign of entries as a guard digit for overflow or negative
>> counts. A simple switch is to drop to a byte[] and allow counts up to 127.
>> Any overflow would mark the state invalid. This makes the
>> ArrayCountingBloomFilter 4 times more space efficient.
>> 
>> Note that a counting Bloom filter is also described to have functionality
>> we do not currently have. We can test for membership of a Hasher (e.g. an
>> item) within the CountingBloomFilter but not query for a set number of the
>> same item. Should this be added to the interface:
>> 
>> boolean contains(Hasher hasher, int count)
>> 
>> Here the Hasher is an issue as it can represent multiple items which could
>> overlap indexes. In this case the 'contains a specified count' method is
>> invalid as the expected count for those indexes that are overlapping would
>> be higher.
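
To make the overlap issue concrete, here is a sketch of the check over raw 
index and count arrays (names hypothetical, not the proposed API): a correct 
implementation has to tally duplicate indexes from the Hasher and scale the 
expected count at each index accordingly.

```java
import java.util.HashMap;
import java.util.Map;

class ContainsCount {
    /**
     * Tests whether the counting filter could contain 'count' copies of the
     * item(s) behind the given index stream. An index that occurs t times in
     * the stream must hold at least t * count in the filter.
     */
    static boolean contains(int[] indexes, int[] counts, int count) {
        // Tally how many times each index occurs in the stream.
        Map<Integer, Integer> occurrences = new HashMap<>();
        for (int index : indexes) {
            occurrences.merge(index, 1, Integer::sum);
        }
        for (Map.Entry<Integer, Integer> e : occurrences.entrySet()) {
            if (counts[e.getKey()] < (long) count * e.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```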
>> 
>> [1] https://en.wikipedia.org/wiki/Counting_Bloom_filter
>>> 
>>> Claude
>> 
>> 
> 
> -- 
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
