On Fri, May 1, 2009 at 3:37 PM, Robert Hailey <robert at freenetproject.org> wrote:
>
> On May 1, 2009, at 9:35 AM, Matthew Toseland wrote:
>
>> IMPLEMENTING IT:
>> Main tasks:
>> - Converting our datastore indexes into a compact, slightly more lossy, 1-bit filter. (Easy)
>> - Creating a snapshot once an hour, and keeping for some period (1 week?). (Easy)
>> - Creating an efficient binary diff format for hour-to-hour diffs, and keeping them. (Moderate)
>
> This might actually be quite hard as an efficient bloom filter will scatter
> (even a few) updates all over the filter.
Actually, it's not hard at all. The naive version is to observe that only a tiny fraction of the bits in the filter changes each hour, and just record the location of each bit that changed. At 0.24 writes per second, you'll get on average 0.24*3600*2 keys changed per hour (each write represents one add and one delete), or 0.24*3600*2*16 counters changed per hour. Unless I'm mistaken, changing a counter in the counting bloom filter has a ~50% probability of changing the corresponding bit in the compressed non-counting version. So that means 0.24*3600*2*16*0.5 = 13824 bits changed per hour. The address of a bit can be represented in 32 bits trivially (for bloom filters < 512MiB in size), so the 1-hour diff should consume 13824*32/8 = 55296 bytes. That works out to 15.36 bytes/s of traffic for each peer, or 307.2 B/s across 20 peers.

That encoding isn't terribly efficient. More efficient is to sort the addresses and compute the deltas. (So if bits 19, 7, and 34 changed, I send the numbers 7, 12, and 15.) Those deltas should follow a geometric distribution with parameter (number of bits changed) / (size of filter), i.e. with mean (size of filter) / (number of bits changed). It's easy to build an arithmetic coding for that data that will achieve near-perfect compression (see http://en.wikipedia.org/wiki/Arithmetic_coding for example). My back-of-the-envelope estimate is that with toad's 84MiB filter the deltas would compress to about 14.5 bits per address (instead of the 30 or 32 you'd get with no compression; gzip or LZW should land somewhere in between). I've appended rough sketches of both encodings below.

Evan
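Here is a rough sketch of the naive encoding, in Java. The class and method names are mine for illustration, not anything in the Freenet codebase; also note that java.util.BitSet caps the filter at 2^31 bits (256MiB), which comfortably covers the 84MiB filter discussed here:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.BitSet;

/** Naive hour-to-hour diff: one 32-bit address per changed bit. */
public class NaiveFilterDiff {

    /** Returns the diff as a sequence of big-endian 32-bit bit addresses. */
    public static byte[] diff(BitSet oldSnapshot, BitSet newSnapshot) throws IOException {
        // XOR leaves a 1 exactly where the two snapshots disagree.
        BitSet changed = (BitSet) oldSnapshot.clone();
        changed.xor(newSnapshot);

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = changed.nextSetBit(0); i >= 0; i = changed.nextSetBit(i + 1)) {
            out.writeInt(i); // 32 bits per changed bit, as estimated above
        }
        return buf.toByteArray();
    }
}

Applying the diff on the receiving side is just the same operation in reverse: flip each listed bit in the old snapshot.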

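And a sketch of the delta encoding. I've substituted Golomb-Rice coding for the arithmetic coder here because it's far shorter to write down, and Golomb codes are also near-optimal for geometrically distributed gaps; a real arithmetic coder would squeeze out the last fraction of a bit. Again, all names are mine, not anything that exists in the codebase:

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class DeltaDiffEncoder {

    /** Minimal MSB-first bit writer backed by a growable byte buffer. */
    private static final class BitWriter {
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        private int current = 0, used = 0;

        void writeBit(int bit) {
            current = (current << 1) | (bit & 1);
            if (++used == 8) { buf.write(current); current = 0; used = 0; }
        }

        void writeBits(long value, int count) { // low `count` bits, MSB first
            for (int i = count - 1; i >= 0; i--) writeBit((int) (value >>> i));
        }

        byte[] finish() {
            if (used > 0) buf.write(current << (8 - used)); // zero-pad last byte
            return buf.toByteArray();
        }
    }

    /**
     * Sorts the changed-bit addresses and Rice-codes the gaps between them,
     * e.g. addresses {19, 7, 34} become deltas 7, 12, 15 as above.
     */
    public static byte[] encode(long[] changedAddresses, long filterSizeBits) {
        long[] addrs = changedAddresses.clone();
        Arrays.sort(addrs);

        // Mean gap = (filter size) / (bits changed); picking the Rice
        // parameter k ~ log2(mean gap) is the standard heuristic for
        // geometric data.
        double meanGap = (double) filterSizeBits / addrs.length;
        int k = Math.max(0, (int) Math.floor(Math.log(meanGap) / Math.log(2)));

        BitWriter out = new BitWriter();
        long prev = 0;
        for (long addr : addrs) {
            long delta = addr - prev;
            prev = addr;
            long q = delta >>> k;
            for (long i = 0; i < q; i++) out.writeBit(1); // quotient in unary
            out.writeBit(0);                              // terminator
            out.writeBits(delta, k);                      // k-bit remainder
        }
        return out.finish();
    }
}

With the numbers above (13824 gaps averaging ~51000 bits, so k around 15), Rice coding costs roughly 17 to 18 bits per address: a little above my 14.5-bit arithmetic-coding estimate, but still well under the naive 32.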