On Fri, May 1, 2009 at 3:37 PM, Robert Hailey <robert at freenetproject.org> wrote:
>
> On May 1, 2009, at 9:35 AM, Matthew Toseland wrote:
>
>> IMPLEMENTING IT:
>> Main tasks:
>> - Converting our datastore indexes into a compact, slightly more lossy, 1-bit filter. (Easy)
>> - Creating a snapshot once an hour, and keeping for some period (1 week?). (Easy)
>> - Creating an efficient binary diff format for hour-to-hour diffs, and keeping them. (Moderate)
>
> This might actually be quite hard as an efficient bloom filter will scatter
> (even a few) updates all over the filter.
Actually, it's not hard at all. The naive version is to observe that only a tiny fraction of the bits in the filter changes each hour, and just record the location of each bit that changed. At 0.24 writes per second, you'll get on average 0.24*3600*2 keys changed per hour (each write represents one add and one delete), or 0.24*3600*2*16 counters changed per hour. Unless I'm mistaken, changing a counter in the counting bloom filter has a ~50% probability of changing the corresponding bit in the compressed non-counting version. So that means 0.24*3600*2*16*0.5 = 13824 bits changed per hour. The address of a bit can be represented in 32 bits trivially (for bloom filters < 512MiB in size), so the 1-hour diff should consume 13824*32/8 = 55296 bytes. That works out to 15.36 bytes/s of traffic for each peer, or 307.2 B/s across 20 peers.

That encoding isn't terribly efficient. More efficient is to sort the addresses and compute the deltas. (So if bits 19, 7, and 34 changed, I send the numbers 7, 12, and 15.) Those deltas should follow a geometric distribution with parameter (number of bits changed) / (size of filter), i.e. with mean (size of filter) / (number of bits changed). It's easy to build an arithmetic coding for that data that will achieve near-perfect compression (see http://en.wikipedia.org/wiki/Arithmetic_coding for example). My back-of-the-envelope estimate is that with toad's 84MiB filter the deltas would compress to about 14.5 bits per address (instead of the 30 or 32 you'd get with no compression; gzip or LZW should land somewhere in between). I've appended rough sketches of both encodings below.

Evan
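Here is a rough sketch of the naive encoding, in Java. The class and method names are mine for illustration, not anything in the Freenet codebase; also note that java.util.BitSet caps the filter at 2^31 bits (256MiB), which comfortably covers the 84MiB filter discussed here:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.BitSet;

/** Naive hour-to-hour diff: one 32-bit address per changed bit. */
public class NaiveFilterDiff {

    /** Returns the diff as a sequence of big-endian 32-bit bit addresses. */
    public static byte[] diff(BitSet oldSnapshot, BitSet newSnapshot) throws IOException {
        // XOR leaves a 1 exactly where the two snapshots disagree.
        BitSet changed = (BitSet) oldSnapshot.clone();
        changed.xor(newSnapshot);

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = changed.nextSetBit(0); i >= 0; i = changed.nextSetBit(i + 1)) {
            out.writeInt(i); // 32 bits per changed bit, as estimated above
        }
        return buf.toByteArray();
    }
}

Applying the diff on the receiving side is just the same operation in reverse: flip each listed bit in the old snapshot.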

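And a sketch of the delta encoding. I've substituted Golomb-Rice coding for the arithmetic coder here because it's far shorter to write down, and Golomb codes are also near-optimal for geometrically distributed gaps; a real arithmetic coder would squeeze out the last fraction of a bit. Again, all names are mine, not anything that exists in the codebase:

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class DeltaDiffEncoder {

    /** Minimal MSB-first bit writer backed by a growable byte buffer. */
    private static final class BitWriter {
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        private int current = 0, used = 0;

        void writeBit(int bit) {
            current = (current << 1) | (bit & 1);
            if (++used == 8) { buf.write(current); current = 0; used = 0; }
        }

        void writeBits(long value, int count) { // low `count` bits, MSB first
            for (int i = count - 1; i >= 0; i--) writeBit((int) (value >>> i));
        }

        byte[] finish() {
            if (used > 0) buf.write(current << (8 - used)); // zero-pad last byte
            return buf.toByteArray();
        }
    }

    /**
     * Sorts the changed-bit addresses and Rice-codes the gaps between them,
     * e.g. addresses {19, 7, 34} become deltas 7, 12, 15 as above.
     */
    public static byte[] encode(long[] changedAddresses, long filterSizeBits) {
        long[] addrs = changedAddresses.clone();
        Arrays.sort(addrs);

        // Mean gap = (filter size) / (bits changed); picking the Rice
        // parameter k ~ log2(mean gap) is the standard heuristic for
        // geometric data.
        double meanGap = (double) filterSizeBits / addrs.length;
        int k = Math.max(0, (int) Math.floor(Math.log(meanGap) / Math.log(2)));

        BitWriter out = new BitWriter();
        long prev = 0;
        for (long addr : addrs) {
            long delta = addr - prev;
            prev = addr;
            long q = delta >>> k;
            for (long i = 0; i < q; i++) out.writeBit(1); // quotient in unary
            out.writeBit(0);                              // terminator
            out.writeBits(delta, k);                      // k-bit remainder
        }
        return out.finish();
    }
}

With the numbers above (13824 gaps averaging ~51000 bits, so k around 15), Rice coding costs roughly 17 to 18 bits per address: a little above my 14.5-bit arithmetic-coding estimate, but still well under the naive 32.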