On Fri, May 1, 2009 at 7:59 PM, Matthew Toseland <toad at amphibian.dyndns.org> wrote: > On Saturday 02 May 2009 00:53:27 Evan Daniel wrote: >> On Fri, May 1, 2009 at 7:01 PM, Matthew Toseland >> <toad at amphibian.dyndns.org> wrote: >> > On Friday 01 May 2009 22:43:50 Robert Hailey wrote: >> >> >> >> On May 1, 2009, at 3:46 PM, Evan Daniel wrote: >> >> >> >> > Yes, that's a question worth considering. ?There are both performance >> >> > and security issues involved, I think. ?Note that the partition could >> >> > be a set of contiguous regions (allowing performance optimization >> >> > around which piece of the keyspace you send info about), but it could >> >> > just as easily be determined by a hash function instead. >> >> > >> >> > You still check the same number of filters overall -- one per peer. >> >> > The difference is that for some peers you may have a partial filter >> >> > set, and therefore sometimes check their filters, instead of deciding >> >> > you don't have the memory for that peer's filter and never checking >> >> > it. >> >> >> >> Maybe if we partition it we can also get a free datastore histogram on >> >> the stats page. >> > >> > No, we cannot divide by actual keyspace, the keys must be hashed first, or > the >> > middle bloom filter will be far too big. >> >> Well, you could partition by actual keyspace as long as the partitions >> are (approximately) equal in population rather than fraction of the >> keyspace they cover. ?Doing that is only mildly nontrivial, and gives >> you a histogram with variable-width bars, but still an accurate >> histogram. ?Each bar would cover the same area; tall and skinny near >> the node's location, wide and short away from it. > > And if the distribution was ever to change ... it's not mildly nontrivial > IMHO. > > Anyway, the first version should simply use the existing filters, to make > things easy. >
You have to recompute the bloom filter occasionally regardless, otherwise the counters eventually all saturate. If the distribution changes enough to be problematic, you rebalance the filters. If you're willing to be slightly inefficient, you can avoid recomputing all the filters. When one filter starts getting full, you split its portion of the keyspace in half and create a pair to take its place. On average your filters will only be 3/4 full or so, but you reduce the computational load (though probably not the disk load? hmm...) Or you could just partition the keyspace by hashing and trust that equal size partitions will have equal populations. Evan Daniel
