On Fri, Jun 19, 2009 at 1:36 PM, Matthew Toseland <toad at amphibian.dyndns.org> wrote:
> On Friday 19 June 2009 17:37:00 Robert Hailey wrote:
>>
>> On Jun 18, 2009, at 9:07 PM, Matthew Toseland wrote:
>>
>> > CHUNK SIZE PROBLEM:
>> > ===================
>> >
>> > Current plans call to split the keyspace up for each datastore, and
>> > assign keys to a manageable sized bloom filter for a section of the
>> > (hashed) keyspace. These can then be transferred separately, are
>> > useful immediately after being transferred, can be juggled in
>> > memory, etc. However, we cannot guarantee that the population of such
>> > a chunk will be small enough for the filter to be effective. Solving
>> > this requires either moving the boundaries dynamically (possibly
>> > continually), or making the chunks significantly larger than they
>> > would need to be in an ideal world.
>>
>> Right... but I believe we prescribed that the split would be based on
>> a hashed value of the key and not by logical keyspace location to
>> avoid disproportionate chunks.
>>
>> That is to say... ideally a node is going to get a disproportionate
>> amount of cache/store data about its network location.
>>
>> STORE_SIZE/MAX_BLOOM_CHUNK -> N_CHUNKS
>> H(key, N_CHUNKS) => n & (0 < n < N)
>> CHUNK[n].add(key)
>>
>> Wouldn't the problem be reduced to finding a well-scattering hash
>> function then?
>
> Yes and no. How much variation will we have even if we divide by hashed
> keyspace? Hence how much bigger than the ideal splitting size do we need each
> chunk to be to maintain approximately the right false positive ratio?
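For concreteness, the assignment Robert describes would be something like the
following. This is only a rough sketch; the class and method names are made up
for illustration and this is not actual Freenet code.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch of hash-based chunk assignment (illustrative names only).
    public class ChunkAssignment {

        // Map a routing key to one of nChunks bloom filters by hashing the
        // key and reducing modulo the chunk count. Any well-mixing hash will
        // do; SHA-256 is used here purely for illustration.
        public static int chunkFor(byte[] routingKey, int nChunks)
                throws NoSuchAlgorithmException {
            byte[] h = MessageDigest.getInstance("SHA-256").digest(routingKey);
            // Take the first four bytes of the digest as a non-negative int,
            // then reduce modulo the number of chunks.
            int v = ((h[0] & 0x7F) << 24) | ((h[1] & 0xFF) << 16)
                  | ((h[2] & 0xFF) << 8) | (h[3] & 0xFF);
            return v % nChunks;
        }
    }

As long as the hash mixes well, which chunk a key lands in is effectively
uniform and independent of its keyspace location, which is what the numbers
below assume.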
If the hash spreads things well, the number of keys in a single bloom filter
should be approximately normally distributed, with mean
(total keys) / (number of filters) and standard deviation
sqrt((total keys) * (1 / (number of filters)) * (1 - 1 / (number of filters))).
(It's really a binomial distribution; any given key has a
1 / (number of filters) chance of landing in a specific filter.)

For a 100GiB total store, we have about 3E6 keys between store and cache (for
CHKs; the same again for SSKs). That implies roughly 17MiB of bloom filters.
If we size the filters at 1MiB each, giving 17 filters, we have about 176k
keys per filter on average. From the formula above, the standard deviation is
about 408 keys, i.e. well under 1% of the mean.

Size variation is only a serious concern if the hash function is not
distributing the keys at random. To be safe, we could slightly underfill the
filters.

Evan Daniel
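P.S. If anyone wants to check the arithmetic, here is a throwaway snippet. The
3E6 keys and 17 filters are just the figures above; everything else follows
from them.

    // Quick check of the figures above (numbers are the ones quoted in this mail).
    public class FilterSizing {
        public static void main(String[] args) {
            double totalKeys = 3e6;  // ~3E6 CHKs across store + cache for a 100GiB node
            int numFilters = 17;     // 17 bloom filters of ~1MiB each

            double p = 1.0 / numFilters;
            double mean = totalKeys * p;                        // expected keys per filter
            double stddev = Math.sqrt(totalKeys * p * (1 - p)); // binomial standard deviation

            System.out.printf("mean keys per filter: %.0f%n", mean);   // ~176471
            System.out.printf("standard deviation:   %.0f%n", stddev); // ~408
        }
    }

Even a filter that comes in a few standard deviations over the mean only has
on the order of a thousand extra keys, so its false positive rate should
barely change.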
