On Saturday 20 June 2009 03:38:03 Evan Daniel wrote:
> On Fri, Jun 19, 2009 at 9:15 PM, Juiceman<juiceman69 at gmail.com> wrote:
> > On Fri, Jun 19, 2009 at 1:50 PM, Evan Daniel<evanbd at gmail.com> wrote:
> >> On Fri, Jun 19, 2009 at 1:36 PM, Matthew
> >> Toseland<toad at amphibian.dyndns.org> wrote:
> >>> On Friday 19 June 2009 17:37:00 Robert Hailey wrote:
> >>>>
> >>>> On Jun 18, 2009, at 9:07 PM, Matthew Toseland wrote:
> >>>>
> >>>> > CHUNK SIZE PROBLEM:
> >>>> > ===================
> >>>> >
> >>>> > Current plans call for splitting the keyspace up for each datastore
> >>>> > and assigning keys to a manageably sized bloom filter for a section
> >>>> > of the (hashed) keyspace. These can then be transferred separately,
> >>>> > are useful immediately after being transferred, can be juggled in
> >>>> > memory, etc. However, we cannot guarantee that the population of
> >>>> > such a chunk will be small enough for the filter to be effective.
> >>>> > Solving this requires either moving the boundaries dynamically
> >>>> > (possibly continually), or making the chunks significantly larger
> >>>> > than they would need to be in an ideal world.
> >>>>
> >>>> Right... but I believe we prescribed that the split would be based on
> >>>> a hashed value of the key and not by logical keyspace location, to
> >>>> avoid disproportionate chunks.
> >>>>
> >>>> That is to say... ideally a node is going to get a disproportionate
> >>>> amount of cache/store data about its network location.
> >>>>
> >>>> STORE_SIZE/MAX_BLOOM_CHUNK -> N_CHUNKS
> >>>> H(key, N_CHUNKS) => n & (0 < n < N)
> >>>> CHUNK[n].add(key)
> >>>>
> >>>> Wouldn't the problem be reduced to finding a well-scattering hash
> >>>> function then?
> >>>
> >>> Yes and no. How much variation will we have even if we divide by
> >>> hashed keyspace? Hence how much bigger than the ideal splitting size
> >>> do we need each chunk to be to maintain approximately the right false
> >>> positive ratio?
> >>
> >> If the hash spreads things well, the number of keys in a single bloom
> >> filter should be normally distributed with mean ((total keys) /
> >> (number of filters)) and standard deviation sqrt((total keys) * (1 /
> >> (number of filters)) * (1 - 1 / (number of filters))). (It's a
> >> binomial distribution; any given key has a 1/(number of filters)
> >> chance of landing in a specific filter.) For a 100GiB total store, we
> >> have 3E6 keys between store and cache (for CHKs, same for SSKs). That
> >> implies 17MiB of bloom filters. If we size the filters at 1MiB each,
> >> with 17 filters, we have 176k keys per filter on average. From the
> >> preceding formula, the standard deviation is 408 keys.
> >>
> >> Size variation is only a serious concern if the hash function is not
> >> distributing the keys at random. To be safe, we could slightly
> >> underfill the filters.
> >>
> >> Evan Daniel
> >> _______________________________________________
> >> Devl mailing list
> >> Devl at freenetproject.org
> >> http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
> >>
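For anyone who wants to reproduce those figures, here is a throwaway
back-of-the-envelope check (plain Java, nothing to do with the actual fred
code; the constants are just Evan's estimates above):

// Each filter's population is binomial(totalKeys, 1/numFilters) if the
// hash scatters keys uniformly across the chunks.
public class FilterSizingCheck {
    public static void main(String[] args) {
        double totalKeys = 3e6;   // ~100GiB store+cache worth of CHKs (Evan's estimate)
        int numFilters = 17;      // 1MiB filters, ~17MiB of filters in total
        double p = 1.0 / numFilters;
        double mean = totalKeys * p;                        // ~176k keys per filter
        double sd = Math.sqrt(totalKeys * p * (1.0 - p));   // ~408 keys
        System.out.printf("mean = %.0f keys/filter, s.d. = %.0f keys%n", mean, sd);
    }
}

It prints mean = 176471 and s.d. = 408, matching the numbers above.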
> > Perhaps this is what you are discussing, but if (generally speaking)
> > requests are supposed to route to the neighboring node with the
> > closest location, to hopefully get closer each time, would it make
> > sense to limit the bloom filter we share to a section of the keyspace
> > we are supposed to specialize in? For example, say my node's location
> > is 0.5; I have neighbors expecting me to have data close to 0.5. Do
> > they care if I have a key for the 0.9 area of the keyspace? They
> > should be checking the bloom filter of their top "n" choices of
> > neighbors for that keyspace for a direct hit. If not, chuck it
> > toward the closest one and move on. The next node in the chain will
> > do the same, hopefully getting closer each time...
> >
> > Might we instead send the bloom filter of "x" portion of our
> > datastore, "x" being either a fixed number of keys near our
> > specialization or a percentage of our datastore in each direction?
> > Even if we just send the closest 25%-50%, it should provide an
> > improvement in speed at a cost/benefit/risk ratio that would be
> > easier than full bloom filter sharing...
> >
> > This would reduce the amount of bloom filter data needing to be
> > shared, while limiting exposure of the fact that we have received an
> > insert.
>
> Several thoughts in no particular order...
>
> I see two basic ways to do it. Split fairly with a hash function, or
> split on actual keyspace. If splitting on actual keyspace, the obvious
> way to do it is to add keys to a filter until it gets overly full, then
> split it in half and rescan the datastore. That results in filters that
> are (on average) 3/4 full; the hash approach results in filters that
> are almost completely full. The memory savings aren't huge, but the
> disk io difference might be.
>
> Splitting on normal keyspace means we can tell our neighbors about
> selected portions of the keyspace. That may improve normal routing.
> Telling them about the "wrong" portion of the keyspace might make old
> keys in the wrong spot, which would otherwise seem to have fallen out
> of the network, more available. Which is better is not clear to me,
> though I suspect sending info about the keyspace near the node is.
> (Though should we send info about the keyspace near us, or near the
> node we're sending to? They're usually but not always similar.)
>
> Clearly, the ideal case is that we send all of the bloom filters.
> Therefore it's a question of optimizing startup and short-lived
> connections. I also note that a node should *always* check keys it
> gets requests for against every potentially applicable filter (not
> just the top n), even if it belongs to a node in the wrong portion of
> the keyspace -- checking memory is vastly faster than network io!

Agreed! On darknet there is no routing problem, but as you say, there may
be a caching-in-the-wrong-place problem ... though imho that is better
than not finding the data at all. Currently 0.5% of requests get turned
into inserts, which will help put data back where it is supposed to be;
we could maybe make this selective somehow. But it is clear that if we
can find data with less network effort, that's a *good* thing...
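To make the "always check every potentially applicable filter" point
concrete, here is a rough sketch of the shape I have in mind (hypothetical
names, not the real fred classes): pick the chunk by hashing the routing
key, as in Robert's pseudocode, then probe that chunk for every connected
peer before doing any network io.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PeerFilterSketch {
    interface BloomChunk { boolean mightContain(byte[] key); }

    static class PeerFilters {
        String peer;          // peer identity (placeholder)
        BloomChunk[] chunks;  // one bloom filter per hashed-keyspace chunk
    }

    // H(key, N_CHUNKS) => n, as in Robert's pseudocode; any well-mixing
    // hash will do, Arrays.hashCode is only a stand-in here.
    static int chunkIndex(byte[] routingKey, int nChunks) {
        return Math.floorMod(Arrays.hashCode(routingKey), nChunks);
    }

    // Return every peer whose advertised filters say it might hold the key.
    static List<String> candidates(byte[] key, List<PeerFilters> peers) {
        List<String> hits = new ArrayList<>();
        for (PeerFilters p : peers) {
            int n = chunkIndex(key, p.chunks.length);
            if (p.chunks[n] != null && p.chunks[n].mightContain(key))
                hits.add(p.peer);  // pure memory check, no network io
        }
        return hits;
    }
}

The point is only the shape of the loop: skipping a peer that actually has
the key costs us a network hop, while the extra in-memory probes cost
almost nothing.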
>
> Lastly, re: my standard deviation formula above, it can be
> approximated as s.d. = sqrt(number of keys in filter). (Simplifying
> with 1 - 1/n = 1; we have many bloom filters, right?)
>
> Evan Daniel
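For the sizes above that shortcut looks fine, by the way: sqrt(176470) is
about 420, versus the exact 408, a difference of roughly 3%, which is well
within the margin we would leave by slightly underfilling the filters
anyway.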
