On Fri, Jun 19, 2009 at 9:15 PM, Juiceman<juiceman69 at gmail.com> wrote:
> On Fri, Jun 19, 2009 at 1:50 PM, Evan Daniel<evanbd at gmail.com> wrote:
>> On Fri, Jun 19, 2009 at 1:36 PM, Matthew
>> Toseland<toad at amphibian.dyndns.org> wrote:
>>> On Friday 19 June 2009 17:37:00 Robert Hailey wrote:
>>>>
>>>> On Jun 18, 2009, at 9:07 PM, Matthew Toseland wrote:
>>>>
>>>> > CHUNK SIZE PROBLEM:
>>>> > ===================
>>>> >
>>>> > Current plans call to split the keyspace up for each datastore, and
>>>> > assign keys to a manageable sized bloom filter for a section of the
>>>> > (hashed) keyspace. These can then be transferred separately, are
>>>> > useful immediately after being transferred, can be juggled in
>>>> > memory, etc. However, we cannot guarantee that the population of such
>>>> > a chunk will be small enough for the filter to be effective. Solving
>>>> > this requires either moving the boundaries dynamically (possibly
>>>> > continually), or making the chunks significantly larger than they
>>>> > would need to be in an ideal world.
>>>>
>>>> Right... but I believe we prescribed that the split would be based on
>>>> a hashed value of the key and not by logical keyspace location to
>>>> avoid disproportionate chunks.
>>>>
>>>> That is to say... ideally a node is going to get a disproportionate
>>>> amount of cache/store data about its network location.
>>>>
>>>> STORE_SIZE/MAX_BLOOM_CHUNK -> N_CHUNKS
>>>> H(key, N_CHUNKS) => n & (0 < n < N)
>>>> CHUNK[n].add(key)
>>>>
>>>> Wouldn't the problem be reduced to finding a well-scattering hash
>>>> function then?
>>>
>>> Yes and no. How much variation will we have even if we divide by hashed 
>>> keyspace? Hence how much bigger than the ideal splitting size do we need 
>>> each chunk to be to maintain approximately the right false positives ratio?
>>
>> If the hash spreads things well, number of keys in a single bloom
>> filter should be normally distributed with mean ((total keys) /
>> (number of filters)) and standard deviation sqrt((total keys) * (1 /
>> (number of filters)) * (1 - 1 / (number of filters))).  (It's a
>> binomial distribution; any given key has a 1/(number of filters) chance
>> of landing in a specific filter.)  For a 100GiB total store, we have
>> 3E6 keys between store and cache (for CHKs, same for SSKs).  That
>> implies 17MiB of bloom filters.  If we size the filters at 1MiB each,
>> with 17 filters, we have 176k keys per filter on average.  From the
>> preceding formula, the standard deviation is 408 keys.
>>
>> Size variation is only a serious concern if the hash function is not
>> distributing the keys at random.  To be safe, we could slightly
>> underfill the filters.
>>
>> Evan Daniel
>>
>
> Perhaps this is what you are discussing, but if (generally speaking)
> requests are supposed to route to the neighboring node with the
> closest location to hopefully get closer each time, would it make
> sense to limit the bloom filter we share to a section of the keyspace
> we are supposed to specialize in?  For example, say my node's location
> is 0.5, I have neighbors expecting me to have data close to 0.5.  Do
> they care if I have a key for the 0.9 area of the keyspace?  They
> should be checking the bloom filters of their top "n" choices of
> neighbors for that keyspace for a direct hit.  If not, chuck it
> toward the closest one and move on.  The next node in the
> chain will do the same, hopefully getting closer each time...
>
> Might we instead send the bloom filter for an "x" portion of our
> datastore, "x" being either a fixed number of keys near our
> specialization or a percentage of our datastore in each direction?  Even
> if we just send the closest 25%-50%, it should provide an improvement
> in speed at a cost/benefit/risk ratio that would be easier to accept
> than full bloom filter sharing...
>
> This would reduce the amount of bloom filter data needing to be shared
> while limiting exposing that we have received an insert.
>

Several thoughts in no particular order...

I see two basic ways to do it: split fairly with a hash function, or
split on actual keyspace.  If splitting on actual keyspace, the
obvious way to do it is to add keys to a filter until it gets overly
full, then split it in half and rescan the datastore.  That results in
filters that are (on average) about 3/4 full, since each filter sits
somewhere between half full (just after a split) and full (just before
one); the hash approach results in filters that are almost completely
full.  The memory savings aren't huge, but the disk I/O difference
might be.
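
To make that concrete, here's a minimal sketch of the split-and-rescan
approach (illustrative only: the class name and constants are made up,
and a plain HashSet stands in for the real bloom filter and datastore
scan):

import java.util.*;

// Each "filter" owns a contiguous slice of the keyspace.  When one gets
// overly full, we halve its slice and rebuild both halves by rescanning
// the datastore.
public class KeyspaceSplitFilters {
    static final int MAX_KEYS_PER_FILTER = 4; // tiny, for demonstration

    // Map from the lower bound of a keyspace range to the keys in that
    // range; ranges run [lowerBound, nextLowerBound).
    final TreeMap<Double, Set<Double>> filters = new TreeMap<>();
    final List<Double> datastore = new ArrayList<>(); // all stored key locations

    KeyspaceSplitFilters() {
        filters.put(0.0, new HashSet<>()); // one filter for [0.0, 1.0) to start
    }

    void addKey(double location) {
        datastore.add(location);
        Map.Entry<Double, Set<Double>> e = filters.floorEntry(location);
        e.getValue().add(location);
        if (e.getValue().size() > MAX_KEYS_PER_FILTER)
            split(e.getKey());
    }

    // Split the range starting at 'lower' in half and rescan the
    // datastore to rebuild both halves.
    void split(double lower) {
        Double next = filters.higherKey(lower);
        double upper = (next == null) ? 1.0 : next;
        double mid = (lower + upper) / 2;
        Set<Double> lo = new HashSet<>(), hi = new HashSet<>();
        for (double k : datastore) {
            if (k >= lower && k < mid) lo.add(k);
            else if (k >= mid && k < upper) hi.add(k);
        }
        filters.put(lower, lo);
        filters.put(mid, hi);
    }

    public static void main(String[] args) {
        KeyspaceSplitFilters f = new KeyspaceSplitFilters();
        Random r = new Random(42);
        for (int i = 0; i < 40; i++) f.addKey(r.nextDouble());
        // Individual filters end up somewhere between half and completely
        // full; on average roughly 3/4.
        for (Map.Entry<Double, Set<Double>> e : f.filters.entrySet())
            System.out.printf("range starting %.3f: %d keys%n",
                    e.getKey(), e.getValue().size());
    }
}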

Splitting on actual keyspace means we can tell our neighbors about
selected portions of the keyspace.  That may improve normal routing.
Telling them about the "wrong" portion of the keyspace might make old
keys stored in the wrong spot, which would otherwise seem to have
fallen out of the network, more available.  Which is better is not
clear to me, though I suspect sending info about the keyspace near the
node is.  (Though should we send info about the keyspace near us, or
near the node we're sending to?  They're usually but not always similar.)

Clearly, the ideal case is that we send all of the bloom filters.
Therefore it's a question of optimizing startup and short-lived
connections.  I also note that a node should *always* check keys it
gets requests for against every potentially applicable filter (not
just the top n), even if the filter belongs to a node in the wrong
portion of the keyspace -- checking memory is vastly faster than
network I/O!
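
For instance (hypothetical names, nothing like the actual routing code),
the check-everything rule is just a loop over all the peer filters we
hold before making any forwarding decision:

import java.util.*;
import java.util.function.Predicate;

// On an incoming request, test the key against *every* peer filter,
// regardless of which part of the keyspace that peer covers.  The
// in-memory checks cost microseconds; forwarding a request costs a
// network round trip, so one avoided hop pays for all the checks.
public class FilterFirstRouting {
    // One entry per peer: the peer's name and a membership test standing
    // in for its bloom filter (bloom filters may give false positives,
    // never false negatives).
    static final Map<String, Predicate<String>> peerFilters = new HashMap<>();

    static Optional<String> peerLikelyHolding(String key) {
        for (Map.Entry<String, Predicate<String>> e : peerFilters.entrySet())
            if (e.getValue().test(key))         // cheap in-memory check
                return Optional.of(e.getKey()); // try this peer first
        return Optional.empty();                // fall back to normal routing
    }

    public static void main(String[] args) {
        peerFilters.put("peerNear0.5", k -> k.hashCode() % 2 == 0);
        peerFilters.put("peerNear0.9", k -> k.hashCode() % 2 != 0);
        System.out.println(peerLikelyHolding("CHK@example")
                .map(p -> "try " + p + " first")
                .orElse("no filter hit; route by location"));
    }
}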

Lastly, re: my standard deviation formula above, it can be
approximated as s.d. = sqrt(number of keys in filter).  (Simplifying
with 1 - 1/n ~= 1; we have many bloom filters, right?)
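
(Sanity-checking that against the 100GiB example above: the exact value
is sqrt(3E6 * (1/17) * (16/17)) ~= 408 keys, while the approximation
gives sqrt(3E6 / 17) ~= sqrt(176k) ~= 420 keys, about 3% high, which is
close enough for sizing the filters.)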

Evan Daniel
