On Saturday 20 June 2009 03:38:03 Evan Daniel wrote:
> On Fri, Jun 19, 2009 at 9:15 PM, Juiceman<juiceman69 at gmail.com> wrote:
> > On Fri, Jun 19, 2009 at 1:50 PM, Evan Daniel<evanbd at gmail.com> wrote:
> >> On Fri, Jun 19, 2009 at 1:36 PM, Matthew
> >> Toseland<toad at amphibian.dyndns.org> wrote:
> >>> On Friday 19 June 2009 17:37:00 Robert Hailey wrote:
> >>>>
> >>>> On Jun 18, 2009, at 9:07 PM, Matthew Toseland wrote:
> >>>>
> >>>> > CHUNK SIZE PROBLEM:
> >>>> > ===================
> >>>> >
> >>>> > Current plans call for splitting the keyspace up for each datastore
> >>>> > and assigning keys to a manageably sized bloom filter for a section
> >>>> > of the (hashed) keyspace. These can then be transferred separately,
> >>>> > are useful immediately after being transferred, can be juggled in
> >>>> > memory, etc. However, we cannot guarantee that the population of
> >>>> > such a chunk will be small enough for the filter to be effective.
> >>>> > Solving this requires either moving the boundaries dynamically
> >>>> > (possibly continually), or making the chunks significantly larger
> >>>> > than they would need to be in an ideal world.
> >>>>
> >>>> Right... but I believe we prescribed that the split would be based on
> >>>> a hashed value of the key and not by logical keyspace location, to
> >>>> avoid disproportionate chunks.
> >>>>
> >>>> That is to say... ideally a node is going to get a disproportionate
> >>>> amount of cache/store data about its network location.
> >>>>
> >>>> STORE_SIZE/MAX_BLOOM_CHUNK -> N_CHUNKS
> >>>> H(key, N_CHUNKS) => n & (0 < n < N)
> >>>> CHUNK[n].add(key)
> >>>>
> >>>> Wouldn't the problem be reduced to finding a well-scattering hash
> >>>> function then?
> >>>
> >>> Yes and no. How much variation will we have even if we divide by
> >>> hashed keyspace? Hence how much bigger than the ideal splitting size
> >>> do we need each chunk to be to maintain approximately the right false
> >>> positive ratio?
> >>
> >> If the hash spreads things well, the number of keys in a single bloom
> >> filter should be normally distributed with mean ((total keys) /
> >> (number of filters)) and standard deviation sqrt((total keys) * (1 /
> >> (number of filters)) * (1 - 1 / (number of filters))). (It's a
> >> binomial distribution; any given key has a 1/(number of filters)
> >> chance of landing in a specific filter.) For a 100GiB total store, we
> >> have 3E6 keys between store and cache (for CHKs, same for SSKs). That
> >> implies 17MiB of bloom filters. If we size the filters at 1MiB each,
> >> with 17 filters, we have 176k keys per filter on average. From the
> >> preceding formula, the standard deviation is 408 keys.
> >>
> >> Size variation is only a serious concern if the hash function is not
> >> distributing the keys at random. To be safe, we could slightly
> >> underfill the filters.
> >>
> >> Evan Daniel
> >> _______________________________________________
> >> Devl mailing list
> >> Devl at freenetproject.org
> >> http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
> >>
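For anyone who wants to reproduce those figures, here is a throwaway
back-of-the-envelope check (plain Java, nothing to do with the actual fred
code; the constants are just Evan's estimates above):

// Each filter's population is binomial(totalKeys, 1/numFilters) if the
// hash scatters keys uniformly across the chunks.
public class FilterSizingCheck {
    public static void main(String[] args) {
        double totalKeys = 3e6;   // ~100GiB store+cache worth of CHKs (Evan's estimate)
        int numFilters = 17;      // 1MiB filters, ~17MiB of filters in total
        double p = 1.0 / numFilters;
        double mean = totalKeys * p;                        // ~176k keys per filter
        double sd = Math.sqrt(totalKeys * p * (1.0 - p));   // ~408 keys
        System.out.printf("mean = %.0f keys/filter, s.d. = %.0f keys%n", mean, sd);
    }
}

It prints mean = 176471 and s.d. = 408, matching the numbers above.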
> > Perhaps this is what you are discussing, but if (generally speaking)
> > requests are supposed to route to the neighboring node with the
> > closest location, to hopefully get closer each time, would it make
> > sense to limit the bloom filter we share to a section of the keyspace
> > we are supposed to specialize in? For example, say my node's location
> > is 0.5; I have neighbors expecting me to have data close to 0.5. Do
> > they care if I have a key for the 0.9 area of the keyspace? They
> > should be checking the bloom filter of their top "n" choices of
> > neighbors for that keyspace for a direct hit. If not, chuck it
> > toward the closest one and move on. The next node in the chain will
> > do the same, hopefully getting closer each time...
> >
> > Might we instead send the bloom filter of "x" portion of our
> > datastore, "x" being either a fixed number of keys near our
> > specialization or a percentage of our datastore in each direction?
> > Even if we just send the closest 25%-50%, it should provide an
> > improvement in speed at a cost/benefit/risk ratio that would be
> > easier than full bloom filter sharing...
> >
> > This would reduce the amount of bloom filter data needing to be
> > shared, while limiting exposure of the fact that we have received an
> > insert.
>
> Several thoughts in no particular order...
>
> I see two basic ways to do it. Split fairly with a hash function, or
> split on actual keyspace. If splitting on actual keyspace, the obvious
> way to do it is to add keys to a filter until it gets overly full, then
> split it in half and rescan the datastore. That results in filters that
> are (on average) 3/4 full; the hash approach results in filters that
> are almost completely full. The memory savings aren't huge, but the
> disk io difference might be.
>
> Splitting on normal keyspace means we can tell our neighbors about
> selected portions of the keyspace. That may improve normal routing.
> Telling them about the "wrong" portion of the keyspace might make old
> keys in the wrong spot, which would otherwise seem to have fallen out
> of the network, more available. Which is better is not clear to me,
> though I suspect sending info about the keyspace near the node is.
> (Though should we send info about the keyspace near us, or near the
> node we're sending to? They're usually but not always similar.)
>
> Clearly, the ideal case is that we send all of the bloom filters.
> Therefore it's a question of optimizing startup and short-lived
> connections. I also note that a node should *always* check keys it
> gets requests for against every potentially applicable filter (not
> just the top n), even if it belongs to a node in the wrong portion of
> the keyspace -- checking memory is vastly faster than network io!

Agreed! On darknet there is no routing problem, but as you say, there may
be a caching-in-the-wrong-place problem ... though imho that is better
than not finding the data at all. Currently 0.5% of requests get turned
into inserts, which will help put data back where it is supposed to be;
we could maybe make this selective somehow. But it is clear that if we
can find data with less network effort, that's a *good* thing...
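To make the "always check every potentially applicable filter" point
concrete, here is a rough sketch of the shape I have in mind (hypothetical
names, not the real fred classes): pick the chunk by hashing the routing
key, as in Robert's pseudocode, then probe that chunk for every connected
peer before doing any network io.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PeerFilterSketch {
    interface BloomChunk { boolean mightContain(byte[] key); }

    static class PeerFilters {
        String peer;          // peer identity (placeholder)
        BloomChunk[] chunks;  // one bloom filter per hashed-keyspace chunk
    }

    // H(key, N_CHUNKS) => n, as in Robert's pseudocode; any well-mixing
    // hash will do, Arrays.hashCode is only a stand-in here.
    static int chunkIndex(byte[] routingKey, int nChunks) {
        return Math.floorMod(Arrays.hashCode(routingKey), nChunks);
    }

    // Return every peer whose advertised filters say it might hold the key.
    static List<String> candidates(byte[] key, List<PeerFilters> peers) {
        List<String> hits = new ArrayList<>();
        for (PeerFilters p : peers) {
            int n = chunkIndex(key, p.chunks.length);
            if (p.chunks[n] != null && p.chunks[n].mightContain(key))
                hits.add(p.peer);  // pure memory check, no network io
        }
        return hits;
    }
}

The point is only the shape of the loop: skipping a peer that actually has
the key costs us a network hop, while the extra in-memory probes cost
almost nothing.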
>
> Lastly, re: my standard deviation formula above, it can be
> approximated as s.d. = sqrt(number of keys in filter). (Simplifying
> with 1 - 1/n = 1; we have many bloom filters, right?)
>
> Evan Daniel
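For the sizes above that shortcut looks fine, by the way: sqrt(176470) is
about 420, versus the exact 408, a difference of roughly 3%, which is well
within the margin we would leave by slightly underfilling the filters
anyway.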
