On Saturday 20 June 2009 02:15:44 Juiceman wrote:
> On Fri, Jun 19, 2009 at 1:50 PM, Evan Daniel<evanbd at gmail.com> wrote:
> > On Fri, Jun 19, 2009 at 1:36 PM, Matthew
> > Toseland<toad at amphibian.dyndns.org> wrote:
> >> On Friday 19 June 2009 17:37:00 Robert Hailey wrote:
> >>>
> >>> On Jun 18, 2009, at 9:07 PM, Matthew Toseland wrote:
> >>>
> >>> > CHUNK SIZE PROBLEM:
> >>> > ===================
> >>> >
> >>> > Current plans call for splitting the keyspace up for each datastore, and
> >>> > assigning keys to a manageably sized bloom filter for each section of the
> >>> > (hashed) keyspace. These can then be transferred separately, are
> >>> > useful immediately after being transferred, can be juggled in
> >>> > memory, etc. However, we cannot guarantee that the population of such
> >>> > a chunk will be small enough for the filter to be effective. Solving
> >>> > this requires either moving the boundaries dynamically (possibly
> >>> > continually), or making the chunks significantly larger than they
> >>> > would need to be in an ideal world.
> >>>
> >>> Right... but I believe we prescribed that the split would be based on
> >>> a hashed value of the key and not on logical keyspace location, precisely
> >>> to avoid disproportionate chunks.
> >>>
> >>> That is to say... ideally a node is going to get a disproportionate
> >>> amount of cache/store data around its network location.
> >>>
> >>> STORE_SIZE / MAX_BLOOM_CHUNK -> N_CHUNKS
> >>> H(key, N_CHUNKS) -> n, where 0 <= n < N_CHUNKS
> >>> CHUNK[n].add(key)
> >>>
> >>> Wouldn't the problem be reduced to finding a well-scattering hash
> >>> function then?
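
For concreteness, a minimal sketch of that chunk assignment in Java. The class
and method names here are made up for illustration, not the actual datastore
code, and SHA-256 is just an assumed "well-scattering" hash:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the chunk assignment described above. Names are illustrative,
// not the real Freenet classes.
public class ChunkAssignment {

    // Number of bloom filter chunks, e.g. (store capacity in keys) /
    // (max keys per chunk), rounded up.
    private final int nChunks;

    public ChunkAssignment(long storeSizeInKeys, long maxKeysPerChunk) {
        this.nChunks = (int) ((storeSizeInKeys + maxKeysPerChunk - 1) / maxKeysPerChunk);
    }

    // Map a routing key to a chunk index n, 0 <= n < nChunks. Re-hashing
    // the key spreads entries evenly over the chunks regardless of where
    // the node specialises in the keyspace.
    public int chunkFor(byte[] routingKey) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256").digest(routingKey);
            // Take the top 4 bytes as an unsigned value and reduce mod nChunks.
            long top = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                     | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
            return (int) (top % nChunks);
        } catch (NoSuchAlgorithmException e) {
            throw new Error(e); // SHA-256 is always available
        }
    }
}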
> >>
> >> Yes and no. How much variation will we have even if we divide by hashed
> >> keyspace? Hence, how much bigger than the ideal splitting size does each
> >> chunk need to be to maintain approximately the right false-positive rate?
> >
> > If the hash spreads things well, the number of keys in a single bloom
> > filter should be approximately normally distributed, with mean ((total keys) /
> > (number of filters)) and standard deviation sqrt((total keys) * (1 /
> > (number of filters)) * (1 - 1 / (number of filters))).  (It's a
> > binomial distribution; any given key has a 1 / (number of filters) chance
> > of landing in a specific filter.)  For a 100GiB total store, we have
> > 3E6 keys between store and cache (for CHKs; the same again for SSKs).  That
> > implies 17MiB of bloom filters.  If we size the filters at 1MiB each,
> > with 17 filters, we have 176k keys per filter on average.  From the
> > preceding formula, the standard deviation is 408 keys.
> >
> > Size variation is only a serious concern if the hash function is not
> > distributing the keys at random.  To be safe, we could slightly
> > underfill the filters.
> >
> > Evan Daniel
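
A quick sanity check of Evan's figures, in plain Java (just the binomial mean
and standard deviation; the input numbers are taken straight from his post):

// Sanity check of the filter sizing figures above: binomial mean and
// standard deviation for keys landing in one of several filters.
public class FilterSizing {
    public static void main(String[] args) {
        double totalKeys = 3e6;  // ~3E6 CHKs between store and cache (100GiB node)
        double nFilters  = 17;   // 17 filters of 1MiB each

        double p      = 1.0 / nFilters;
        double mean   = totalKeys * p;                      // ~176k keys per filter
        double stddev = Math.sqrt(totalKeys * p * (1 - p)); // ~408 keys

        System.out.printf("mean = %.0f keys/filter, stddev = %.0f keys%n",
                          mean, stddev);
    }
}

That agrees with the 176k and 408 figures above, so as long as the hash really
does scatter keys, the per-chunk variation is tiny.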

> Perhaps this is what you are discussing, but if (generally speaking)
> requests are supposed to route to the neighboring node with the
> closest location, hopefully getting closer each time, would it make
> sense to limit the bloom filter we share to the section of the keyspace
> we are supposed to specialize in?  For example, say my node's location
> is 0.5; my neighbors expect me to have data close to 0.5.  Do
> they care if I have a key from the 0.9 area of the keyspace?  They
> should be checking the bloom filters of their top "n" choices of
> neighbors for that part of the keyspace for a direct hit.  If not, chuck
> the request toward the closest one and move on.  The next node in the
> chain will do the same, hopefully getting closer each time...

No, IMHO it would not make sense. If we are going to route there anyway, then 
why share the filters? Bloom filter sharing is good for:
1) Finding popular content *really* fast by seeing it on a nearby node.
2) Finding rare content on a node that is in the approximate region where the 
data should be, but is not our best choice.

Both can save a lot of hops.
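
In code terms the routing change is roughly the following -- a sketch only,
with made-up class names (PeerNode, KeyFilter), not the real routing code:

import java.util.List;

// Sketch: a shared filter lets us short-circuit normal closest-location
// routing when some peer plausibly already has the key.
public class FilterAwareRouting {

    interface KeyFilter { boolean mightContain(byte[] routingKey); }

    static class PeerNode {
        double location;        // position on the [0,1) keyspace circle
        KeyFilter sharedFilter; // null if this peer shared no filter
    }

    // Circular distance on the [0,1) keyspace.
    static double distance(double a, double b) {
        double d = Math.abs(a - b);
        return Math.min(d, 1.0 - d);
    }

    // Prefer a peer whose filter claims to hold the key (cases 1 and 2
    // above); otherwise fall back to routing by closest location.
    // Returns null only if we have no peers at all.
    static PeerNode pickPeer(byte[] routingKey, double target, List<PeerNode> peers) {
        PeerNode closest = null;
        for (PeerNode p : peers) {
            if (p.sharedFilter != null && p.sharedFilter.mightContain(routingKey))
                return p;
            if (closest == null
                    || distance(p.location, target) < distance(closest.location, target))
                closest = p;
        }
        return closest;
    }
}

The real decision would obviously also have to account for false positives and
per-peer load; this only shows where the filter check fits in.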
> 
> Might we instead send a bloom filter for an "x" portion of our
> datastore, "x" being either a fixed number of keys near our
> specialization or a percentage of our datastore in each direction?  Even
> if we just send the closest 25%-50%, it should provide an improvement
> in speed, with a cost/benefit/risk trade-off easier to accept than full
> bloom filter sharing...
> 
> This would reduce the amount of bloom filter data needing to be shared,
> while limiting how much we reveal about which inserts we have received.
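
For reference, one reading of that proposal -- a fixed keyspace window around
our own location, rather than a fixed number of keys -- would look roughly like
this (names and the window size are placeholders, not real code):

import java.util.List;

// Sketch of partial filter sharing: only keys whose location falls within
// a window around our own location go into the filter we share.
public class PartialFilterSharing {

    interface KeyFilter { void add(byte[] routingKey); }

    static class StoredKey {
        double location;   // key's position on the [0,1) keyspace circle
        byte[] routingKey;
    }

    // Circular distance on the [0,1) keyspace.
    static double distance(double a, double b) {
        double d = Math.abs(a - b);
        return Math.min(d, 1.0 - d);
    }

    // fraction = 0.25 would mean "share the closest 25% of the keyspace",
    // i.e. keys within 0.125 of our location on either side.
    static void buildSharedFilter(KeyFilter filter, double ourLocation,
                                  double fraction, List<StoredKey> store) {
        double radius = fraction / 2.0;
        for (StoredKey k : store)
            if (distance(k.location, ourLocation) <= radius)
                filter.add(k.routingKey);
    }
}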