IMHO all the evidence points to Bloom filter sharing as *substantially* 
improving both speed and data reachability. I wonder whether we could 
implement it quickly: either before 0.8, between 0.8-beta1 and 0.8 final, or 
in a quick 0.8.1/0.9? The main issues:

THE NUMBERS:
The salted-hash datastore by default uses 1/2048th of the store size for its 
Bloom filters. This works out to a 23-slot filter with 2-bit counting, i.e. 
46 bits per key in the store (SSK, CHK or pubkey; store or cache). The 
easiest change would be to reduce this to a 1-bit filter, cutting its size in 
half. Also, we don't need to share the pubkey filters. So the filter we share 
with our peers would be roughly 1/6000th of the size of our datastore, i.e. 
about 18MB per 100GB of datastore. We could halve this again, but then the 
average failing request would be diverted about 2.5 times to nodes whose 
filters match but which don't actually have the data (24 hops * 20 peers = 
480 bloom checks at roughly 0.5% false positives, versus a negligible number 
at the current ~0.005%); each time, the request has to wait.

So if we have 20 peers each with a 500GB store, that gives 1.8GB of Bloom 
filters! On the other hand, if we have 20 peers each with a 10GB store, that 
gives 36MB of Bloom filters. Clearly the latter is acceptable; the former 
probably isn't for most nodes. Also note that for acceptable performance it 
is essential that the Bloom filters for all our online peers fit into RAM. So 
we are going to have to set an upper limit on the total memory we use for 
Bloom filters, and drop the biggest filters until we are back under that 
limit.
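
To sanity-check the arithmetic, here is a small Java sketch; the 1/6000 
fraction, the hop and peer counts, and the false-positive rates are the rough 
figures from this mail, not constants taken from the Freenet source, so the 
outputs only approximately match the rounded numbers above.

public class BloomSizing {

    // Shared filter ~= 1/6000th of the datastore: a 1-bit filter (half the
    // default 2-bit counting filter), with the pubkey filters not shared.
    // Figure taken from the estimate above, not from the Freenet source.
    static final double SHARED_FRACTION = 1.0 / 6000;

    /** Approximate shared-filter size in MiB for a store of storeGiB gigabytes. */
    static double filterMiB(double storeGiB) {
        return storeGiB * 1024 * SHARED_FRACTION;
    }

    public static void main(String[] args) {
        System.out.printf("100GB store      -> ~%.0fMB filter%n", filterMiB(100));
        System.out.printf("20 peers x 500GB -> ~%.1fGB of filters%n",
                20 * filterMiB(500) / 1024);
        System.out.printf("20 peers x 10GB  -> ~%.0fMB of filters%n",
                20 * filterMiB(10));

        // Cost of halving the shared filter again: expected wasted diversions
        // per failing request = hops * peers * false-positive rate.
        int bloomChecks = 24 * 20; // 480
        System.out.printf("At ~0.5%% FP:   %.1f wasted diversions per failing request%n",
                bloomChecks * 0.005);
        System.out.printf("At ~0.005%% FP: %.2f wasted diversions per failing request%n",
                bloomChecks * 0.00005);
    }
}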

Say we set the Bloom filter limit to 128MB. Then a typical node would have:
192MB - default heap limit, for Freenet itself
25MB - Bloom filter for a 50GB local datastore
128MB - default limit for remote Bloom filters, which allows our peers around 
700GB of total store size

That's a total of around 345MB, plus Java overheads. Is this sort of number 
acceptable?
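
For what it's worth, a minimal sketch of how that limit could be enforced by 
dropping the biggest filters first, assuming we track each peer's filter size 
in bytes; the class and method names are hypothetical, not actual Freenet 
code.

import java.util.*;

public class BloomBudget {

    /**
     * Returns the peers whose filters we keep in active use, dropping the
     * largest filters until the total fits under the limit. Dropped peers
     * should be told so they stop sending us updates.
     */
    static Set<String> enforceLimit(Map<String, Long> filterSizes, long limitBytes) {
        List<Map.Entry<String, Long>> bySize = new ArrayList<>(filterSizes.entrySet());
        bySize.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // largest first

        long total = filterSizes.values().stream().mapToLong(Long::longValue).sum();
        Set<String> kept = new HashSet<>(filterSizes.keySet());

        for (Map.Entry<String, Long> entry : bySize) {
            if (total <= limitBytes) break;
            kept.remove(entry.getKey());
            total -= entry.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("peerA", 85L << 20); // ~85MB filter (a ~500GB store)
        sizes.put("peerB", 2L << 20);  // ~2MB filter  (a ~10GB store)
        sizes.put("peerC", 60L << 20); // ~60MB filter
        // 147MB total > 128MB limit, so the biggest filter (peerA) is dropped.
        System.out.println(enforceLimit(sizes, 128L << 20));
    }
}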

SECURING IT:
Giving an index of our datastore to our opennet peers, while storing all 
locally requested data in our datastore, would seem suicidal from a security 
perspective. We need to stop caching locally requested data. How?

The best answer at the moment is to implement encrypted rendezvous tunnels. 
IMHO this will take considerable time to get right, especially given that a 
single tunnel is highly unlikely to provide good performance.

But I wonder if a half-way approach might be sufficient for now, letting us 
avoid caching locally without reducing security overall:
- While the HTL is at maximum, check the datastore but do not cache returned 
data, and route randomly rather than according to the provided key.
- We already decrement the HTL from the maximum randomly, and decide this once 
per peer.
- We might want to increase the chance of decrementing from its current 25% 
(an average of 4 hops at maximum HTL) to, say, 33% or even higher.

IMHO this would take well under one day to implement. The question is, does it 
provide security *NO WORSE THAN* the current situation? IMHO it does. If we 
don't change the decrement probability, then nearby attackers can already see 
the maximum-HTL flag, so we give away nothing new. Plus, random routing and 
routing to random keys are virtually identical for the purposes of 
correlation attacks.
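
To make the half-way approach concrete, here is a rough sketch of the 
per-peer decrement decision and the no-cache/random-routing rules at maximum 
HTL; the maximum HTL of 18 and all the names here are assumptions for 
illustration, not the actual request-handling code.

import java.util.Random;

// Illustrative only: constants and names are assumptions, not Freenet code.
public class HtlPolicy {

    static final short MAX_HTL = 18;             // assumed default maximum HTL
    static final double DECREMENT_AT_MAX = 0.25; // maybe raise to 0.33 or higher

    // Decided once per peer, as described above.
    private final boolean decrementAtMax;

    HtlPolicy(Random random) {
        this.decrementAtMax = random.nextDouble() < DECREMENT_AT_MAX;
    }

    /** HTL to forward a request with, given the HTL it arrived with. */
    short nextHtl(short incomingHtl) {
        if (incomingHtl >= MAX_HTL && !decrementAtMax) return MAX_HTL;
        return (short) (incomingHtl - 1);
    }

    /** While the HTL is at maximum, check the store but do not cache returned data. */
    boolean cacheReturnedData(short htl) {
        return htl < MAX_HTL;
    }

    /** While the HTL is at maximum, route randomly rather than towards the key. */
    boolean routeRandomly(short htl) {
        return htl >= MAX_HTL;
    }
}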

CHURN IMPACT:
The main performance worry about Bloom filter sharing is that it won't work 
on opennet. We only want to share our Bloom filters with long-lived peers, 
because there is some cost both to transferring them in the first place and 
to keeping them up to date. This is really an unknown at this point: what 
proportion of our opennet peers will be long-lived enough that sharing our 
Bloom filters with them is worthwhile?

IMPLEMENTING IT:
Main tasks:
- Converting our datastore indexes into a compact, slightly more lossy, 1-bit 
filter (see the sketch after this list). (Easy)
- Creating a snapshot once an hour, and keeping it for some period (1 week?). 
(Easy)
- Creating an efficient binary diff format for hour-to-hour diffs, and keeping 
them. (Moderate)
- Sending our filters to our darknet peers. (Easy)
- Updating our darknet peers once an hour with the diffs we create anyway. 
(Easy)
- Recording what version of the indexes we have sent to each darknet peer. 
(Easy)
- Updating our darknet peers when they reconnect after downtime with all the 
diffs they missed. (Easy)
- Using any filters we have to send GetOfferedKey-like requests for the data, 
and handling such requests. (Easy)
- Tracking success rates for bloom-based fetches. (Optional, can be done 
later) (Easy)
- Tracking a large number of opennet peers which we have connected to over the 
last week, recording their total uptime over the week. (Moderate)
- Calculating which have sufficient uptime for sharing bloom filters to be 
worthwhile. (Easy)
- Sending those nodes the bloom filters, and keeping them up to date, and 
recording what version we sent them last. (Easy)
- Limiting the total memory usage for Bloom filters, by adding it up and 
dropping the biggest filters from active use until we're under the limit, and 
telling the affected peers so they don't send us updates. (Moderate)
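
On the first task, a minimal sketch of the conversion, assuming the counting 
filter packs four 2-bit counters per byte, low bits first (the real on-disk 
layout may differ):

// Flatten the local 2-bit counting filter into the 1-bit filter we would
// share with peers. The packing is an assumption, not the actual format.
public class FilterFlattener {

    /** One bit per counter in the output, set iff that counter is non-zero. */
    static byte[] flatten(byte[] countingFilter) {
        int slots = countingFilter.length * 4;     // four 2-bit counters per byte
        byte[] oneBit = new byte[(slots + 7) / 8]; // roughly half the input size
        for (int slot = 0; slot < slots; slot++) {
            int counter = (countingFilter[slot / 4] >> ((slot % 4) * 2)) & 0x3;
            if (counter != 0)
                oneBit[slot / 8] |= 1 << (slot % 8);
        }
        return oneBit;
    }
}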