IMHO all the evidence points to Bloom filter sharing as *substantially* 
improving both speed and data reachability. I wonder whether we could 
implement it quickly: either before 0.8, between 0.8-beta1 and 0.8 final, or 
in a quick 0.8.1/0.9? The main issues:

THE NUMBERS:
The salted-hash datastore by default uses 1/2048th of the store size for its 
Bloom filters. This works out to a 23-slot filter with 2-bit counting, i.e. 
46 bits per key in the store (SSK, CHK or pubkey; store or cache). The 
easiest change would be to reduce this to a 1-bit filter, cutting its size in 
half. Also, we don't need to share the pubkey filters. So the filter we share 
with our peers would be roughly 1/6000th of the size of our datastore, i.e. 
about 18MB per 100GB of datastore. We could halve this again, but then the 
average failing request would be diverted about 2.5 times to nodes whose 
filters match but which don't actually have the data (24 hops * 20 peers = 
480 bloom checks at roughly 0.5% false positives, versus a negligible number 
at the current ~0.005%); each time, the request has to wait.

So if we have 20 peers each with a 500GB store, that gives 1.8GB of Bloom 
filters! On the other hand, if we have 20 peers each with a 10GB store, that 
gives 36MB of Bloom filters. Clearly the latter is acceptable; the former 
probably isn't for most nodes. Also note that for acceptable performance it 
is essential that the Bloom filters for all our online peers fit into RAM. So 
we are going to have to set an upper limit on the total memory we use for 
Bloom filters, and drop the biggest filters until we are back under that 
limit.
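
To sanity-check the arithmetic, here is a small Java sketch; the 1/6000 
fraction, the hop and peer counts, and the false-positive rates are the rough 
figures from this mail, not constants taken from the Freenet source, so the 
outputs only approximately match the rounded numbers above.

public class BloomSizing {

    // Shared filter ~= 1/6000th of the datastore: a 1-bit filter (half the
    // default 2-bit counting filter), with the pubkey filters not shared.
    // Figure taken from the estimate above, not from the Freenet source.
    static final double SHARED_FRACTION = 1.0 / 6000;

    /** Approximate shared-filter size in MiB for a store of storeGiB gigabytes. */
    static double filterMiB(double storeGiB) {
        return storeGiB * 1024 * SHARED_FRACTION;
    }

    public static void main(String[] args) {
        System.out.printf("100GB store      -> ~%.0fMB filter%n", filterMiB(100));
        System.out.printf("20 peers x 500GB -> ~%.1fGB of filters%n",
                20 * filterMiB(500) / 1024);
        System.out.printf("20 peers x 10GB  -> ~%.0fMB of filters%n",
                20 * filterMiB(10));

        // Cost of halving the shared filter again: expected wasted diversions
        // per failing request = hops * peers * false-positive rate.
        int bloomChecks = 24 * 20; // 480
        System.out.printf("At ~0.5%% FP:   %.1f wasted diversions per failing request%n",
                bloomChecks * 0.005);
        System.out.printf("At ~0.005%% FP: %.2f wasted diversions per failing request%n",
                bloomChecks * 0.00005);
    }
}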

Say we set the Bloom filter limit to 128MB. Then a typical node would have:
192MB - default heap limit, for Freenet itself
25MB - Bloom filter for a 50GB local datastore
128MB - default limit for remote Bloom filters, which allows our peers around 
700GB of total store size

That's a total of around 345MB, plus Java overheads. Is this sort of number 
acceptable?
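
For what it's worth, a minimal sketch of how that limit could be enforced by 
dropping the biggest filters first, assuming we track each peer's filter size 
in bytes; the class and method names are hypothetical, not actual Freenet 
code.

import java.util.*;

public class BloomBudget {

    /**
     * Returns the peers whose filters we keep in active use, dropping the
     * largest filters until the total fits under the limit. Dropped peers
     * should be told so they stop sending us updates.
     */
    static Set<String> enforceLimit(Map<String, Long> filterSizes, long limitBytes) {
        List<Map.Entry<String, Long>> bySize = new ArrayList<>(filterSizes.entrySet());
        bySize.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // largest first

        long total = filterSizes.values().stream().mapToLong(Long::longValue).sum();
        Set<String> kept = new HashSet<>(filterSizes.keySet());

        for (Map.Entry<String, Long> entry : bySize) {
            if (total <= limitBytes) break;
            kept.remove(entry.getKey());
            total -= entry.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("peerA", 85L << 20); // ~85MB filter (a ~500GB store)
        sizes.put("peerB", 2L << 20);  // ~2MB filter  (a ~10GB store)
        sizes.put("peerC", 60L << 20); // ~60MB filter
        // 147MB total > 128MB limit, so the biggest filter (peerA) is dropped.
        System.out.println(enforceLimit(sizes, 128L << 20));
    }
}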

SECURING IT:
Giving an index of our datastore to our opennet peers, while storing all 
locally requested data in our datastore, would seem suicidal from a security 
perspective. We need to stop caching locally requested data. How?

The best answer at the moment is to implement encrypted rendezvous tunnels. 
IMHO this will take considerable time to get right, especially given that a 
single tunnel is highly unlikely to provide good performance.

But I wonder if a half-way approach might be sufficient for now, letting us 
avoid caching locally without reducing security overall:
- While the HTL is at maximum, check the datastore but do not cache returned 
data, and route randomly rather than according to the provided key.
- We already decrement the HTL from the maximum randomly, and decide this once 
per peer.
- We might want to increase the chance of decrementing from its current 25% 
(an average of 4 hops at maximum HTL) to, say, 33% or even higher.

IMHO this would take well under one day to implement. The question is, does it 
provide security *NO WORSE THAN* the current situation? IMHO it does. If we 
don't change the decrement probability, then nearby attackers can already see 
the maximum-HTL flag, so we give away nothing new. Plus, random routing and 
routing to random keys are virtually identical for the purposes of 
correlation attacks.
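
To make the half-way approach concrete, here is a rough sketch of the 
per-peer decrement decision and the no-cache/random-routing rules at maximum 
HTL; the maximum HTL of 18 and all the names here are assumptions for 
illustration, not the actual request-handling code.

import java.util.Random;

// Illustrative only: constants and names are assumptions, not Freenet code.
public class HtlPolicy {

    static final short MAX_HTL = 18;             // assumed default maximum HTL
    static final double DECREMENT_AT_MAX = 0.25; // maybe raise to 0.33 or higher

    // Decided once per peer, as described above.
    private final boolean decrementAtMax;

    HtlPolicy(Random random) {
        this.decrementAtMax = random.nextDouble() < DECREMENT_AT_MAX;
    }

    /** HTL to forward a request with, given the HTL it arrived with. */
    short nextHtl(short incomingHtl) {
        if (incomingHtl >= MAX_HTL && !decrementAtMax) return MAX_HTL;
        return (short) (incomingHtl - 1);
    }

    /** While the HTL is at maximum, check the store but do not cache returned data. */
    boolean cacheReturnedData(short htl) {
        return htl < MAX_HTL;
    }

    /** While the HTL is at maximum, route randomly rather than towards the key. */
    boolean routeRandomly(short htl) {
        return htl >= MAX_HTL;
    }
}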

CHURN IMPACT:
The main performance worry about Bloom filter sharing is that it won't work 
on opennet. We only want to share our Bloom filters with long-lived peers, 
because there is some cost both to transferring them in the first place and 
to keeping them up to date. This is really an unknown at this point: what 
proportion of our opennet peers will be long-lived enough that sharing our 
Bloom filters with them is worthwhile?

IMPLEMENTING IT:
Main tasks:
- Converting our datastore indexes into a compact, slightly more lossy, 1-bit 
filter (see the sketch after this list). (Easy)
- Creating a snapshot once an hour, and keeping it for some period (1 week?). 
(Easy)
- Creating an efficient binary diff format for hour-to-hour diffs, and keeping 
them. (Moderate)
- Sending our filters to our darknet peers. (Easy)
- Updating our darknet peers once an hour with the diffs we create anyway. 
(Easy)
- Recording what version of the indexes we have sent to each darknet peer. 
(Easy)
- Updating our darknet peers when they reconnect after downtime with all the 
diffs they missed. (Easy)
- Using any filters we have to send GetOfferedKey-like requests for the data, 
and handling such requests. (Easy)
- Tracking success rates for bloom-based fetches. (Optional, can be done 
later) (Easy)
- Tracking a large number of opennet peers which we have connected to over the 
last week, recording their total uptime over the week. (Moderate)
- Calculating which have sufficient uptime for sharing bloom filters to be 
worthwhile. (Easy)
- Sending those nodes the bloom filters, and keeping them up to date, and 
recording what version we sent them last. (Easy)
- Limiting the total memory usage for Bloom filters, by adding it up and 
dropping the biggest filters from active use until we're under the limit, and 
telling the affected peers so they don't send us updates. (Moderate)
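
On the first task, a minimal sketch of the conversion, assuming the counting 
filter packs four 2-bit counters per byte, low bits first (the real on-disk 
layout may differ):

// Flatten the local 2-bit counting filter into the 1-bit filter we would
// share with peers. The packing is an assumption, not the actual format.
public class FilterFlattener {

    /** One bit per counter in the output, set iff that counter is non-zero. */
    static byte[] flatten(byte[] countingFilter) {
        int slots = countingFilter.length * 4;     // four 2-bit counters per byte
        byte[] oneBit = new byte[(slots + 7) / 8]; // roughly half the input size
        for (int slot = 0; slot < slots; slot++) {
            int counter = (countingFilter[slot / 4] >> ((slot % 4) * 2)) & 0x3;
            if (counter != 0)
                oneBit[slot / 8] |= 1 << (slot % 8);
        }
        return oneBit;
    }
}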