IMHO all the evidence points to Bloom filter sharing *substantially* improving both speed and data reachability. I wonder whether it might be possible to implement it quickly: either before 0.8, after 0.8-beta1 and before 0.8, or for a quick 0.8.1/0.9? The main issues:

THE NUMBERS:

The salted-hash datastore by default uses 1/2048th of the store size for its Bloom filters. This works out to a 23-element filter with 2-bit counting, i.e. 46 bits per key in the store (ssk, chk, pubkey, store or cache). The easiest change would be to reduce this to a 1-bit filter, cutting its size in half. Also, we don't need the pubkey store, so the filter we share with our peers would be approximately 1/6000th of the size of our datastore. This gives 18MB per 100GB of datastore. We could halve this again, but then the average failing request would be diverted 2.5 times to nodes which might have the data but don't (24 hops * 20 peers = 480 bloom checks, vs 0.005% false positives); each such diversion makes the request wait.

So if we have 20 peers each with a 500GB store, that gives 1.8GB of Bloom filters! On the other hand, if we have 20 peers each with a 10GB store, that gives 36MB of Bloom filters. Clearly the latter is acceptable; the former probably isn't for most nodes. Note also that for acceptable performance it is essential that the Bloom filters for all our online peers fit into RAM.

What we are going to have to do is set an upper limit on the total memory we can use for Bloom filters, and drop the biggest filters until we reach the target. Say we set the bloom limit to 128MB. Then a typical node would have:

192MB - default heap limit, for Freenet itself
 25MB - Bloom filter for a 50GB local datastore
128MB - default remote Bloom filters limit, allows peers to have 700GB total store size

For a total of around 345MB plus Java overheads. Is this sort of number acceptable? A rough sketch of the budgeting follows below.
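To make the budgeting concrete, here is a rough sketch in Java. It is purely illustrative: the class and field names (PeerFilter, BLOOM_MEMORY_LIMIT, enforceLimit) are invented, not existing Freenet code. It sizes each peer's filter at roughly 1/6000th of that peer's store and drops the biggest filters until the total fits under the 128MB default.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Rough sketch: decide which peers' Bloom filters to keep in RAM. */
public class BloomBudgetSketch {

    /** Assumed default cap on memory used for remote peers' filters. */
    static final long BLOOM_MEMORY_LIMIT = 128L * 1024 * 1024;

    /** Hypothetical per-peer record. */
    static class PeerFilter {
        final String peer;
        final long storeSizeBytes;
        boolean active = true;          // are we holding this filter in RAM?

        PeerFilter(String peer, long storeSizeBytes) {
            this.peer = peer;
            this.storeSizeBytes = storeSizeBytes;
        }

        /** ~1/6000th of the peer's store: 1-bit filter, no pubkey store. */
        long filterSizeBytes() {
            return storeSizeBytes / 6000;
        }
    }

    /** Drop the biggest filters until the total fits under the limit. */
    static void enforceLimit(List<PeerFilter> filters) {
        List<PeerFilter> bySize = new ArrayList<>(filters);
        bySize.sort(Comparator.comparingLong(PeerFilter::filterSizeBytes).reversed());
        long total = bySize.stream().mapToLong(PeerFilter::filterSizeBytes).sum();
        for (PeerFilter f : bySize) {
            if (total <= BLOOM_MEMORY_LIMIT) break;
            f.active = false;               // stop using this peer's filter...
            total -= f.filterSizeBytes();   // ...so it no longer counts
        }
    }

    public static void main(String[] args) {
        // 20 peers with 500GB stores: ~1.7GB of filters, far over the limit.
        List<PeerFilter> peers = new ArrayList<>();
        for (int i = 0; i < 20; i++)
            peers.add(new PeerFilter("peer" + i, 500L * 1000 * 1000 * 1000));
        enforceLimit(peers);
        long kept = peers.stream().filter(p -> p.active)
                         .mapToLong(PeerFilter::filterSizeBytes).sum();
        System.out.println("kept " + kept / (1024 * 1024) + "MB of filters");
    }
}

In the real node this would of course hang off the peer list, and we would tell a peer when its filter is dropped so it stops sending us updates, as noted above.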
SECURING IT:

Giving an index of our datastore to our opennet peers, while storing all locally requested data in our datastore, would seem suicidal from a security perspective. We need to not store locally requested data. How? The best answer atm is to implement encrypted rendezvous-tunnels. IMHO this will take considerable time to get right, especially given that one tunnel is highly unlikely to provide good performance. But I wonder if a half-way approach might be sufficient for now, enabling us to not cache locally while not reducing security overall:

- While the HTL is at maximum, check the datastore but do not cache returned data, and route randomly rather than according to the provided key (see the sketch after this section).
- We already decrement the HTL from the maximum randomly, and decide this once per peer.
- We might want to increase the chance of decrementing from its current 25% (average 4 hops at maximum) to say 33% or even higher.

IMHO this would take well under one day to implement. The question is, does it provide security *NO WORSE THAN* the current situation? IMHO it does. If we don't change the decrement probability, then nearby attackers already have the HTL flag. Plus, random routing and routing to random keys are virtually identical for the purposes of correlation attacks.
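To illustrate the routing-side change, a minimal sketch, one instance per peer connection. Everything here is invented for illustration (HtlSketch, MAX_HTL = 18, the method names); it is not the actual routing code, just the decision logic described above.

import java.util.Random;

/**
 * Illustrative sketch of the half-way approach; one instance per peer
 * connection, so the "decrement at max?" decision is made once per peer.
 */
public class HtlSketch {

    static final int MAX_HTL = 18;                      // assumed maximum HTL
    static final double DECREMENT_PROBABILITY = 0.25;   // could be raised to ~0.33

    private final Random random = new Random();

    /** Fixed for the lifetime of this peer connection. */
    final boolean decrementAtMax = random.nextDouble() < DECREMENT_PROBABILITY;

    /** HTL to pass to the next hop. */
    int nextHtl(int htl) {
        if (htl >= MAX_HTL)
            return decrementAtMax ? htl - 1 : htl;      // probabilistic at the top
        return Math.max(htl - 1, 0);                    // deterministic below the top
    }

    /** While at maximum HTL: check the store, but do not cache returned data. */
    boolean cacheLocally(int htl) {
        return htl < MAX_HTL;
    }

    /** While at maximum HTL: ignore the key's location and pick a random peer. */
    boolean routeRandomly(int htl) {
        return htl >= MAX_HTL;
    }
}

Raising DECREMENT_PROBABILITY from 0.25 to 0.33 would cut the expected run at maximum HTL from about 4 hops to about 3, i.e. fewer hops that neither cache nor route towards the key.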
CHURN IMPACT:

The main worry about Bloom filter sharing performance is that it won't work on opennet. We only want to share our Bloom filters with long-lived peers, because there is some cost to transferring them in the first place and to keeping them up to date. At this point this is really an unknown variable: what proportion of our opennet peers will be long-lived enough that sharing our bloom filters with them is worthwhile?

IMPLEMENTING IT:

Main tasks:

- Converting our datastore indexes into a compact, slightly more lossy, 1-bit filter. (Easy; see the first sketch below)
- Creating a snapshot once an hour, and keeping it for some period (1 week?). (Easy)
- Creating an efficient binary diff format for hour-to-hour diffs, and keeping them. (Moderate; see the second sketch below)
- Sending our filters to our darknet peers. (Easy)
- Updating our darknet peers once an hour with the diffs we create anyway. (Easy)
- Recording what version of the indexes we have sent to each darknet peer. (Easy)
- Updating our darknet peers when they reconnect after downtime with all the diffs they missed. (Easy)
- Using any filters we have to send GetOfferedKey-like requests for the data, and handling such requests. (Easy)
- Tracking success rates for bloom-based fetches. (Optional, can be done later) (Easy)
- Tracking a large number of opennet peers which we have connected to over the last week, recording their total uptime over the week. (Moderate)
- Calculating which of them have sufficient uptime for sharing bloom filters to be worthwhile. (Easy)
- Sending those nodes the bloom filters, keeping them up to date, and recording what version we sent them last. (Easy)
- Limiting the total memory usage for Bloom filters, by adding it up and then dropping the biggest filters from active use until we're not over the limit, and telling our peers so they don't send us updates. (Moderate)
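For the first task above, a minimal sketch of the 1-bit conversion: each 2-bit counter in the existing counting filter becomes a single bit, set iff the counter is non-zero. The packed layout assumed here (four 2-bit counters per byte, low bits first) is a guess; the salted-hash store's real on-disk format may differ.

/**
 * Sketch: collapse a 2-bit counting Bloom filter into a 1-bit filter.
 * Assumes counters are packed four per byte, lowest-order bits first;
 * the actual on-disk layout of the salted-hash store may differ.
 */
public class BloomFoldSketch {

    /** countingFilter.length * 4 counters in, one bit out per counter. */
    static byte[] foldToOneBit(byte[] countingFilter) {
        int counters = countingFilter.length * 4;      // 2 bits per counter
        byte[] oneBit = new byte[(counters + 7) / 8];  // 1 bit per counter
        for (int i = 0; i < counters; i++) {
            int twoBits = (countingFilter[i / 4] >> ((i % 4) * 2)) & 0x3;
            if (twoBits != 0)                          // non-zero counter => bit set
                oneBit[i / 8] |= (byte) (1 << (i % 8));
        }
        return oneBit;
    }
}

(The fold itself loses no membership information; the "slightly more lossy" presumably refers to a plain bit array not being decrementable as blocks are evicted, which the hourly snapshots would paper over.)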

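And for the hour-to-hour diffs, one possible cheap format, again only a sketch and not a proposal for the actual wire format: XOR the previous and current 1-bit filters and emit (offset, xor-byte) pairs for the bytes that changed. Only a small fraction of the filter should change in an hour, so this should stay compact, and applying a diff is just XORing the listed bytes back in.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Sketch of a trivial binary diff for hourly Bloom filter updates. */
public class BloomDiffSketch {

    /**
     * Emit (int offset, byte xor) pairs for every byte that differs.
     * Assumes old and new filters have the same length (same store size).
     */
    static byte[] diff(byte[] oldFilter, byte[] newFilter) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < newFilter.length; i++) {
            int xor = (oldFilter[i] ^ newFilter[i]) & 0xFF;
            if (xor != 0) {
                out.writeInt(i);        // offset into the filter (big-endian)
                out.writeByte(xor);     // bits that flipped this hour
            }
        }
        out.flush();
        return buf.toByteArray();
    }

    /** Apply a diff in place: XOR the listed bytes back in. */
    static void apply(byte[] filter, byte[] diff) {
        for (int p = 0; p + 4 < diff.length; p += 5) {
            int offset = ((diff[p] & 0xFF) << 24) | ((diff[p + 1] & 0xFF) << 16)
                       | ((diff[p + 2] & 0xFF) << 8) | (diff[p + 3] & 0xFF);
            filter[offset] ^= diff[p + 4];
        }
    }
}

Because XOR composes, a peer that was offline for several hours can simply apply the missed diffs in order, which is exactly the reconnect case in the task list.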