On Friday 01 May 2009 15:35:22 Matthew Toseland wrote:
> IMHO all the evidence points to Bloom filter sharing as *substantially* improving both speed and data reachability. I wonder whether it might be possible to implement it quickly: either before 0.8, between 0.8-beta1 and the final 0.8, or for a quick 0.8.1/0.9? The main issues:
>
> THE NUMBERS:
> The salted-hash datastore by default uses 1/2048th of the store size for its Bloom filters. This works out to a 23-element filter with 2-bit counting, i.e. 46 bits per key in the store (SSK, CHK, pubkey; store or cache). The easiest thing would be to reduce this to a 1-bit filter, cutting its size in half. Also, we don't need the pubkey store. So the filter we share with our peers would be roughly 1/6000th of the size of our datastore, which gives 18MB per 100GB of store. We could halve this again, but then the average failing request would be diverted about 2.5 times to nodes which might have the data but don't (24 hops * 20 peers = 480 Bloom checks at a false-positive rate of roughly 0.5%), and each such diversion makes the request wait.
>
> So if we have 20 peers each with a 500GB store, that gives 1.8GB of Bloom filters! On the other hand, if we have 20 peers each with a 10GB store, that gives 36MB of Bloom filters. Clearly the latter is acceptable; the former probably isn't for most nodes. Also note that for acceptable performance it is essential that the Bloom filters for all our online peers fit into RAM. What we are going to have to do is set an upper limit on the total memory we can use for Bloom filters, and drop the biggest filters until we reach the target.
>
> Say we set the Bloom limit to 128MB. Then a typical node would have:
>  192MB - default heap limit, for Freenet itself
>   25MB - Bloom filters for a 50GB local datastore
>  128MB - default limit for remote Bloom filters; allows our peers a total store size of about 700GB
>
> For a total of around 345MB plus Java overheads. Is this sort of number acceptable?
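
As a sanity check on the arithmetic above, here is a back-of-envelope calculation in Java. The 1/6000 ratio and the bits-per-key figures come straight from the paragraph above; the hash-function counts are plausible guesses for filters of that density, not the parameters the salted-hash store actually uses:

    // Rough check of the THE NUMBERS figures; constants are assumptions
    // taken from the mail, not from the freenet source.
    public class BloomSizing {

        // Shared filter ~ 1/6000th of the store (1-bit filter, no pubkeys).
        static long sharedFilterBytes(long storeBytes) {
            return storeBytes / 6000;
        }

        // Standard Bloom false-positive estimate for a filter with
        // bitsPerKey bits per key and k hash functions.
        static double falsePositiveRate(double bitsPerKey, int k) {
            return Math.pow(1 - Math.exp(-k / bitsPerKey), k);
        }

        public static void main(String[] args) {
            long GB = 1024L * 1024 * 1024;

            // 20 peers with 500GB stores -> on the order of 1.7GB of filters.
            System.out.printf("20 x 500GB peers: %d MB%n",
                    20 * sharedFilterBytes(500 * GB) / (1024 * 1024));
            // 20 peers with 10GB stores -> roughly 35MB of filters.
            System.out.printf("20 x 10GB peers:  %d MB%n",
                    20 * sharedFilterBytes(10 * GB) / (1024 * 1024));

            // 23 bits/key (the 1-bit filter) vs ~11.5 bits/key (halved again).
            System.out.printf("fp @ 23   bits/key: %.1e%n", falsePositiveRate(23, 16));
            System.out.printf("fp @ 11.5 bits/key: %.1e%n", falsePositiveRate(11.5, 8));
            // 24 hops * 20 peers = 480 Bloom checks per failing request.
            System.out.printf("diversions per failing request at 11.5 bits/key: %.1f%n",
                    480 * falsePositiveRate(11.5, 8));
        }
    }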

> SECURING IT:
> Giving an index of our datastore to our opennet peers, while storing all locally requested data in our datastore, would seem suicidal from a security perspective. We need to not store requests locally. How?
>
> The best answer at the moment is to implement encrypted rendezvous tunnels. IMHO this will take considerable time to get right, especially given that one tunnel is highly unlikely to provide good performance.
>
> But I wonder if a half-way approach might be sufficient for now, enabling us to not cache locally while not reducing security overall?
> - While the HTL is at maximum, check the datastore but do not cache returned data, and route randomly rather than according to the provided key.
> - We already decrement the HTL from the maximum randomly, and decide this once per peer.
> - We might want to increase the chance of decrementing from its current 25% (an average of 4 hops at maximum) to say 33% or even higher.
>
> IMHO this would take well under one day to implement. The question is, does it provide security *NO WORSE THAN* the current situation? IMHO it does. If we don't change the decrement probability, then nearby attackers already have the maximum-HTL flag to go on. Plus, random routing and routing to random keys are virtually identical for the purposes of correlation attacks.
>
> CHURN IMPACT:
> The main worry with regard to Bloom filter sharing performance is that it won't work on opennet. We only want to share our Bloom filters with long-lived peers, because there is some cost to transferring them in the first place and to keeping them up to date. This is really an unknown variable at this point: what proportion of our opennet peers will be long-lived enough that it is worthwhile sharing our Bloom filters with them?
>
> IMPLEMENTING IT:
> Main tasks:
> - Converting our datastore indexes into a compact, slightly more lossy, 1-bit filter. (Easy)
> - Creating a snapshot once an hour, and keeping it for some period (1 week?). (Easy)
> - Creating an efficient binary diff format for hour-to-hour diffs, and keeping them. (Moderate)
> - Sending our filters to our darknet peers. (Easy)
> - Updating our darknet peers once an hour with the diffs we create anyway. (Easy)
> - Recording which version of the indexes we have sent to each darknet peer. (Easy)
> - Updating our darknet peers when they reconnect after downtime, with all the diffs they missed. (Easy)
> - Using any filters we have to send GetOfferedKey-like requests for the data, and handling such requests. (Easy)
> - Tracking success rates for Bloom-based fetches. (Optional, can be done later) (Easy)
> - Tracking a large number of opennet peers which we have connected to over the last week, recording their total uptime over the week. (Moderate)
> - Calculating which of them have sufficient uptime for sharing Bloom filters to be worthwhile. (Easy)
> - Sending those nodes the Bloom filters, keeping them up to date, and recording which version we sent them last. (Easy)
> - Limiting the total memory usage for Bloom filters, by adding it up and then dropping the biggest filters from active use until we're not over the limit, and telling the affected peers so that they don't send us updates. (Moderate)
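
On the hour-to-hour diffs: the format does not need to be clever, since only a small fraction of the filter changes in an hour. A minimal sketch, assuming the filter is a plain byte array of fixed size (class and method names are made up for illustration, not taken from the tree):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Hypothetical hour-to-hour filter diff: a list of (offset, newByte)
    // pairs for every byte that changed between two snapshots.
    public class FilterDiff {

        static byte[] encode(byte[] oldFilter, byte[] newFilter) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (int i = 0; i < newFilter.length; i++) {
                if (oldFilter[i] != newFilter[i]) {
                    out.writeInt(i);             // offset into the filter
                    out.writeByte(newFilter[i]); // replacement byte
                }
            }
            return buf.toByteArray();
        }

        // Applied by the peer to bring its copy of our filter up to date.
        static void apply(byte[] filter, byte[] diff) throws IOException {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(diff));
            while (in.available() >= 5) {
                int offset = in.readInt();
                filter[offset] = in.readByte();
            }
        }
    }

Five bytes per changed filter byte is already tiny next to the full filter; if it matters, the offset list compresses well (it is sorted), or the whole diff can simply be gzipped.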
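
Returning to the SECURING IT section above: the max-HTL behaviour it describes is small enough to sketch. Everything here (class name, constants, where the per-peer decision lives) is illustrative only, not the node's actual request-handling code:

    import java.util.Random;

    // Sketch of the "half-way" approach: while the HTL is still at its
    // maximum, check the store but never cache, and route randomly
    // instead of towards the requested key.
    public class MaxHtlPolicy {

        static final short MAX_HTL = 18;
        // Probability of decrementing from the maximum; 25% today,
        // possibly 33% or higher under this proposal.
        static final double DECREMENT_FROM_MAX = 0.25;

        private final Random random = new Random();

        // Decided once per peer connection, not once per request.
        boolean decrementFromMaxForThisPeer() {
            return random.nextDouble() < DECREMENT_FROM_MAX;
        }

        // Never cache data returned while the request is still at max HTL.
        boolean cacheLocally(short htl) {
            return htl < MAX_HTL;
        }

        // Ignore the key's location when choosing the next hop at max HTL.
        boolean routeRandomly(short htl) {
            return htl == MAX_HTL;
        }
    }

With a per-peer decrement probability p, a request spends on average 1/p hops at maximum HTL (4 hops at 25%, 3 at 33%), so raising p shrinks the window in which data is neither cached nor routed towards its key.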

A suggested optimisation from evanbd:

[20:34:30] <evanbd> toad_: I figured out a solution to reducing the bloom filter size for memory management.
[20:37:16] <evanbd> You standardize on a size for the bloom filters; say 1MiB. Then, if your store has 100GB of data, and needs 18MiB of bloom filters, you partition the keyspace into 18 chunks of roughly equal population. Each segment of the keyspace then gets its own bloom filter.
[20:38:24] <evanbd> Then, if my node has 100MiB of memory to spend on my peers' bloom filters, and 20 peers, I just ask each peer for up to 5 bloom filters.
[20:40:11] <toad_> evanbd: it's good to have a partial filter for each node yes
[20:40:24] <toad_> evanbd: however, you end up checking more filters which increases false positives
[20:40:52] <evanbd> No, the fp rate stays the same.
[20:41:18] <evanbd> Suppose your node has 18 filters, each with a 1E-5 fp rate
[20:41:36] <evanbd> When I get a request, I compare it to your node's filter set.
[20:41:57] <evanbd> But only *one* of those filters gets checked, since each one covers a different portion of the keyspace
[20:42:19] <evanbd> So, regardless of how many filters you have, I have a 1E-5 fp rate.
[20:42:46] <evanbd> If I don't have all your filters, the fp rate will (on average) be lower, because for some keys I won't have the appropriate filter.
[20:47:31] <toad_> evanbd: hmmmm ok
[20:48:03] <toad_> evanbd: i don't see this making a big difference though, all it means is for nodes with big stores you get some matches instead of no matches
[20:53:29] <evanbd> It also means that you can send the filters to an opennet peer incrementally, in a way that is useful to it before you have time to send the full 80MB filter.
[20:58:51] <toad_> evanbd: that's a good point
[21:00:04] <evanbd> I think that makes the decision about when to bother sending a filter to a peer easier, too. You could just trickle it at a slow steady pace; if it disconnects before you finish, that's not a huge waste, since hopefully it's gotten some use out of the partial set.
[21:01:10] <evanbd> Anyway, afk for at least a little while.
[21:01:51] <toad_> hmmm that's true
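
evanbd's segmented scheme is easy to picture in code. A minimal sketch, assuming fixed-size segments chosen by the leading bytes of the routing key (routing keys are effectively uniform hashes, so equal-width slices are roughly equal-population); the class and the hash mixing are made up for illustration:

    import java.util.BitSet;

    // The keyspace is split into nSegments slices, each with its own
    // fixed-size Bloom filter, so a peer can usefully hold any subset
    // of the segment list (modelled here as a prefix for simplicity).
    public class SegmentedBloomFilter {

        private final BitSet[] segments;
        private final int bitsPerSegment;
        private final int hashCount;

        SegmentedBloomFilter(int nSegments, int bitsPerSegment, int hashCount) {
            this.segments = new BitSet[nSegments];
            this.bitsPerSegment = bitsPerSegment;
            this.hashCount = hashCount;
            for (int i = 0; i < nSegments; i++)
                segments[i] = new BitSet(bitsPerSegment);
        }

        private int segmentFor(byte[] routingKey) {
            int prefix = ((routingKey[0] & 0xff) << 8) | (routingKey[1] & 0xff);
            return prefix * segments.length / 65536;
        }

        private int bitIndex(byte[] routingKey, int hash) {
            // Toy mixing function standing in for a real double-hashing scheme.
            int h = hash * 0x9E3779B1;
            for (byte b : routingKey) h = h * 31 + b;
            return Math.floorMod(h, bitsPerSegment);
        }

        void add(byte[] routingKey) {
            BitSet seg = segments[segmentFor(routingKey)];
            for (int i = 0; i < hashCount; i++)
                seg.set(bitIndex(routingKey, i));
        }

        // A peer holding only the first segmentsHeld filters gets the
        // full-accuracy answer for keys in those slices and treats
        // everything else as "not present".
        boolean mightContain(byte[] routingKey, int segmentsHeld) {
            int seg = segmentFor(routingKey);
            if (seg >= segmentsHeld) return false;
            for (int i = 0; i < hashCount; i++)
                if (!segments[seg].get(bitIndex(routingKey, i))) return false;
            return true;
        }
    }

The property evanbd points out falls straight out of this: every lookup touches exactly one segment, so the per-request false-positive rate is that of a single filter however many segments a big store is split into, and a peer holding only part of the set simply gets fewer matches and, on average, a lower false-positive rate.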
