If an average node has a store size of 20GB (10GB cache) and adds 0.24 keys per second, it will add a full cache worth of keys in roughly 713 hours. Because the store is salted-hash rather than LRU, it does not wipe the store cleanly over that period; a salted hash store would normally have a half-life for un-wiped keys, although quadratic probing makes this a bit easier.
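For reference, a back-of-the-envelope version of that turnover calculation; the ~32KiB-per-block figure, and counting against the full 20GB store, are assumptions on my part, so it only roughly reproduces the 713 hours:

public class StoreTurnover {
    public static void main(String[] args) {
        double storeBytes = 20e9;        // assumed 20GB store (10GB cache + 10GB store)
        double blockBytes = 32 * 1024;   // assumed ~32KiB per stored block
        double keysPerSecond = 0.24;     // observed rate of new keys
        double keys = storeBytes / blockBytes;
        double hours = keys / keysPerSecond / 3600.0;
        System.out.printf("%.0f keys, %.0f hours (~%.0f days)%n",
                keys, hours, hours / 24.0);
        // Prints roughly 610000 keys, ~706 hours (~29 days): the same
        // ballpark as the 713 hours / 29 days quoted above.
    }
}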
In any case, it takes a *long time*: 29 days. We can maybe halve that figure if we assume almost all new keys are CHKs, but that's still around 14 days, and it's longer on a larger store.

So the question is: how practical is it for a mobile attacker to identify a block in its neighbours' stores (from the bloom filter), get connected to the neighbours' neighbours, download their bloom filters, and trace an insert or request back to its source one hop at a time?

We will likely require some uptime before we start transferring the bloom filter to a node. Assuming the attacker is not already connected, this could add a significant delay: we need to transfer the filter, limited by the upstream bandwidth of the node we are connected to and how much of it that node is prepared to use, and it shouldn't offer the bloom filter until we have had some amount of uptime - maybe 1 week? That adds a delay of roughly 1 week per hop. In some cases there will be so few hops that this won't solve the problem; in other cases a node will have a small cache (or a huge one), so the trail for one key will go cold, although the attacker could try connecting to all of its peers speculatively, or look for the file's other keys. And assuming we don't cache at max HTL, this will only get him so far. But if he did follow the right breadcrumb trail, it should be possible to look for the other keys in nearby nodes, walk around the circle surrounding the originator, and thus maybe figure out which node was the originator, or at least narrow it down to a small group.

With the current plan, we don't store a block until we are at least 2 hops away from the originator, with an average of 4 hops, because of the probabilistic decrement of the HTL from maximum (with probability 50%). Whether to decrement is determined once on each node. Backtracking changes this, so it is possible that the block gets cached on a node immediately adjacent to the originator if there was a lot of backtracking.
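In code terms, the decrement-at-maximum behaviour described above is roughly the following; this is only a sketch of what the text describes, and the class name and max-HTL constant are mine, not the real code:

import java.util.Random;

// Sketch of the probabilistic HTL decrement described above: each node decides
// ONCE, at startup, whether it decrements requests that arrive at maximum HTL,
// with probability 50%. Below max HTL it always decrements, and blocks are not
// cached at max HTL. Names and the max value are illustrative only.
public class HtlPolicySketch {
    static final int MAX_HTL = 18;                  // assumed maximum
    private final boolean decrementAtMax;

    public HtlPolicySketch(Random random) {
        this.decrementAtMax = random.nextBoolean(); // fixed per node, per the text
    }

    /** HTL to forward with, given the HTL the request arrived with. */
    public int nextHtl(int incomingHtl) {
        if (incomingHtl >= MAX_HTL && !decrementAtMax)
            return MAX_HTL;                         // pass on undecremented
        return Math.max(incomingHtl - 1, 0);
    }

    /** Per the plan above: never cache a block that arrives at max HTL. */
    public boolean shouldCache(int incomingHtl) {
        return incomingHtl < MAX_HTL;
    }
}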
If there is no backtracking, half of the originator's peers will have dropped the HTL, so their peers will have cached the data. It is therefore certainly possible to trace it, but it will probably require probing a largish number of nodes - probably hundreds - some of the probes depending on the results of others and some not; the attacker's bandwidth is the main limit on how many he can run in parallel, but by this point he should have found a whole bunch of the file's keys on the node. If he traced right back to the second hop (due to backtracking), he just needs the blooms for the nodes connected to the target; since he doesn't know which one is the target, that means a 2 hop radius - but there is likely to be overlap, so it may not be that many nodes. If he traced back to the third hop (the minimum without backtracking), he has a similar problem in determining the second hop, since we decremented on both the second and the third hops. Once he has found the second hop he can find other second hops, and hence the common first hop, by similar means, since 50% of the second hops decremented. And to identify the originator, he just needs to find lots of first hops. The more connections he can have open at once, the easier this is going to be, and that is limited by his bandwidth; in terms of timing, it is probably possible to avoid this taking the equivalent of many more hops, given reasonable bandwidth (it's a fair bet that the attacker has far more bandwidth than the average node, e.g. because he hired a colo).

Is this worse than the current situation? Assume an attacker is trying to trace the originator of a large file after the fact, i.e. they didn't intercept the insert. They can ask each peer for blocks from the file, and timing will tell them which blocks were already cached locally. Some nodes will answer even though they didn't have the block cached locally, which complicates matters by propagating those blocks a few hops. So the attacker connects to the nodes which already had blocks cached locally, requests those blocks from their peers, and maybe tries a few blocks he hasn't tried yet. The principle is thus very similar.

Once the attacker gets close to the target, things are a bit easier today than they will be under the current plans, because all nodes cached the data (no need to guess at which second hops didn't), and because the originator cached ALL of the data. But further out, bloom filters do have some advantages for an attacker: once we have fetched the bloom filters, we can identify all of the blocks that a peer holds without having to wait for load-limited probe requests, although on the other hand we have to wait for the transfer to finish and/or the uptime threshold. Transferring bloom filters in pieces may help an attacker, but only if there are very many blocks to be found.
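To make the "identify all of the blocks that a peer has" step concrete, something like the following is all an attacker needs once he has a peer's filter and the file's routing keys. This is a minimal stand-in: the real filter format, size, hash count and hashing scheme are not what is sketched here.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Minimal stand-in for probing a peer's bloom filter with the routing keys of
// a known file. A bloom filter never gives false negatives, only false
// positives, so hits on many blocks of the same file are strong evidence the
// peer holds part of it.
public class BloomProbe {
    private final BitSet bits;
    private final int filterBits;
    private final int hashCount;   // assumed <= 8 for a 32-byte digest

    public BloomProbe(BitSet bits, int filterBits, int hashCount) {
        this.bits = bits;
        this.filterBits = filterBits;
        this.hashCount = hashCount;
    }

    public boolean mightContain(byte[] routingKey) throws NoSuchAlgorithmException {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(routingKey);
        for (int i = 0; i < hashCount; i++) {
            // Derive each bit index from a different 4-byte slice of the digest.
            int idx = ((d[4 * i] & 0xff) << 24 | (d[4 * i + 1] & 0xff) << 16
                     | (d[4 * i + 2] & 0xff) << 8 | (d[4 * i + 3] & 0xff)) >>> 1;
            if (!bits.get(idx % filterBits)) return false;
        }
        return true;
    }

    /** How many of the file's blocks does this peer's filter claim to hold? */
    public int countHits(Iterable<byte[]> fileRoutingKeys) throws NoSuchAlgorithmException {
        int hits = 0;
        for (byte[] key : fileRoutingKeys)
            if (mightContain(key)) hits++;
        return hits;
    }
}

The membership test itself is local and essentially free; the only real cost to the attacker is obtaining the filter in the first place, which is exactly why the uptime threshold matters.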
One difficulty is where to start. If we assume we are tracing after the event, we didn't intercept the original insert, so it's unlikely our peers were on the chain and hence unlikely they have the data - but a peer being on the chain is 20x more likely than us having been on it ourselves, so it may be a good starting point nonetheless, especially if we run more than one node. We could in fact combine both attacks: we run a bunch of nodes, we are waiting for a particular file, we pick up from an hourly update that one of our peers was involved in it, and we connect to that peer's peers and try both to intercept more blocks and to get the adjacent nodes' bloom filters (or, in the current code, just probe for blocks, but that is slow). OTOH if we weren't running at the time, this introduces more delays, and the longer after the event we start, the less likely we are to succeed. In conclusion, it is probably a bit easier for an attacker...

How to improve security against such attacks?

- Increasing the uptime required before we will transfer bloom filters is an obvious measure.
- Reducing the average datastore size might help. :|
- Making it harder for a mobile attacker to connect to targeted peers is obviously a good thing for beating all of these attacks.
- Starting to cache further away from the originator might help a lot; probabilistic caching at any distance might help a bit.
- More average bandwidth will help somewhat.
- We should also look at backtracking behaviour: clearly we need to treat routing to the second-choice node differently from routing to the first-choice node (decide whether to decrement based on a different criterion), or requests will go a huge number of hops in some cases. But this isn't great from a security point of view, as you may get a low enough HTL to cache on a node adjacent to the originator.
- A much bigger network would increase the number of hops involved (only log^2 though) and the number of nodes to probe; on the current opennet of a few thousand nodes, a moderately powerful attacker could conceivably connect to everyone and get everyone's bloom filters.
- And of course tunnels would probably solve the problem, but at a rather massive performance cost - which may be a price worth paying for inserts, although not generally for requests.

Caching behaviour could also be different for inserts. E.g. cache min 2 / avg 4 hops away for requests, min 4 / avg 6 hops away for inserts? Or even decrease the decrement probability for inserts, giving e.g. min 4 / avg 8 hops away? The wider the circle, the more nodes it contains, and the more uncertain the distance to the centre (due to the different decrement probabilities), so the more work an attacker has to do to find the centre. Of course, the fact of *not* caching might give away some information, especially for large cache-at hop counts relative to the radius of the network...

IMHO we should probably go ahead with the current plans, with a slightly greater caching distance (a lower cache-at-HTL threshold) for inserts than for requests, and with some sort of uptime threshold for transferring bloom filters. And implement tunnels later on: before 0.8 if we have time (depending on how long it takes to build bloom filters and debug db4o), after it if we don't.
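Concretely, the insert/request distinction and the uptime threshold could be as small a change as something like this (all constants are placeholders I've made up, not decided values):

// Sketch of the proposed caching policy: a lower cache-at-HTL threshold for
// inserts than for requests, plus an uptime threshold before we will transfer
// our bloom filter to a peer. All constants are placeholders.
public class CachePolicySketch {
    static final int MAX_HTL = 18;                        // assumed maximum HTL
    static final int CACHE_AT_HTL_REQUEST = MAX_HTL - 1;  // cache once below max, as currently planned
    static final int CACHE_AT_HTL_INSERT  = MAX_HTL - 3;  // hypothetical: start caching further out
    static final long MIN_PEER_UPTIME_MS = 7L * 24 * 60 * 60 * 1000; // e.g. 1 week

    static boolean shouldCache(int htl, boolean isInsert) {
        return htl <= (isInsert ? CACHE_AT_HTL_INSERT : CACHE_AT_HTL_REQUEST);
    }

    static boolean shouldSendBloomFilter(long peerConnectedUptimeMs) {
        return peerConnectedUptimeMs >= MIN_PEER_UPTIME_MS;
    }
}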
