On Saturday 27 Aug 2011 21:35:58 Ian Clarke wrote:
> Matthew,
> 
> This makes sense from the perspective of making incremental changes, but I
> think we need to be more drastic than that.  I think we need to go back to
> the drawing board with load management.  We need to find a solution that is
> simple enough to reason about, and to debug if we have problems with it.
> 
> The entire approach of coming up with hypotheses about what is wrong,
> building a solution based on these hypotheses (without actually confirming
> that the hypotheses are accurate) and deploying it is deja vu, we've been
> doing it for a decade, and we still haven't got load management right.
>  We're just layering more complexity onto a system that we already don't
> understand, based on guesses as to what the problems were with the previous
> iteration that we can't test because the system is too complicated with too
> many interactions for anyone to get their heads around it.

That is because we do not have the time or funding to empirically test 
hypotheses. We don't gather enough data, we don't have a huge testnet to try 
stuff on over extended periods, and so on. Most software development has 
deadlines and resource limitations, but most software development doesn't NEED 
to be empirical, at least not very often! See my other mail about this.
> 
> If something isn't working, and we don't understand for definite what is
> wrong with it, then we shouldn't build more on top of it in the hope that
> we'll accidentally solve the problem, we should replace it with something
> simple enough that we do understand it, right?
> 
> The purpose of load management is relatively simple: *Don't allow clients to
> pump more requests into the network than the network can handle, while
> ensuring that this workload is distributed across the network in a
> reasonably efficient manner.  This must be done in a decentralized way that
> is not vulnerable to abuse.*
> 
> The current load management system includes many different interacting
> components that make it nearly impossible to understand or debug.  I think
> we need to go back to the drawing board with load management, starting from
> the goal I cite above.
> 
> I would invite people to suggest the simplest possible load management
> schemes that might work, let's then discuss them, and figure out which is
> most likely to work, and if it doesn't work, which will be easiest to debug.
> 
> We can bear in mind a few lessons we've learned, though.  What
> characteristics should our load balancing system have?
> 
>    - Dropping requests should be extremely rare because this
>    just exacerbates overloading
>    - Delaying requests should also be extremely rare for the same reason
>    - Misrouting requests should be limited, but perhaps acceptable
>    occasionally, for the same reason

Misrouting is unacceptable in general. Extremely overloaded or extremely 
low-capacity nodes may be routed around. We might even allow some bounded 
amount of misrouting in the more general case (e.g. go to either of the top two 
peers for the key). But in general, transforming load into misrouting (or into 
reduced HTL, or any other bogus escape valve) is a bad idea. We need to reduce 
the incoming load.
> 
> It therefore really comes down to nodes anticipating whether the network is
> in danger of having to drop, delay, or misroute requests, and reduce the
> rate at which they are pumping requests into the network if this danger
> point is reached.
> 
> So how do nodes inform each-other that they are at risk of having to drop,
> delay, or misroute requests?  There are two reasons this might happen.
>  Either a node itself is close to its own capacity for relaying requests, or
> the other nodes it relays to are themselves at or close to capacity.
> 
> One problem I'm concerned about when nodes share information about how
> overloaded their peers are is a "gridlock":
> 
> What if all nodes told other nodes that they were overloaded because all
> their peers are overloaded.  Such a situation would basically cause the
> entire network to think it was overloaded, even though nobody actually was!
>  It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can this
> be avoided?

Exactly. There are two possible approaches:
1. Nodes control the rate at which *local* requests are initiated, based on how 
many slow-down signals they get.

(This ensures that such gridlock or feedback loop problems can't happen)
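
To make that concrete, here is a minimal sketch of what such a limiter might 
look like. This is illustrative only, not existing Freenet code; 
LocalRequestLimiter and the constants are invented:

public class LocalRequestLimiter {
    // Maximum number of concurrent locally originated requests (the AIMD "window").
    private double windowSize = 1.0;
    private static final double MIN_WINDOW = 1.0;
    private static final double INCREASE = 1.0; // additive increase, spread over one window's worth of requests
    private static final double DECREASE = 0.5; // multiplicative decrease on a slow-down signal

    private int inFlight = 0;

    // Called before starting a local request; returns false if the client must wait.
    public synchronized boolean tryStartRequest() {
        if (inFlight >= (int) windowSize) return false;
        inFlight++;
        return true;
    }

    // Called when a local request finishes. sawSlowDown is true if any slow-down
    // signal (e.g. a non-local RejectedOverload) was received for it.
    public synchronized void onCompleted(boolean sawSlowDown) {
        inFlight--;
        if (sawSlowDown) {
            windowSize = Math.max(MIN_WINDOW, windowSize * DECREASE);
        } else {
            windowSize += INCREASE / windowSize;
        }
    }
}

Because only locally originated requests are throttled, a node's peers being 
overloaded slows down its own clients but is never re-reported as the node 
itself being overloaded, so the gridlock above can't propagate.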

The main problems are security-related:
- Deliberate flooding: Malicious nodes not following AIMDs, in order to clog up 
the network.

Fair sharing can limit this. Limiting backoff so that we only misroute when 
nodes are severely overloaded can also help. We could queue (provided this 
happens only rarely and briefly), allow bounded misrouting (e.g. route to one 
of the top two peers for the key; see the sketch below), or terminate requests 
(thus freeing up resources).
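
For the bounded misrouting option, something like the following would cap how 
far we can stray from the best route. Again just a sketch; Peer and the method 
names are stand-ins for the real routing code:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

interface Peer {
    double distanceTo(double keyLocation);
    boolean isOverloaded();
}

final class BoundedRouter {
    // Returns the chosen peer, or null if both of the top two candidates are
    // overloaded, in which case the caller should queue, reject or send a
    // slow-down signal rather than misroute any further.
    static Peer choose(double keyLocation, List<Peer> peers) {
        List<Peer> sorted = new ArrayList<>(peers);
        sorted.sort(Comparator.comparingDouble(p -> p.distanceTo(keyLocation)));
        Peer best = sorted.get(0);
        if (!best.isOverloaded()) return best;
        if (sorted.size() > 1 && !sorted.get(1).isOverloaded()) return sorted.get(1);
        return null;
    }
}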

- Opportunistic flooding: Users hack their nodes to not follow AIMDs because 
it speeds up their downloads.

This is harder to solve than deliberate flooding, but also less urgent. Fair 
sharing helps, but the slow-down signals, or rejections/terminations, may be 
generated a long way from the originator, in which case the flooder is only 
proportionately affected, meaning it's still beneficial for them to flood. It 
may be that this isn't resolvable and we'll have to continue to rely on a 
certain amount of altruism: ask people not to use the Fr33n3t L33t L34ch Cl13nt 
because it slows things down for everyone else, should this arise.

- Local privacy: An attacker might be able to identify which requests are 
locally originated, based on timings.

This is somewhat speculative; while it is probably exploitable, there are much 
easier attacks if you are directly connected to the target.

- Remote privacy: An attacker might be able to correlate different bundles of 
requests, based on timings.

This is speculative, but it could be solved by using separate AIMDs for 
separate, unconnected bundles of requests; we will need some sort of 
tagging/grouping mechanism like this when we have tunnels anyway (a rough 
sketch follows).
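
If we go that way, the grouping could be as simple as keying one limiter (like 
the LocalRequestLimiter sketched above) per bundle tag; names here are 
hypothetical:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PerBundleLimiters {
    // One independent AIMD state per unconnected bundle of requests, so a shared
    // limiter doesn't introduce timing correlations between bundles.
    private final Map<String, LocalRequestLimiter> limiters = new ConcurrentHashMap<>();

    LocalRequestLimiter limiterFor(String bundleTag) {
        return limiters.computeIfAbsent(bundleTag, tag -> new LocalRequestLimiter());
    }
}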

2. Nodes control the rate of *remote* requests too, propagating information on 
the capacity of the network. For instance, they could estimate the capacity of 
the network and generate slow-down signals when incoming requests exceed that 
capacity (and eventually terminate requests); this is self-feeding, because 
their estimate of capacity depends on *incoming* slow-down messages!

The main problem with this is that feedback between nodes could well result in 
the estimated capacity collapsing to zero, or whatever the minimum is. It is 
probably possible to surmount this, but it would certainly require a great deal 
of theoretical (simulation) work.
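
For completeness, a very rough sketch of what the estimator in #2 might look 
like; this is entirely speculative, the names and constants are invented, and 
it also shows exactly where the collapse-to-minimum feedback would come from:

public class DownstreamCapacityEstimator {
    private double estimatedCapacity = 10.0; // requests/sec, arbitrary starting guess
    private static final double MIN_CAPACITY = 1.0;
    private static final double DECAY = 0.9;   // cut on each downstream slow-down signal
    private static final double GROWTH = 0.05; // cautious additive probe per accepted request

    // A slow-down message arrived from downstream: lower our estimate.
    // If every node does this on every incoming signal, estimates can spiral
    // towards MIN_CAPACITY -- the feedback problem described above.
    public synchronized void onDownstreamSlowDown() {
        estimatedCapacity = Math.max(MIN_CAPACITY, estimatedCapacity * DECAY);
    }

    // Called for each incoming request; false means the caller should send a
    // slow-down signal upstream (and eventually terminate the request).
    public synchronized boolean acceptIncoming(double currentIncomingRate) {
        if (currentIncomingRate > estimatedCapacity) return false;
        estimatedCapacity += GROWTH; // probe for more capacity when we're coping
        return true;
    }
}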

We can reuse existing code to a significant degree:
- RejectedOverload should be split up. It already serves as a slow-down message 
(provided local=false); it should also indicate whether it is due to an actual 
rejection or to using more resources than expected.
- The easiest way to implement #1 is with AIMDs. We can keep them, but send 
slow-down messages earlier.
- We could keep NLM, i.e. queue for limited times to smooth out delays. 
Currently we RNF after a (longish) period waiting for a node to route to. We 
would send a slow-down message when we have queued for more than a certain 
period (see the sketch after this list). This means we can have a fixed, 
limited amount of misrouting, we keep NLM's clear benefits for routing accuracy 
(as visible in success rates), and we ensure that the input load is low enough 
to avoid severe slowdown due to long queueing.
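
A sketch of that queueing rule; timings and names are illustrative, not the 
actual NLM code:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class RouteWaiter {
    // Wait a bounded time for a routable peer. If we queue past the threshold,
    // send one slow-down signal upstream; past the hard timeout, give up (the
    // caller treats null as RouteNotFound, as now).
    static <P> P waitForRoute(BlockingQueue<P> routablePeers, Runnable sendSlowDownUpstream)
            throws InterruptedException {
        final long slowDownThresholdMs = 2000;
        final long hardTimeoutMs = 10000;
        long start = System.currentTimeMillis();
        boolean slowDownSent = false;
        while (true) {
            P next = routablePeers.poll(500, TimeUnit.MILLISECONDS);
            if (next != null) return next;
            long waited = System.currentTimeMillis() - start;
            if (!slowDownSent && waited > slowDownThresholdMs) {
                sendSlowDownUpstream.run(); // ask the previous hop to reduce its rate
                slowDownSent = true;
            }
            if (waited > hardTimeoutMs) return null;
        }
    }
}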
> 
> Ian.