Currently our load management system works like this:
- At the level of any node, a request will be preferentially routed to the best 
node by location (or FOAF location), but if many nodes are backed off, we may 
end up routing to the worst node by location. In other words there is no limit 
whatsoever on misrouting.
- At the level of the node originating requests, we attempt to estimate the 
capacity of the network in a manner similar to TCP (although more vaguely, 
since we operate on rates rather than actual windows).

Hence we do not limit misrouting, but we do limit load. And limiting load on 
the sender side is very expensive and inflexible, costing us a great deal of 
performance.

The CCN paper recently, and various people on Frost long before it, have 
pointed out that we do not in fact need to limit load. We don't need to limit 
it at the sender side, and arguably we don't need to do more at the level of 
any individual node than basic limiting of the number of requests in flight. 
This is because every transfer of data is the result of a request, so 
congestion in the sense that TCP/IP deals with cannot really occur.

Clearly we do need to limit the number of requests which are pending on any 
given node. If we don't, we will end up sharing a finite and often very small 
resource (the output bandwidth limit) among an unbounded and potentially very 
large number of requests.

Also it would be good to directly limit misrouting by refusing to send requests 
to a node that is too far from the ideal for the key we are requesting. 
However, some nodes will always be very slow, and some nodes will be 
temporarily very slow. IMHO these are two different issues: nodes which are 
always very slow (e.g. due to being on dialup) should publish a very small 
capacity and be used occasionally, when they can be used; the fact that we 
can't use such a node for most of the requests that would in theory match it 
should not be a big concern with regard to misrouting. Nodes which are 
temporarily very slow (e.g. temporarily suffering under massive CPU load) 
should simply be ignored: they reject requests or time out, and we can back 
off. Hence backoff should only happen in critical cases (e.g. high ping time), 
and most of the time load is limited by the published request limit, which 
takes over from preemptive rejection based on output bandwidth liability.

Even with these precautions we will need a heuristic for the degree of 
misrouting we will tolerate. Options:
- Always route to the ideal node, subject to the above limitations. Arguably 
this is best, provided we can exclude outliers in a satisfactory way.
- Allow routing to the top two or three nodes.
- Take a proportion of the routing table (e.g. top 25%).
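
To make the options above concrete, candidate selection could look roughly 
like the following sketch. This is not actual Freenet code: the names 
(RoutingCandidates, PeerStub) are made up for illustration, distance() stands 
for the usual circular keyspace distance, and the caller is assumed to have 
already filtered out backed-off peers.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    /** Illustrative sketch: pick the peers we are willing to route a key to. */
    final class RoutingCandidates {

        /** Circular keyspace distance between two locations in [0.0, 1.0). */
        static double distance(double a, double b) {
            double d = Math.abs(a - b);
            return Math.min(d, 1.0 - d);
        }

        /** Return the closest peers, limited to maxCandidates
            (1 for "ideal only", 2-3 for "top few", or 25% of the table). */
        static List<PeerStub> candidatesFor(double keyLocation,
                                            List<PeerStub> peers,
                                            int maxCandidates) {
            List<PeerStub> sorted = new ArrayList<PeerStub>(peers);
            sorted.sort(Comparator.comparingDouble(
                    p -> distance(p.location, keyLocation)));
            return sorted.subList(0, Math.min(maxCandidates, sorted.size()));
        }
    }

    /** Minimal stand-in for a peer record. */
    final class PeerStub {
        final double location;
        PeerStub(double location) { this.location = location; }
    }

The choice of maxCandidates is exactly the heuristic in question.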

The big remaining question is what to do when a request cannot be immediately 
routed to a sufficiently good node. We can reject it (causing downstream nodes 
to misroute and maybe find a better path), fail it (causing the originator to 
retry, possibly with a different key in the case of a splitfile), or queue it. 
IMHO in most cases queueing is the right answer, at least for bulk fetches 
(for realtime fetches, e.g. fproxy fetching small files, we may want a 
"realtime" mode which is more likely to fail but won't be queued). When we 
accept a request, we decide which node to route it to. If that node has 
capacity we immediately send the request and use up some of that capacity. If 
it doesn't, we wait: we grab a FIFO ticket so that our request can be sent as 
soon as capacity is available, i.e. as soon as a request completes. We allow a 
reasonably generous maximum queueing time, after which we time out and fail 
the request. If the node becomes backed off due to a serious failure, or 
radically reduces its capacity, again we will probably fail the request.
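
The FIFO ticket idea can be sketched with a fair semaphore per peer, where the 
permits correspond to the capacity slots that peer has announced. This is only 
an illustration under that assumption; PeerRequestSlots is a made-up name, and 
capacity reductions and backoff are ignored here.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    /** Illustrative sketch of per-peer FIFO queueing for request capacity. */
    final class PeerRequestSlots {

        private final Semaphore slots;

        PeerRequestSlots(int announcedCapacitySlots) {
            // true = fair: waiting requests get slots in FIFO order,
            // which is exactly the "FIFO ticket" behaviour.
            slots = new Semaphore(announcedCapacitySlots, true);
        }

        /** Try to claim a slot, waiting up to maxQueueMillis (the generous
            maximum queueing time). Returns false on timeout, in which case
            the request is failed. */
        boolean claimSlot(long maxQueueMillis) throws InterruptedException {
            return slots.tryAcquire(maxQueueMillis, TimeUnit.MILLISECONDS);
        }

        /** Called when a request to this peer completes, freeing its slot. */
        void releaseSlot() {
            slots.release();
        }
    }

A fair semaphore grants waiting requests their slots in arrival order, and a 
timed-out tryAcquire corresponds to failing the request after the maximum 
queueing time.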

If we want to allow *some* misrouting beyond the basic precautions given above 
(i.e. if we allow routing to the top two or three nodes, or the top 25% of the 
table), then this is a bit more complex, as we will have to decide which 
request to allow through based both on how long it has been queued and on how 
close to ideal the node which now has capacity is for that request.

Could we simplify this proposal further by keeping backoff and preemptive 
rejection based on output bandwidth liability? Perhaps: we choose the ideal 
node. If it has been backed off long-term beyond some threshold, we move on to 
the next node. If it is not backed off, we send the request. Otherwise we grab 
a FIFO ticket. When the node comes out of backoff, all requests waiting on the 
FIFO queue are sent. Unfortunately this doesn't work, because there is no 
limit on how many requests can be sent at that point - so most of them will be 
rejected, and will then either have to misroute to the next node or wait for 
this node again. So explicit capacity announcement is an important part of the 
scheme.

But we can break it up into stages:

1. Explicit capacity announcement

The node already calculates its capacity as part of output bandwidth liability 
limiting (preemptive rejection). We take our output bandwidth limit, subtract 
a plausible estimate of overheads, and multiply the result by the maximum 
acceptable transfer time. This gives us a value in bytes for the total data 
transferred if all requests succeed. We can do this calculation and regularly 
announce both the total and the currently occupied portion to all our peers. 
They can then use this data to decide whether to send us a request, by 
counting the requests they have sent since the last time they had an update. 
Of course, other nodes may get in first, in which case requests will still be 
rejected. But it will still reduce the number of rejections.
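
The arithmetic is simple; a sketch, with made-up names and purely illustrative 
example numbers:

    /** Illustrative sketch of the capacity calculation described above. */
    final class CapacityEstimate {

        /**
         * @param outputBytesPerSecond   configured output bandwidth limit
         * @param overheadBytesPerSecond plausible estimate of overheads
         * @param maxTransferSeconds     maximum acceptable transfer time
         * @return total capacity in bytes, i.e. the total data transferred
         *         if all accepted requests succeed
         */
        static long totalCapacityBytes(long outputBytesPerSecond,
                                       long overheadBytesPerSecond,
                                       int maxTransferSeconds) {
            long usable = Math.max(0, outputBytesPerSecond - overheadBytesPerSecond);
            return usable * maxTransferSeconds;
        }
    }

    // Example (made-up numbers): a 40 KiB/s limit with 8 KiB/s of overhead
    // and a 120 second window gives (40960 - 8192) * 120 = 3,932,160 bytes
    // of announced capacity.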

2. More fairness

Divide the total global limit by our number of peers. This gives us a 
guaranteed limit for each peer. We promise not to send a new-style reject 
while a peer is within its guaranteed limit, even if that temporarily puts us 
over our overall limit; if we do, they can disconnect, back off, or flag us up 
as a problem (this needs to be implemented in some form). We send the per-node 
limit with the other capacity data. When we get a request, we only consider 
the overall limit if the requesting peer is over its per-node limit. This does 
mean we can exceed the overall limit by up to a factor of two, so we may want 
to adjust the overall limit accordingly. But it allows bursting and fairness 
at the same time, while significantly reducing rejections.
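
A sketch of the resulting two-level accept decision, assuming we track 
occupied capacity in bytes per peer and overall (names made up):

    /** Illustrative sketch of the two-level accept decision in stage 2. */
    final class AcceptDecision {

        /**
         * @param peerUsedBytes  capacity this peer currently has in flight with us
         * @param totalUsedBytes capacity in flight from all peers combined
         * @param totalCapacity  our announced total capacity (stage 1)
         * @param peerCount      number of connected peers
         */
        static boolean shouldAccept(long peerUsedBytes, long totalUsedBytes,
                                    long totalCapacity, int peerCount) {
            long perPeerGuarantee = totalCapacity / peerCount;
            // Within its guaranteed share a peer is never rejected, even if
            // that temporarily pushes us over the overall limit.
            if (peerUsedBytes < perPeerGuarantee) return true;
            // Beyond the guarantee, fall back to the overall limit (bursting).
            return totalUsedBytes < totalCapacity;
        }
    }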

3. More stats

Calculate the median capacity, and maybe some percentiles. Show the whole 
ordered list on the stats page.
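
For reference, a minimal percentile helper over the announced capacities might 
look like this (illustrative only):

    import java.util.Arrays;

    /** Illustrative helper: percentiles of the announced peer capacities. */
    final class CapacityStats {

        /** q in [0, 1], e.g. 0.5 for the median, 0.25 for the lower quartile. */
        static long percentile(long[] capacities, double q) {
            if (capacities.length == 0) return 0;
            long[] sorted = capacities.clone();
            Arrays.sort(sorted);
            int index = (int) Math.round(q * (sorted.length - 1));
            return sorted[index];
        }
    }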

4. Backoff changes

Modify RejectedOverload: if we reject because of output bandwidth liability, 
send a different message, including our current status, and do not add the 
rejected UID to the list of already-seen UIDs. Hence the sender can retry the 
same request here if it wants to. On the recipient side, accept the new 
rejection but don't do anything special with it yet; just treat it as an 
overload rejection.
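
A rough sketch of what the new rejection could carry, purely to illustrate the 
idea; these are not Freenet's actual message classes:

    /** Illustrative stand-ins only. The point is what the new reject carries
        and what it does not do (no UID recorded, no backoff triggered). */
    final class CapacityReject {
        final long requestUid;
        final long usedBytes;      // our currently occupied capacity
        final long totalCapacity;  // our announced total capacity
        CapacityReject(long requestUid, long usedBytes, long totalCapacity) {
            this.requestUid = requestUid;
            this.usedBytes = usedBytes;
            this.totalCapacity = totalCapacity;
        }
    }

    final class RejectPath {
        /** Reject due to output bandwidth liability: build the new message
            with our current status. Deliberately do NOT add the UID to the
            already-seen set, so the sender may retry the same request here. */
        static CapacityReject rejectForCapacity(long requestUid, long usedBytes,
                                                long totalCapacity) {
            return new CapacityReject(requestUid, usedBytes, totalCapacity);
        }
    }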

5. Misrouting limiting and basic queueing

In RequestSender, calculate the ideal node to route to, taking into account 
backoff but ignoring capacity. Check whether that node has spare capacity. If 
it does, send the request; if the request is rejected with an old-style 
reject, back off and reroute if allowed (else fail); if it is rejected with a 
new-style reject, update the capacity estimate, do not back off, and wait for 
more capacity (on that specific node). If there is no capacity, and the node's 
capacity is below K * the median capacity (one half?), move on to the next 
node; otherwise wait: grab a FIFO ticket from the node and wait for it to be 
fulfilled or to time out. The timeout is initially set to 30 seconds. On a 
timeout we fail the request with a special error code. Whenever a capacity 
update comes in, if there is capacity we wake the first ticket holder.
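
The routing step above boils down to a three-way choice. A sketch, with 
made-up helper names (only RequestSender itself is an existing class):

    /** Illustrative sketch of the three-way routing choice in stage 5. */
    final class RouteDecision {

        enum Action { SEND_NOW, TRY_NEXT_NODE, QUEUE }

        /**
         * @param freeCapacityBytes   spare capacity the chosen node has, by our count
         * @param nodeCapacityBytes   that node's total announced capacity
         * @param medianCapacityBytes median announced capacity of our peers (stage 3)
         * @param k                   threshold factor, e.g. 0.5
         */
        static Action decide(long freeCapacityBytes, long nodeCapacityBytes,
                             long medianCapacityBytes, double k) {
            if (freeCapacityBytes > 0) return Action.SEND_NOW;
            // No spare capacity: skip small nodes rather than queueing on them.
            if (nodeCapacityBytes < k * medianCapacityBytes) return Action.TRY_NEXT_NODE;
            // Otherwise grab a FIFO ticket and wait (30 second initial timeout).
            return Action.QUEUE;
        }
    }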

Remove the AIMD-based sender rate limiting. Locally originated requests can 
now be treated as coming from just another node. Depending on the user's 
configuration we might allow more than that, e.g. multiply the per-node limit 
for the local node by some factor. Implementation: we start a request from the 
client layer. Then, if there is space left on its per-node limit, or on the 
overall limit, we start another request; if there isn't, we wait. We will be 
woken up when a request finishes or a node announces a positive change to its 
capacity. Repeat forever.
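
A minimal sketch of such a request starter, assuming the local node's share of 
capacity is tracked as a number of slots (all names are made up):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.Semaphore;

    /** Illustrative sketch of a client-layer request starter that replaces
        the AIMD rate limiter. The semaphore's permits stand for the local
        node's share of capacity (its per-node limit, possibly multiplied by
        a configurable factor). */
    final class RequestStarter implements Runnable {

        private final Semaphore localSlots;
        private final BlockingQueue<Runnable> pendingRequests =
                new LinkedBlockingQueue<Runnable>();

        RequestStarter(int localSlotCount) {
            localSlots = new Semaphore(localSlotCount, true);
        }

        /** The client layer calls this to queue a request. */
        void submit(Runnable request) {
            pendingRequests.add(request);
        }

        /** Called when a request finishes, or when a peer announces a
            positive change to its capacity: frees a slot and wakes us. */
        void onSlotFreed() {
            localSlots.release();
        }

        public void run() {
            while (true) {
                try {
                    Runnable request = pendingRequests.take(); // wait for work
                    localSlots.acquire();                      // wait for a free slot
                    new Thread(request).start();
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

Submitting a request just enqueues it; onSlotFreed() (a completed request or a 
capacity increase) returns a slot and lets the next queued request start.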

6. Realtime vs bulk flag

Add an extra flag to requests. If the flag is set to BULK, we proceed as 
before, with queueing and a relatively long timeout. If the flag is set to 
REALTIME, we need an instant decision, even if it means an instant failure. So 
we try to route to the top nodes - a slightly more generous heuristic than 
with bulk - and if we can't, we fail as if we had timed out. Another way to 
implement this would be to have a dedicated quota within the per-node limit 
for realtime requests. A third option would be to allow realtime requests to 
be queued, but for a much shorter period, and to use stats to estimate whether 
that period will be exceeded and, if so, fail immediately.
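
A sketch of how the flag might drive the queueing policy under the first 
option (instant failure for realtime); the numbers are made up:

    /** Illustrative sketch of the realtime/bulk flag. */
    enum RequestPriority { BULK, REALTIME }

    final class QueuePolicy {

        /** How long a request may wait for capacity before we fail it. */
        static long maxQueueMillis(RequestPriority priority) {
            switch (priority) {
                case REALTIME: return 0;      // instant decision: no queueing
                case BULK:
                default:       return 30_000; // generous queueing for bulk
            }
        }

        /** How far down the routing candidates we are willing to go: realtime
            gets a slightly more generous heuristic than bulk. */
        static int maxRoutingCandidates(RequestPriority priority) {
            return priority == RequestPriority.REALTIME ? 5 : 3;
        }
    }

The second option (a dedicated realtime quota within the per-node limit) and 
the third (short queueing plus a stats-based prediction) would replace 
maxQueueMillis() with a per-priority slot count or a predicted-wait check, 
respectively.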