On Thursday 26 February 2009 11:53:53 Oskar Sandberg wrote:
> On Thu, Feb 26, 2009 at 11:28 AM, Matthew Toseland <
> [email protected]> wrote:
> 
> > 1 in 200 successful requests results in an insert.
> 
> We definitely need to have a list of all numbers like this (whose values are
> chosen more or less out of the blue).
> 
> // oskar

Okay, I'll give it a go ... I have BCC'ed to devl since it may be of interest 
to folk more widely. Really it should be on a wiki page...

Please let me know if there is a specific area I've missed out (there are 
probably many...) or that you need more detail on!

ROUTING A REQUEST:
First, we check for recent valid offers for the key we are after. If any node 
has recently offered the key we are interested in, we send a special request 
to the node. This has a short timeout for starting the transfer, to 
discourage nodes from advertising keys they don't have (nodes can in any case 
only advertise keys we have requested from them in the recent past).

Otherwise, we decrement the HTL, route the request, and wait for an answer. 
There is a 10 second timeout for an Accepted. We can get a RejectedLoop, in 
which case we move on to the next node, or a RejectedOverload, which can be 
local or non-local: if it is local, we back off that node and move on; if it 
is non-local, we just pass it on for load limiting purposes.

Once we have an Accepted, we can get a DataNotFound (terminal failure, we 
have run out of HTL), a DataFound (for SSKs this is 3 messages, for CHKs it 
results in a block transfer; either way it is terminal), a RouteNotFound 
(reroute), or a timeout. RecentlyFailed, a fatal failure carrying a specific 
timeout before which we shouldn't retry, may or may not be implemented in 
future as a throttling mechanism. We wait up to 2 minutes for one of these 
outcomes.

Once we start a block transfer for a CHK, it can fail, resulting in a long 
backoff (separate from the normal rejection backoff), and if the transfer 
takes more than 60 seconds we turn it into a "turtle" request, i.e. we fail 
the downstream transfer and receive the data just for ourselves (when/if we 
have fetched it we will offer the data to the downstream nodes). There are 
severe limits on turtles: 10 total, 2 per key, 3 per peer; however, this 
change appears to have greatly increased the maximum transfer rate achievable 
on fast nodes.
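
For illustration, the turtle limits might be enforced by something like the 
sketch below. This is not Freenet's actual code; the class and field names 
are made up, and only the limits (10 total, 2 per key, 3 per peer) come from 
the description above.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: decide whether a slow CHK transfer may be "turtled".
    public class TurtlePolicy {
        static final int MAX_TURTLES_TOTAL = 10;
        static final int MAX_TURTLES_PER_KEY = 2;
        static final int MAX_TURTLES_PER_PEER = 3;

        private int turtlesRunning = 0;
        private final Map<String, Integer> turtlesByKey = new HashMap<>();
        private final Map<String, Integer> turtlesByPeer = new HashMap<>();

        /** True if turning this transfer into a turtle would stay within the limits. */
        public synchronized boolean canTurtle(String key, String peer) {
            if (turtlesRunning >= MAX_TURTLES_TOTAL) return false;
            if (turtlesByKey.getOrDefault(key, 0) >= MAX_TURTLES_PER_KEY) return false;
            if (turtlesByPeer.getOrDefault(peer, 0) >= MAX_TURTLES_PER_PEER) return false;
            return true;
        }

        public synchronized void startTurtle(String key, String peer) {
            turtlesRunning++;
            turtlesByKey.merge(key, 1, Integer::sum);
            turtlesByPeer.merge(peer, 1, Integer::sum);
        }
    }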

One in every 200 successful requests results in an insert for the data just 
downloaded. These are queued on the request starters like normal inserts, but 
at maximum priority.
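
The decision itself is just a random draw; a minimal, purely illustrative 
sketch of the 1-in-200 probability:

    import java.util.Random;

    // Sketch: roughly 1 in every 200 successful requests triggers a re-insert
    // of the fetched block, queued like a normal insert but at maximum priority.
    public class RandomReinsert {
        static final int REINSERT_ONE_IN = 200;
        private final Random random = new Random();

        public boolean shouldReinsert() {
            return random.nextInt(REINSERT_ONE_IN) == 0;
        }
    }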

LOAD LIMITING:
We keep a global AIMD window, that is, a "window size" number which is 
increased by an increment when a request is successful, and multiplied by a 
decrease factor when a request is not successful, using a formula that should 
be compatible with TCP. Success includes DNF: only timeouts, 
RejectedOverloads (whether local or relayed) and transfer failures count as 
failures. For each type of request (CHK fetch, SSK fetch, CHK insert, SSK 
insert) we keep a round-trip time.
Currently the constants are:

    PACKET_DROP_DECREASE_MULTIPLE = 0.97f;
    PACKET_TRANSMIT_INCREMENT =
        (4 * (1 - (PACKET_DROP_DECREASE_MULTIPLE * PACKET_DROP_DECREASE_MULTIPLE))) / 3;

However, we divide PACKET_TRANSMIT_INCREMENT by the window size when adding 
it, since in TCP we would increase the window size only after a full window 
had been transmitted.
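
Putting the two constants together, the window update looks roughly like the 
sketch below. Only the two constants are quoted from the code; the class 
around them is illustrative.

    // Sketch of the AIMD window update using the constants quoted above.
    public class AIMDWindow {
        static final float PACKET_DROP_DECREASE_MULTIPLE = 0.97f;
        static final float PACKET_TRANSMIT_INCREMENT =
            (4 * (1 - (PACKET_DROP_DECREASE_MULTIPLE * PACKET_DROP_DECREASE_MULTIPLE))) / 3;

        private double windowSize = 1.0;

        // On success, add the increment divided by the current window size, so
        // the window grows by roughly one increment per full window (as in TCP).
        public synchronized void onSuccess() {
            windowSize += PACKET_TRANSMIT_INCREMENT / windowSize;
        }

        // On failure (timeout, RejectedOverload, transfer failure), shrink the
        // window multiplicatively.
        public synchronized void onFailure() {
            windowSize *= PACKET_DROP_DECREASE_MULTIPLE;
        }

        public synchronized double getWindowSize() {
            return windowSize;
        }
    }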

Hence we try to guess the capacity of the network, based on the feedback sent 
back about individual requests, and not send any more requests than it can 
handle. We use a similar system for managing bandwidth on individual 
connections.

LOAD LIMITING (ACCEPTING REQUESTS):
We use various heuristics to decide whether to accept a request (or insert):
- An arbitrary thread count limit.
- Average message round-trip time too high (hard cut-off at 1500ms, 
proportional rejection above 700ms); this is a good proxy for network/CPU 
load.
- Bandwidth limiting delay time over 2000-3000ms for data packets.
- Bandwidth liability limiting: track how many requests of each type are 
running (this may or may not distinguish between local and remote requests, 
depending on config), pretend they are all going to succeed, and work out 
whether sending the resulting data would take more than 90 seconds given the 
output bandwidth limit minus the current estimated per-second overhead. Same 
for input, more or less (we ignore overhead for input). This is in practice 
the main cause of rejecting requests (see the sketch after this list).
- Token buckets operating on the whole-request level based on the average 
number of bytes sent/received.
- Too many bytes queued to send to a specific node.
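
As promised above, here is a rough sketch of the bandwidth liability check. 
The 90 second figure is from the description; the parameter names and the 
per-request byte estimate are illustrative assumptions.

    // Sketch: would accepting one more request of this type commit us to more
    // output than we can send within 90 seconds?
    public class BandwidthLiability {
        static final int MAX_TRANSFER_SECONDS = 90;

        /**
         * @param runningRequests requests of this type currently running
         * @param bytesPerRequest expected bytes sent if a request succeeds
         * @param outputLimitBps  configured output bandwidth limit (bytes/sec)
         * @param overheadBps     current estimated per-second overhead (bytes/sec)
         */
        public static boolean canAccept(int runningRequests, int bytesPerRequest,
                                        int outputLimitBps, int overheadBps) {
            // Pretend every running request (plus this one) will succeed.
            long committedBytes = (long) (runningRequests + 1) * bytesPerRequest;
            long usableBps = Math.max(1, outputLimitBps - overheadBps);
            double secondsNeeded = (double) committedBytes / usableBps;
            return secondsNeeded <= MAX_TRANSFER_SECONDS;
        }
    }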

HTL:
HTL starts at 10. At startup we decide, separately for each connection and 
also for the case of no connection (locally originated requests), whether to 
decrement HTL at the minimum and at the maximum. At the minimum (1), the 
probability of decrementing to 0 and failing/completing the request is 25%; 
at the maximum (10), the probability of decrementing is 10%. The first time 
we route a request this decision is taken from the source node; after that it 
is taken from the node being routed to.
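
One way to read the above is as a pair of per-connection booleans fixed at 
startup; a sketch (the 25%/10% figures are from the text, the structure is an 
assumption):

    import java.util.Random;

    // Sketch of HTL handling: per-connection decisions, made once, control
    // whether we decrement at the maximum (10% probability) and at the minimum
    // (25% probability). Between max and min, HTL is always decremented.
    public class HTLPolicy {
        static final short MAX_HTL = 10;
        static final double DECREMENT_AT_MAX_PROB = 0.10;
        static final double DECREMENT_AT_MIN_PROB = 0.25;

        private final boolean decrementAtMax;
        private final boolean decrementAtMin;

        public HTLPolicy(Random random) {
            // Decided once per connection (or once for locally originated requests).
            decrementAtMax = random.nextDouble() < DECREMENT_AT_MAX_PROB;
            decrementAtMin = random.nextDouble() < DECREMENT_AT_MIN_PROB;
        }

        public short decrementHTL(short htl) {
            if (htl >= MAX_HTL) return decrementAtMax ? (short) (htl - 1) : htl;
            if (htl <= 1) return decrementAtMin ? (short) 0 : htl;
            return (short) (htl - 1);
        }
    }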

ROUTING:
We find the node with the closest location (or the closest neighbour 
location, i.e. FOAF routing; see the sketch after this list), subject to the 
following caveats:
- If we have already routed to it, if the request came from it, if it is not 
connected, or if its version is incompatible, don't route to it.
- If we are turtling this specific key from that node, don't route to it (see 
above re turtling).
- If we have more than 5 connected, compatible peers, and the percentage of 
requests routed to this node since it connected is over 30%, don't route to 
it.
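
The basic selection, ignoring backoff and failure-table timeouts for the 
moment, is a closest-match search over a circular keyspace. A sketch (the 
Peer interface here is made up for illustration):

    import java.util.List;

    // Sketch: pick the routable peer whose own location, or closest advertised
    // neighbour location (FOAF), is nearest to the target location.
    public class ClosestPeer {

        // Circular distance between two locations in [0.0, 1.0).
        static double distance(double a, double b) {
            double d = Math.abs(a - b);
            return Math.min(d, 1.0 - d);
        }

        public static Peer select(List<Peer> peers, double target) {
            Peer best = null;
            double bestDistance = Double.MAX_VALUE;
            for (Peer p : peers) {
                // isRoutable() stands in for the caveats above: connected,
                // compatible version, not already routed to, not the source,
                // not turtling this key, not over the 30% share limit.
                if (!p.isRoutable()) continue;
                double d = distance(p.location(), target);
                for (double neighbour : p.neighbourLocations())   // FOAF
                    d = Math.min(d, distance(neighbour, target));
                if (d < bestDistance) {
                    bestDistance = d;
                    best = p;
                }
            }
            return best;
        }

        // Minimal peer interface for this sketch.
        public interface Peer {
            double location();
            double[] neighbourLocations();
            boolean isRoutable();
        }
    }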

Various things are "advisory", i.e. they are taken into account but do not by 
themselves stop a request from being routed:
- Backoff: A node is backed off for a period after it rejects a request; this 
period is randomised and increases exponentially if no requests are accepted; 
a longer period is imposed for timeouts after a request has been accepted, 
and for transfer failures.
- Recent failures: After various kinds of failures we impose a timeout, 
before which we will try to avoid sending the same key to that node. This is 
part of the per-node failure tables.

Combining these (a sketch of the selection order follows the list):
- If there are nodes which are both not backed off and not timed out, we route 
to whichever of those nodes is closest to the target location. If we are 
still here, all nodes are either backed off or timed out.
- If there are nodes which are timed out but not backed off, choose the node 
whose timeout expires soonest. Hence if a single key is requested 
continually, we round-robin between nodes. If we still don't have a winner, 
we know all nodes are backed off.
- If there are nodes which are backed off but not timed out, choose the node 
which is closest to the target but is not backed off. If we still don't have 
a winner, all nodes are backed off AND timed out.
- Choose the backed off node whose timeout expires soonest.
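
A sketch of that selection order (the Peer interface and helper names are 
illustrative; the four-step fallback is as described in the list above):

    import java.util.List;
    import java.util.function.Predicate;
    import java.util.function.ToLongFunction;

    // Sketch of the fallback order combining backoff and recent-failure
    // timeouts: neither backed off nor timed out (closest wins); timed out
    // only (soonest timeout expiry wins); backed off only (closest wins);
    // both (soonest backoff expiry wins).
    public class RouteChooser {

        public interface Peer {
            double distanceTo(double target);   // circular keyspace distance
            boolean backedOff(long now);
            boolean timedOut(long now);         // recently failed for this key
            long timeoutExpiry();               // when the recent-failure timeout ends
            long backoffExpiry();               // when the backoff ends
        }

        public static Peer choose(List<Peer> peers, double target, long now) {
            Peer p;
            if ((p = closest(peers, target, x -> !x.backedOff(now) && !x.timedOut(now))) != null) return p;
            if ((p = earliest(peers, Peer::timeoutExpiry, x -> x.timedOut(now) && !x.backedOff(now))) != null) return p;
            if ((p = closest(peers, target, x -> x.backedOff(now) && !x.timedOut(now))) != null) return p;
            return earliest(peers, Peer::backoffExpiry, x -> x.backedOff(now) && x.timedOut(now));
        }

        private static Peer closest(List<Peer> peers, double target, Predicate<Peer> filter) {
            Peer best = null;
            for (Peer peer : peers)
                if (filter.test(peer) && (best == null || peer.distanceTo(target) < best.distanceTo(target)))
                    best = peer;
            return best;
        }

        private static Peer earliest(List<Peer> peers, ToLongFunction<Peer> expiry, Predicate<Peer> filter) {
            Peer best = null;
            for (Peer peer : peers)
                if (filter.test(peer) && (best == null || expiry.applyAsLong(peer) < expiry.applyAsLong(best)))
                    best = peer;
            return best;
        }
    }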

BACKOFF
We maintain two backoff periods, the main backoff and the transfer backoff. 
The initial values for these backoffs are 1 second and 1 minute respectively. 
When a request is rejected by a node that is not already backed off, we 
double the main backoff period, and set that node's backoff status for 
random * the backoff period (like Ethernet). When a request completes without 
being rejected, timing out, etc., we reset the main backoff period. The 
transfer backoff period is similarly set when we have a transfer failure and 
reset when a transfer succeeds. Quick transfer failures are treated as DNFs, 
i.e. they result in a recently-failed timeout of 10 minutes, but don't cause 
a transfer backoff. A transfer failure is quick if the sender aborted it 
(rather than us timing out), and only for certain failure reasons.
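
A sketch of the main (rejection) backoff described above; the transfer 
backoff works the same way with a 1 minute initial value. The structure is 
illustrative; the 1 second initial value, the doubling and the random length 
are from the text.

    import java.util.Random;

    // Sketch of the main backoff: on a rejection while not already backed off,
    // double the backoff period and back the peer off for a random fraction of
    // it (as on Ethernet); on a clean completion, reset the period.
    public class PeerBackoff {
        static final long INITIAL_BACKOFF_MS = 1000;               // main backoff
        static final long INITIAL_TRANSFER_BACKOFF_MS = 60 * 1000; // transfer backoff

        private final Random random = new Random();
        private long backoffPeriodMs = INITIAL_BACKOFF_MS;
        private long backedOffUntil = 0;

        public synchronized boolean isBackedOff(long now) {
            return now < backedOffUntil;
        }

        public synchronized void onRejected(long now) {
            if (isBackedOff(now)) return;          // already backed off: no change
            backoffPeriodMs *= 2;                  // exponential increase
            backedOffUntil = now + (long) (random.nextDouble() * backoffPeriodMs);
        }

        public synchronized void onCleanCompletion() {
            backoffPeriodMs = INITIAL_BACKOFF_MS;  // reset on success
        }
    }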

OFFERS/ULPRS/PER-NODE FAILURE TABLES
The failure table records, for 1 hour, which nodes have requested which key 
and which nodes we have routed it to, as well as any recent-failure-related 
timeout. Hence if we obtain a key which was recently requested, we offer it 
to all the nodes which asked for it or which we routed the request to. If we 
get an offer, we ignore it if we have the key in the datastore, or if we 
didn't recently ask for it. We accept the offer if we want the key, or if 
somebody has asked us for it and it is a CHK (since SSKs can be forged). We 
queue a request for the key at a priority depending on which client-level 
request wants it and/or whether our peers want it. The objective is to 
quickly propagate popular data once it has been inserted, for example forum 
posts, which usually rely on polling SSKs.
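
A sketch of what a per-key failure table entry might hold (the 1 hour 
retention and the offer behaviour are from the description; the structure and 
names are assumptions):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: one failure-table entry per key, kept for about an hour. It
    // records who asked us for the key, whom we routed the request to, and any
    // recent-failure timeout; once we obtain the key we offer it to both sets.
    public class FailureTableEntry {
        static final long RETENTION_MS = 60 * 60 * 1000;   // 1 hour

        final long createdAt;
        final Set<String> requestedBy = new HashSet<>();   // peers that asked us
        final Set<String> routedTo = new HashSet<>();      // peers we asked
        long recentFailureTimeoutUntil = 0;                // don't re-request before this

        FailureTableEntry(long now) {
            createdAt = now;
        }

        boolean expired(long now) {
            return now - createdAt > RETENTION_MS;
        }

        // Peers to offer the key to once we obtain it.
        Set<String> peersToOffer() {
            Set<String> all = new HashSet<>(requestedBy);
            all.addAll(routedTo);
            return all;
        }
    }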

STORAGE:
There are 6 stores: a store and a cache for each of SSK, CHK and public key 
(for SSKs; we can avoid sending this over the network in many cases). Every 
block fetched is stored in the relevant cache. When an insert completes, it 
will be stored only if no peer has a location closer to the key's location 
than this node does. But this calculation ignores peers with a reported 
uptime lower than 40% (this is simply an estimate computed by the node and 
sent to us on connection), and ignores backed off nodes unless all nodes are 
backed off. It does not take into account recently-failed status etc. 
Arguably we should send uptime messages more often than just on connection: 
the current scheme disadvantages high-uptime nodes that connected when they 
were first installed. In practice the store fills up *much* more slowly than 
the cache.
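
A sketch of the should-we-store decision for a completed insert (the 40% 
uptime threshold and the backed-off exception are from the text; the Peer 
interface is made up for illustration):

    import java.util.List;

    // Sketch: store a just-inserted block in the store (not just the cache)
    // only if no suitable peer is closer to the key's location than we are.
    // Peers reporting uptime below 40% are ignored, and backed-off peers are
    // ignored unless every peer is backed off.
    public class StoreDecision {

        public interface Peer {
            double location();
            double uptimeFraction();    // estimate the peer sent us on connection
            boolean backedOff();
        }

        static double distance(double a, double b) {
            double d = Math.abs(a - b);
            return Math.min(d, 1.0 - d);
        }

        public static boolean shouldStore(double myLocation, double keyLocation,
                                          List<Peer> peers) {
            boolean allBackedOff = peers.stream().allMatch(Peer::backedOff);
            double myDistance = distance(myLocation, keyLocation);
            for (Peer p : peers) {
                if (p.uptimeFraction() < 0.40) continue;
                if (p.backedOff() && !allBackedOff) continue;
                if (distance(p.location(), keyLocation) < myDistance) return false;
            }
            return true;
        }
    }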

CLIENT LAYER:
We have four request starters, one for each type of request, which wait a 
period of time dependent on load limiting, and then start the best request, 
subject to scheduling criteria: priority, retry count (if over 3), round 
robin over high-level request contexts, and round robin over high-level 
requests within a context. After we have requested the same key 3 times, we 
move it (and all other requests for the same key) onto the "cooldown queue", 
and don't send any more (local) requests for that key for 30 minutes. 
However, whenever we get a key, by any means, we check all requests for 
whether they want that key, and will feed the block to any request that does, 
even if it is on the cooldown queue or has a hopelessly low priority. 
Further, we never send two local requests for the same key, nor two inserts 
of the same block number for the same insert.
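
A sketch of the retry/cooldown behaviour (the 3 tries and the 30 minute 
cooldown are from the text; the structure is illustrative):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: after a key has been tried 3 times without success, put it on
    // the cooldown queue and don't start another local request for it for
    // 30 minutes. A request on the cooldown queue can still be completed if
    // the block turns up by any other means.
    public class CooldownQueue {
        static final int MAX_TRIES_BEFORE_COOLDOWN = 3;
        static final long COOLDOWN_MS = 30 * 60 * 1000;

        private final Map<String, Integer> tries = new HashMap<>();
        private final Map<String, Long> cooldownUntil = new HashMap<>();

        /** May we start another local request for this key right now? */
        public synchronized boolean canRequest(String key, long now) {
            return now >= cooldownUntil.getOrDefault(key, 0L);
        }

        /** Record a failed attempt; start the cooldown after the third try. */
        public synchronized void onFailedAttempt(String key, long now) {
            int count = tries.merge(key, 1, Integer::sum);
            if (count >= MAX_TRIES_BEFORE_COOLDOWN) {
                cooldownUntil.put(key, now + COOLDOWN_MS);
                tries.put(key, 0);
            }
        }
    }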
