At the moment, having taken out the recent latency optimisation changes
(which resulted in a massive cut in bandwidth usage), latency is way up:
- Median CHK request time 11.2 seconds.
- Mean 22-23 seconds.
- 41-42% of requests take more than 15 seconds to complete.
- 22-23% of requests take more than 30 seconds to complete.
- 7-9% of requests take more than 60 seconds to complete.

These figures are based on a sample of approx 11 hours overnight, after it 
became mandatory (may include some UOM), and a sample of half an hour around 
12ish. The two agree very closely. TheSeeker's node shows a 13 second median 
and a 27 second mean. You can get similar results by setting log level 
details to freenet.node.RequestSender:MINOR, then:

Just follow the internally updated median/mean:
$ tail --follow=name --retry fast/logs-dark/freenet-latest.log \
    | grep "Successful CHK request took"

Grep for individual timings:
$ zgrep "Successful CHK request took" \
    fast/logs-dark/freenet-1197-2009-01-14-0* \
    fast/logs-dark/freenet-1197-2009-01-14-10-* \
    | sed "s/^.*Successful CHK request took //" \
    | sed "s/ average Median.*$//" > timings.list

Sort them and view them in less to get percentiles etc:
$ cat timings.list | sort -n | less
(Use the -N option to show line numbers)
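
If you just want a particular percentile (e.g. the 90th, quoted below), a
one-liner along these lines saves counting lines by hand (a sketch, assuming
timings.list holds one per-request time in milliseconds per line):
$ sort -n timings.list | \
    awk '{ a[NR] = $1 } END { print a[int(NR * 0.9)] / 1000 " seconds" }'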

Get mean excluding outliers over some value:
$ cat timings.list | ( total=0; count=0
    while read x; do
      # timings are in milliseconds, so 30000 = 30 seconds
      if test "$x" -gt 30000; then echo "Over 30 seconds: $x"
      else count=$((count+1)); total=$((total+x)); fi
    done
    echo "Total is $total count is $count average is $((total / count))" )
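
Or do it in one pass with awk (a sketch; it assumes timings.list holds one
per-request time in milliseconds per line, as produced by the zgrep above),
printing the mean, median and the tail fractions quoted at the top:
$ sort -n timings.list | awk '
    { t[NR] = $1; sum += $1 }        # input is already sorted, so t[] is sorted
    $1 > 15000 { over15++ }
    $1 > 30000 { over30++ }
    $1 > 60000 { over60++ }
    END {
      printf "mean %.1fs  median %.1fs\n", sum/NR/1000, t[int((NR+1)/2)]/1000
      printf ">15s %.1f%%  >30s %.1f%%  >60s %.1f%%\n", \
             100*over15/NR, 100*over30/NR, 100*over60/NR
    }'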


Yesterday (1196, transfer backoff and Saturday's throttling), these stats were 
a 4 second median and 8 second mean. The 90th percentile was 15-17 seconds 
yesterday and is 50-57 seconds today.

However on Tuesday (1195, Saturday's throttling but not transfer backoff), it
was more like a 3 second median, with a mean that fluctuated a lot due to
occasional high values and was around 13 seconds later on when there was more
data. Of course there are time of day effects. :|

The main result of yesterday's testing (transfer backoff on transfers taking
more than 15 seconds) was a vast amount of backoff, and even lower bandwidth
usage than Tuesday, presumably because lots of nodes are affected by a single
slow transfer. Users reported that less than half their backoff was due to
transfer backoff; OTOH it was over half for me for a while, but it reduced as
a proportion over the day.

We could cut the average CHK request time significantly, at the cost of a
somewhat smaller proportion of requests timing out at a given threshold and
having to continue on the last hop only as a turtle request; when such a
transfer completes, we would offer the data to the nodes that have asked for
it in the past.

Cutoff   % reduction in mean request time   % of requests turtled
15s      74-76%                             41-42%
30s      56-59%                             22-23%
45s      42-48%                             12-14%
60s      31-39%                              7-9%
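
For reference, the last column is (I believe) just the fraction of sampled
requests slower than the cutoff, and the first column can be approximated by
dropping those requests from the mean. A rough sketch over timings.list
(cutoff in milliseconds; the exact derivation used above may differ):
$ cutoff=30000    # e.g. the 30s row
$ awk -v c=$cutoff '
    { sum += $1; n++ }
    $1 <= c { fsum += $1; fn++ }     # requests finishing within the cutoff
    END {
      printf "turtled: %.1f%%\n", 100 * (n - fn) / n
      printf "mean: %.1fs -> %.1fs (%.0f%% reduction)\n", \
             sum/n/1000, fsum/fn/1000, 100 * (1 - (fsum/fn) / (sum/n))
    }' timings.list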

Obviously whatever proportion of requests are turtled, the fproxy psuccess is 
likely to be reduced by that much. :| OTOH it shouldn't affect queued 
requests much.

IMHO the system is over-optimised for throughput at the moment. The fact that
the mean didn't decrease on Tuesday (although some users are seeing much
higher figures than those quoted above, probably transiently) is probably due
to outliers, perhaps related to the significant backoff resulting from the
over-aggressive solutions I have tried so far.

With Saturday's limiting turned off, the main limiter on the number of
requests a node accepts is output bandwidth liability limiting, which works
on the principle of assuming that every request in flight will succeed, and
working out how many can be accepted if they must all complete within 90
seconds (see the sketch below). We could probably reduce this to 60 seconds
without a significant adverse effect on bandwidth usage. Saturday's limiting
works similarly but uses the average bytes actually used per request, i.e. it
takes the short-term psuccess into account. It has a much shorter threshold
(5 seconds) and doesn't try to compensate for overheads. It might be
interesting to reinstate it with a much higher threshold (20 seconds??).

Hopefully the combination would make the above table more attractive: if the
last column's values could be halved, for example, without severely impacting
bandwidth usage, the combination would be well worth having. IMHO turtling
support (or at least much stricter transfer timeouts) is necessary for
reasons of attack resistance; and the current proposal (in a previous mail)
incorporates the best part of Ian's transfer backoff without flooding the
network with backoff.
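
To put numbers on the liability calculation (illustrative only: the 16KiB/s
output limit and the 32KiB per-CHK payload are assumptions for the example,
not the real configuration):
$ output_limit=16384    # assumed output bandwidth limit, bytes/sec
$ chk_size=32768        # assumed bytes sent per successful CHK request
$ echo "90s window: $(( output_limit * 90 / chk_size )) requests accepted at once"
90s window: 45 requests accepted at once
$ echo "60s window: $(( output_limit * 60 / chk_size )) requests accepted at once"
60s window: 30 requests accepted at once
With these (assumed) numbers, tightening the window from 90 to 60 seconds cuts
the number of requests a node will accept at once by a third.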

A last resort would be a bulk vs realtime flag on requests. Bulk requests 
could be handled separately from real-time requests. Real-time requests would 
have a higher transfer priority, but would be limited to some proportion of 
the overall bandwidth usage, would only tolerate fast transfers, and in 
future might be routed more quickly / queued for less time (and therefore to 
a less ideal target). Bulk requests would be optimised for psuccess primarily 
and then for throughput, tolerating reasonably long transfer times (but not 
the 48 minutes theoretically possible now!). This has been suggested in the
past; obviously it costs us some request indistinguishability, but maybe the
time for it is coming soon. Anyway, a proper proposal would need to be fleshed
out.
Arguably ULPRs obsolete bulk requests.
