At the moment, having taken out the recent latency optimisation changes (which had resulted in a massive cut in bandwidth usage), latency is way up:
- Median CHK request time: 11.2 seconds.
- Mean: 22-23 seconds.
- 41-42% of requests take more than 15 seconds to complete.
- 22-23% of requests take more than 30 seconds to complete.
- 7-9% of requests take more than 60 seconds to complete.
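(For reference, given a file of per-request timings - one duration in milliseconds per line, extracted with the commands below - the figures above can be reproduced with something like the following. This is just a throwaway helper sketched for this mail, not anything in the tree:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TimingStats {
    public static void main(String[] args) throws Exception {
        // Read one timing (in milliseconds) per line from timings.list or the given file.
        List<Long> times = new ArrayList<Long>();
        BufferedReader in = new BufferedReader(new FileReader(args.length > 0 ? args[0] : "timings.list"));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            line = line.trim();
            if (line.length() > 0) times.add(Long.parseLong(line));
        }
        in.close();
        if (times.isEmpty()) { System.out.println("No samples"); return; }
        Collections.sort(times);
        long total = 0;
        for (long t : times) total += t;
        System.out.println("Samples: " + times.size());
        System.out.println("Median: " + times.get(times.size() / 2) + "ms");
        System.out.println("Mean: " + (total / times.size()) + "ms");
        // Fraction of requests over each threshold, as in the list above.
        for (long threshold : new long[] { 15000, 30000, 60000 }) {
            int over = 0;
            for (long t : times) if (t > threshold) over++;
            System.out.println("Over " + (threshold / 1000) + "s: " + (100.0 * over / times.size()) + "%");
        }
    }
}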
These figures are based on a sample of approximately 11 hours overnight, after the build became mandatory (so it may include some UOM traffic), and a sample of half an hour around midday. The two agree very closely. TheSeeker's node shows a 13 second median and a 27 second mean.

You can get similar results by setting the detailed log level to freenet.node.RequestSender:MINOR, then:

To follow the internally updated median/mean:
$ tail --follow=name --retry fast/logs-dark/freenet-latest.log | grep "Successful CHK request took"

To grep for the individual timings:
$ zgrep "Successful CHK request took" fast/logs-dark/freenet-1197-2009-01-14-0* fast/logs-dark/freenet-1197-2009-01-14-10-* | sed "s/^.*Successful CHK request took //" | sed "s/ average Median.*$//" > timings.list

Sort them and view them in less to get percentiles etc (use less's -N option to show line numbers):
$ cat timings.list | sort -n | less

Get the mean excluding outliers over some value (30 seconds here):
$ cat timings.list | (total=0; count=0; while read x; do if test $x -gt 30000; then echo Over 30 seconds: $x; else count=$((count+1)); total=$((total+x)); fi; done; echo Total is $total count is $count average is $(( $total / $count )))

Yesterday (1196, transfer backoff plus Saturday's throttling), these stats were a 4 second median and an 8 second mean; the 90th percentile was 15-17 seconds yesterday and is 50-57 seconds today. On Tuesday (1195, Saturday's throttling but not transfer backoff), it was more like a 3 second median, with a mean that fluctuated a lot due to occasional high values, settling around 13 seconds later on when there was more data. Of course there are time-of-day effects. :|

The main result of yesterday's testing (transfer backoff on transfers taking more than 15 seconds) was a vast amount of backoff, and even lower bandwidth usage than Tuesday, presumably because lots of nodes are affected by a single slow transfer. Users reported that less than half their backoff was due to transfer backoff; on the other hand it was over half for me for a while, though it fell as a proportion over the day.

We could cut the average CHK request time significantly, at the cost of a (somewhat smaller) proportion of requests failing at a given threshold and having to continue on the last hop only as a turtle request; when the transfer completes, we would offer the data to the nodes that have asked for it in the past.

Cutoff   % reduction in mean request time   % of requests turtled
15s      74-76%                             41-42%
30s      56-59%                             22-23%
45s      42-48%                             12-14%
60s      31-39%                             7-9%

Obviously whatever proportion of requests is turtled, the psuccess seen by fproxy is likely to be reduced by roughly that much. :| On the other hand it shouldn't affect queued requests much. IMHO the system is over-optimised for throughput at the moment.

The fact that the mean didn't decrease on Tuesday (although some users are seeing much higher figures than those quoted above, probably transiently) is probably due to outliers, perhaps related to the significant backoff resulting from the over-aggressive solutions I have tried so far.

With Saturday's limiting turned off, the main limiter on the number of requests a node accepts is output bandwidth liability limiting, which works on the principle of assuming that every request in flight will succeed, and working out how many requests can be accepted if they must all complete within 90 seconds. We could probably reduce this to 60 seconds without a significant adverse effect on bandwidth usage. Saturday's limiting works similarly, but uses the average number of bytes actually used per request, i.e. it takes the short-term psuccess into account (a rough sketch of both follows).
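To make the two limiters concrete, here is a very rough sketch of the acceptance checks as described above; the class and parameter names are invented for illustration, this is not the actual node code:

public class RequestAcceptanceSketch {

    /** Output bandwidth liability limiting: assume every request in flight
     *  will succeed (worst case for output bytes), and only accept a new
     *  request if all of them could still be sent within the liability window. */
    static boolean acceptByOutputLiability(double outputBytesPerSecond,
            int requestsInFlight, int bytesIfSuccessful /* ~32KB for a CHK */,
            int liabilitySeconds /* currently 90, proposed 60 */) {
        double worstCaseBytes = (double) (requestsInFlight + 1) * bytesIfSuccessful;
        return worstCaseBytes <= outputBytesPerSecond * liabilitySeconds;
    }

    /** Saturday's limiting: same idea, but use the recent average number of
     *  bytes actually sent per request (so the short-term psuccess is factored
     *  in), over a much shorter window (5s now, maybe ~20s if reinstated). */
    static boolean acceptByAverageUsage(double outputBytesPerSecond,
            int requestsInFlight, double averageBytesPerRequest,
            int windowSeconds) {
        double expectedBytes = (requestsInFlight + 1) * averageBytesPerRequest;
        return expectedBytes <= outputBytesPerSecond * windowSeconds;
    }
}

For example, ignoring overheads, at 16KB/sec output and 32KB per successful CHK, a 90 second liability window works out to roughly 45 requests in flight, and a 60 second window to roughly 30.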
Saturday's limiting has a much shorter threshold (5 seconds), and doesn't try to compensate for overheads. It might be interesting to reinstate it with a much higher threshold (20 seconds??). Hopefully the combination would make the above table more attractive: if the values in the last column could be halved, for example, without severely impacting bandwidth usage, the combination would be very attractive.

IMHO turtling support (or at least much stricter transfer timeouts) is necessary for attack resistance, and the current proposal (in a previous mail) incorporates the best part of Ian's transfer backoff without flooding the network with backoff.

A last resort would be a bulk vs realtime flag on requests. Bulk requests could be handled separately from real-time requests. Real-time requests would have a higher transfer priority, but would be limited to some proportion of the overall bandwidth usage, would only tolerate fast transfers, and in future might be routed more quickly / queued for less time (and therefore to a less ideal target). Bulk requests would be optimised primarily for psuccess and then for throughput, tolerating reasonably long transfer times (but not the 48 minutes theoretically possible now!). This has been suggested in the past; obviously it costs us some request indistinguishability, but maybe the time for it is soon. Anyway a proper proposal would need to be fleshed out (a very rough sketch follows). Arguably ULPRs obsolete bulk requests.
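For what it's worth, this is the sort of thing the flag might look like; every name and number below is invented purely for illustration:

public class RequestClassSketch {

    enum RequestClass { REALTIME, BULK }

    static class ClassPolicy {
        final double maxFractionOfOutputBandwidth; // cap on share of total output
        final int transferTimeoutSeconds;          // when to give up (or turtle)
        final int transferPriority;                // higher = sent first
        ClassPolicy(double maxFraction, int timeoutSeconds, int priority) {
            this.maxFractionOfOutputBandwidth = maxFraction;
            this.transferTimeoutSeconds = timeoutSeconds;
            this.transferPriority = priority;
        }
    }

    static ClassPolicy policyFor(RequestClass c) {
        switch (c) {
        case REALTIME:
            // Fast transfers only, higher priority, but capped share of bandwidth.
            return new ClassPolicy(0.3, 30, 2);
        case BULK:
        default:
            // Optimised for psuccess/throughput; tolerates long (but bounded) transfers.
            return new ClassPolicy(1.0, 300, 1);
        }
    }
}

Again, just illustrative; the real design questions (what share realtime gets, what timeouts are tolerable) are the ones above.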
