[freenet-dev] A brief analysis of request htl distributions

Evan Daniel Thu, 03 Sep 2009 21:48:54 -0700

After some discussions with Matthew on IRC, I've started making an
attempt to gather meaningful performance statistics on the live
opennet.  After some consideration of the minimal set of useful
stats to gather, we decided on histogram data for accepted incoming
requests (i.e. remotely originated only), grouped by hops to live
(htl).  Toad graciously added some stats collection, which you can see
on the stats page in the latest testing build.  Displayed is the count
of incoming requests for each htl value, along with the counts of
requests that succeeded locally and succeeded remotely (meaning the
request was forwarded onward and then succeeded downstream).  This
data is presented for both CHK and SSK requests.


A sample line:
18      57.711% (690,818,2613)  2.678% (50,69,4443)

In order, that's the htl for this line, the CHK overall success rate,
then (in parentheses) the local success count, remote success count,
and total count, followed by the same stats for SSK requests.  The
success rate is (local + remote) / total.  Note that the data does not
include requests we received and rejected due to e.g. overload or loops.
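
As a rough illustration of the line format described above, here is a
minimal Python sketch that parses the sample line and recomputes the
success rates from the counts.  The names (LINE_RE, parse_line,
success_pct) are made up for this example and are not part of Freenet;
the regular expression assumes the parenthesised layout of the sample
line.

```python
import re

# Hypothetical helper, not part of Freenet: parse one stats-page line of the
# form "htl  rate% (local,remote,total)  rate% (local,remote,total)".
LINE_RE = re.compile(
    r"^\s*(\d+)\s+([\d.]+)%\s+\((\d+),(\d+),(\d+)\)"
    r"\s+([\d.]+)%\s+\((\d+),(\d+),(\d+)\)\s*$"
)


def parse_line(line):
    """Return the htl plus the CHK and SSK figures from one stats line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognised stats line: %r" % line)
    g = m.groups()

    def block(rate, local, remote, total):
        return {"rate_pct": float(rate), "local": int(local),
                "remote": int(remote), "total": int(total)}

    return {"htl": int(g[0]), "chk": block(*g[1:5]), "ssk": block(*g[5:9])}


def success_pct(b):
    """Recompute the success rate as (local + remote) / total, in percent."""
    if b["total"] == 0:
        return 0.0
    return 100.0 * (b["local"] + b["remote"]) / b["total"]


row = parse_line("18      57.711% (690,818,2613)  2.678% (50,69,4443)")
```

Recomputing success_pct(row["chk"]) and success_pct(row["ssk"]) should
agree with the printed percentages to within rounding.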

My hope is that, in the medium term, I can develop statistical methods
to meaningfully evaluate Freenet performance in the real world, rather
than merely in simulation.  A number of significant changes have been
made, and more are planned, that should have an impact on performance
(routing, data retention, etc).  However, we have not applied a
scientific approach to evaluating the impact of these changes, largely
due to concerns that most data that is easy to gather is horribly
noisy, and therefore difficult to draw conclusions from.  Short term,
I hope to learn more about Freenet's operation as a network with
emergent properties; in the medium term, I hope to be able to evaluate
the performance impact of changes to routing, caching, and network
topology.

(An aside before I get to the real analysis: CHKs have a success rate
of over 50% at high htl.  This is actually fairly good, considering
that I suspect the data is heavily skewed by re-requests for queued
data that will take many tries to find on average.  That is, I suspect
the success rate on first requests for CHKs is significantly higher
than the data can convey.  SSKs have an abysmal success rate, but the
vast majority of SSK successes occur on high-htl requests.  From this
I conclude that SSKs are actually quite reliable, but that the
majority of requests for them are for things like Frost or FMS
messages that have not yet been inserted.)

This data is awkward to work with for a variety of reasons; I think
there is actually quite a lot I can do with it, but teasing it apart
will take some care.  For example, the probabilistic htl causes some
weird effects.  The number of requests that traverse at least two
nodes should be strictly less than the number that traverse at least
one node (on average; we get a random sample and so might not see this
always).  This is not reflected directly in the data because of
probabilistic htl: requests spend on average two hops at htl=18, but
always spend only one hop at htl=17.  htl=1 exhibits a similar
behavior.  One would also expect to see higher global success rates at
htl=18 than htl=17, both because more nodes still remain to search,
and because more of the requests remaining at htl=17 are "hard"
requests.  However, observer bias muddies the data: a request that
succeeds at the first hop will be observed by only one node (at
htl=18).  A failing request, though, will be observed by an average of
two nodes at htl=18.  So observer bias means that the observed global
htl=18 success rate will be lower than the actual rate.
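
To put a rough number on the observer bias, here is a deliberately
simplified Python sketch.  It assumes only two kinds of request: those
that succeed at the first hop (seen once at htl=18) and those that fail
(seen at htl=18 by the mean number of hops spent there).  Requests that
succeed at a later htl=18 hop are ignored, and the decrement
probability of 0.5 is an assumption chosen to match the "two hops on
average" figure above.  The function names are invented for this
example.

```python
def mean_hops_at_max_htl(decrement_prob=0.5):
    """Expected hops spent at the maximum htl if each hop decrements the
    htl with probability decrement_prob (geometric distribution)."""
    return 1.0 / decrement_prob


def observed_rate(first_hop_success, decrement_prob=0.5):
    """Success rate seen by nodes at htl=18 under the toy two-class model.

    first_hop_success is the true fraction of requests that succeed at
    the first hop; every other request is treated as failing at htl=18.
    """
    failing_observations = mean_hops_at_max_htl(decrement_prob)
    successes_seen = first_hop_success * 1.0
    observations = successes_seen + (1.0 - first_hop_success) * failing_observations
    return successes_seen / observations


# A true first-hop success rate of 50% is observed as roughly 33%.
example = observed_rate(0.5)
```

Under these assumptions the observed rate is p / (2 - p) for a true
rate p, so the bias is largest, in relative terms, at low success rates.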

Before I attempted to draw any conclusions about success rates, I
decided to examine the simple histogram of total requests vs htl.  I
collected three sets of data, all from my node.  (Complete raw data
can be found at the end of this email.)  I then performed a simple
chi-squared test to check whether the distributions match; they don't.
I can't actually give a p-value, as my spreadsheet exhibits an
underflow.  The result was a chi-squared statistic of 1926.5 with 34
degrees of freedom (3 samples and 18 htl values: (3-1)*(18-1) = 34).
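
For anyone who wants to redo the test outside a spreadsheet, here is a
minimal pure-Python sketch of a chi-squared test of homogeneity.  The
counts are the CHK total-request columns from the raw data at the end
of this email; treating that column as the input to the figure quoted
above is an assumption, so the statistic this prints may not match it
exactly.  The function name is invented for this example.

```python
# CHK total-request counts for htl = 18 down to 1, copied from the raw data
# at the end of this email (one list per sample).  Using the CHK totals is
# an assumption about which column the test was run on.
SAMPLES = [
    [4924, 4379, 3129, 2469, 2414, 2493, 2324, 1983, 1827,
     1678, 1582, 1512, 1387, 1412, 1480, 1441, 1400, 5731],
    [4534, 3166, 3506, 3553, 3528, 3557, 3538, 3161, 2937,
     2674, 2444, 2307, 2183, 2172, 2093, 2149, 2065, 7573],
    [2613, 2999, 3144, 2780, 2731, 2704, 2580, 2392, 2163,
     1978, 1899, 1768, 1569, 1572, 1520, 1492, 1402, 5079],
]


def chi_squared_homogeneity(samples):
    """Return (statistic, degrees of freedom) for a table of counts with
    one row per sample and one column per htl value."""
    n_cols = len(samples[0])
    row_totals = [sum(s) for s in samples]
    col_totals = [sum(s[j] for s in samples) for j in range(n_cols)]
    grand_total = float(sum(row_totals))
    stat = 0.0
    for i, sample in enumerate(samples):
        for j, observed in enumerate(sample):
            # Expected count if every sample shared one htl distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(samples) - 1) * (n_cols - 1)
    return stat, dof


stat, dof = chi_squared_homogeneity(SAMPLES)
print("chi-squared = %.1f with %d degrees of freedom" % (stat, dof))
```

Dropping the first two entries of each list (htl=18 and 17) before
calling the function gives the reduced table, with 30 degrees of
freedom.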

Incoming request htl distribution varies across the samples.
Plausible causes include varied time of day, varied local node usage
(the data are for remote requests, but it might have an indirect
impact), and varied local network conditions.  By far the largest
variation between samples (as measured by contribution to the test
statistic) comes from the htl=18 and 17 data.  In what would normally
be very bad statistical practice, I tried removing those rows from the
data.  At this point, statistical significance was at merely
astronomical levels: a p-value of 1.31E-48 was obtained.  Given this
extreme a p-value, I am confident that corrections for performing
multiple tests on the data, and peeking at the data while performing
those tests, still leave a result that is highly significant.  (I have
not actually performed said corrections.)
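
As a back-of-the-envelope check on that claim, here is a small Python
sketch of a Bonferroni correction, which is one simple and conservative
way to adjust for multiple tests (not necessarily the right one for
this data).  The count of 1000 tests is a made-up, deliberately
generous number used only for illustration.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests and
    cap the result at 1."""
    return min(1.0, p_value * n_tests)


reported_p = 1.31e-48
# 1000 is a hypothetical count, far more tests than were actually run.
adjusted = bonferroni(reported_p, 1000)
print("adjusted p-value: %.3g" % adjusted)
```

Even with that many hypothetical tests the adjusted value stays many
orders of magnitude below any conventional significance threshold.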

From this analysis, I conclude that gathering data that is
statistically useful and free of confounding factors will take some
effort.  I think the appropriate collection technique is to gather the
same basic data, but to group the samples into hourly sampling
intervals, and collect data across several nodes.  This would help
control for time-of-day effects and give some idea of how much
node-to-node variability exists.  In order to control for varied local usage
patterns and their effects, I think the number of local requests
originated during each hour should also be recorded, along with the
number of external requests rejected (both CHK and SSK for each of
those).
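
To make the proposal concrete, here is one possible shape for an hourly
record, sketched in Python.  Every class and field name here is
hypothetical; nothing like this exists in the node yet, and the
timestamp format is just one option.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HtlCounts:
    """Counts for one key type at one htl value."""
    local_success: int = 0
    remote_success: int = 0
    total: int = 0


@dataclass
class HourlySample:
    """One node's stats for one hourly sampling interval (hypothetical)."""
    node_id: str
    hour_start: str  # e.g. "2009-09-03T22:00"; timezone is an open choice
    chk_by_htl: dict = field(default_factory=dict)  # htl -> HtlCounts
    ssk_by_htl: dict = field(default_factory=dict)  # htl -> HtlCounts
    local_chk_requests: int = 0     # locally originated, to control for usage
    local_ssk_requests: int = 0
    rejected_chk_requests: int = 0  # external requests we rejected
    rejected_ssk_requests: int = 0


def chk_totals_by_hour_of_day(samples):
    """Sum CHK request totals per (hour of day, htl) across nodes and days."""
    out = defaultdict(int)
    for s in samples:
        hour = int(s.hour_start[11:13])
        for htl, counts in s.chk_by_htl.items():
            out[(hour, htl)] += counts.total
    return dict(out)
```

Grouping by hour of day, as in chk_totals_by_hour_of_day, is one way to
look for time-of-day effects; grouping by node_id instead would show
node-to-node variability.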

Comments on my proposed avenues for investigation would be much
appreciated, as would volunteers to collect data.  I think I need data
from a minimum of 5 nodes in order to confirm that there are not
drastic local effects, though more might be nice after I've done an
initial analysis.

Evan Daniel

Raw data:
Date and time given are when the data was collected (end of sampling
interval), and are Eastern US times.  Each sampling window is from
node start to data collection; the samples do not overlap.

20090902 23:55
# nodeUptimeSession: 7h33m

HTL     CHKs (success rate, local, remote, total)       SSKs (success rate, local, remote, total)
18      58.225% 807 2060 4924   5.282% 97 130 4298
17      35.396% 848 702 4379    3.208% 61 79 4364
16      39.597% 552 687 3129    1.723% 29 28 3308
15      33.293% 413 409 2469    0.997% 14 15 2909
14      26.968% 263 388 2414    0.453% 9 3 2649
13      20.939% 206 316 2493    0.334% 4 5 2696
12      18.675% 139 295 2324    0.153% 2 2 2607
11      15.734% 121 191 1983    0.075% 0 2 2666
10      12.698% 93 139 1827     0.039% 1 0 2557
9       13.707% 101 129 1678    0.039% 0 1 2537
8       12.389% 91 105 1582     0.118% 2 1 2547
7       13.690% 119 88 1512     0.040% 1 0 2517
6       10.382% 91 53 1387      0.000% 0 0 2568
5       10.765% 118 34 1412     0.079% 1 1 2531
4       9.189% 97 39 1480       0.040% 1 0 2530
3       8.952% 102 27 1441      0.000% 0 0 2484
2       9.714% 115 21 1400      0.000% 0 0 2454
1       7.451% 369 58 5731      0.000% 0 0 10136
0       0.000% 0 0 0    0.000% 0 0 0


20090903 10:39
# nodeUptimeSession: 10h43m

HTL     CHKs (success rate, local, remote, total)       SSKs (success rate, local, remote, total)
18      42.236% 899 1016 4534   3.260% 103 122 6901
17      54.485% 887 838 3166    2.961% 98 106 6889
16      40.702% 730 697 3506    1.332% 33 23 4203
15      34.956% 555 687 3553    0.995% 21 15 3619
14      26.446% 345 588 3528    0.628% 9 13 3505
13      20.607% 240 493 3557    0.248% 7 2 3630
12      17.581% 199 423 3538    0.225% 6 2 3551
11      15.723% 145 352 3161    0.087% 2 1 3453
10      13.483% 143 253 2937    0.087% 0 3 3440
9       12.229% 130 197 2674    0.058% 1 1 3434
8       11.579% 122 161 2444    0.030% 1 0 3295
7       10.316% 137 101 2307    0.092% 3 0 3253
6       10.582% 121 110 2183    0.062% 2 0 3205
5       8.932% 117 77 2172      0.000% 0 0 3132
4       6.450% 87 48 2093       0.032% 0 1 3151
3       6.887% 105 43 2149      0.000% 0 0 3232
2       7.651% 118 40 2065      0.000% 0 0 3183
1       6.048% 395 63 7573      0.024% 3 0 12536
0       0.000% 0 0 0    0.000% 0 0 0


20090903 23:26
# nodeUptimeSession: 9h21m

HTL     CHKs (success rate, local, remote, total)       SSKs (success rate, local, remote, total)
18      57.711% 690 818 2613    2.678% 50 69 4443
17      51.751% 578 974 2999    4.397% 85 154 5436
16      45.261% 586 837 3144    2.360% 40 45 3601
15      37.086% 482 549 2780    1.490% 24 18 2819
14      29.037% 401 392 2731    0.793% 13 8 2649
13      22.300% 261 342 2704    0.646% 8 9 2630
12      18.798% 202 283 2580    0.263% 3 4 2664
11      16.346% 173 218 2392    0.155% 2 2 2579
10      14.424% 120 192 2163    0.160% 1 3 2502
9       13.246% 108 154 1978    0.041% 0 1 2430
8       10.585% 91 110 1899     0.126% 2 1 2390
7       9.163% 74 88 1768       0.089% 0 2 2246
6       9.879% 85 70 1569       0.042% 1 0 2363
5       7.888% 76 48 1572       0.000% 0 0 2283
4       8.026% 88 34 1520       0.043% 1 0 2319
3       7.708% 78 37 1492       0.000% 0 0 2205
2       5.706% 59 21 1402       0.044% 1 0 2283
1       5.158% 215 47 5079      0.000% 0 0 8713
0       0.000% 0 0 0    0.000% 0 0 0
