On Tue, May 4, 2010 at 9:31 AM, Matthew Toseland
<toad at amphibian.dyndns.org> wrote:
> On Thursday 18 February 2010 20:44:57 Evan Daniel wrote:
>> I've followed up my previous crude estimates of node churn with some
>> more detailed numbers. (See my mail in re: "data persistence again"
>> on 20100122 for previous version and more detailed explanation.)
>>
>> Again, some brief caveats: the following basically assumes that all
>> samples are independent. This is quite incorrect, because of time of
>> day effects. Nonetheless, I think it's useful. Many of the obvious
>> uses for this data ("If an insert is stored on 3 nodes, how likely is
>> it one of them will be online later?") are strongly impacted by this.
>> Use appropriate caution in analysis. Also, I have a few missing
>> samples; for each sample, I looked at the previous set of 24 samples
>> that I did have, whether or not those were contiguous.
>>
>> What I did: for each of the probe request samples, I computed how many
>> nodes appeared in i of the previous 24 samples (24 samples at 5 hour
>> intervals is a 5 day window). I then averaged these counts across
>> samples. If an average sample has N_i nodes appearing in i of the
>> previous 24 samples, then the average sample size over those 24 is
>> sum(N_i*(i/24)). Over the 387 samples (ignoring the first 23 where
>> there isn't a "most recent 24 samples"), I have an average sample
>> size of 5757.1 nodes. If we assume that each node is online with
>> probability i/24, and all nodes are independent (see previous caveat
>> about this assumption being incorrect), then the number of nodes that
>> are online in both of two different sampling intervals is
>> sum(N_i*(i/24)^2). For this number, I get 3511.5 nodes. That is, if
>> you select a random online node at some time t_1, the odds that it
>> will be online at some later time t_2 are about 0.610.
>>
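For anyone wanting to redo the arithmetic above, here is a minimal Python
sketch of the two sums. The function names and the toy counts are made up
for illustration (they are not the measured N_i values, which are not
reproduced in this message), and the sketch inherits the independence
assumption, so treat the result as a rough estimate.

```python
def average_sample_size(counts, window):
    """sum(N_i * (i / window)): expected number of nodes online in one sample."""
    return sum(n * (i / window) for i, n in counts.items())


def online_in_both(counts, window):
    """sum(N_i * (i / window)^2): expected nodes online in two independent samples."""
    return sum(n * (i / window) ** 2 for i, n in counts.items())


def persistence(counts, window):
    """Probability that a node online at t_1 is also online at t_2."""
    return online_in_both(counts, window) / average_sample_size(counts, window)


# Hypothetical counts, NOT the measured data: N_i keyed by i (samples seen out of 24).
toy_counts = {24: 100, 12: 200, 6: 400}

print(average_sample_size(toy_counts, 24))    # 300.0
print(online_in_both(toy_counts, 24))         # 175.0
print(round(persistence(toy_counts, 24), 3))  # 0.583

# Cross-check against the totals reported above for the 24-sample window.
print(round(3511.5 / 5757.1, 3))  # 0.61
```

Feeding in the real averaged counts in place of toy_counts should give
back the 5757.1 and 3511.5 figures, if the counts are keyed the same way.
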
>> I then repeated the above using the most recent 72 samples (15 days).
>> The distributions were roughly similar. Average sample size was
>> 5824.1, expected nodes online in both of two samples is 3106.8, or a
>> probability of 0.533 that a randomly chosen node will be online later.
>>
>> Nodes online in 24 of 24 samples make up 21.9% of an average sample.
>> Nodes online in 70, 71, or 72 samples make up 13.6%. Low-uptime nodes
>> (< 40% according to sink logic; here taken as <= 9 samples of 24 or <=
>> 27 of 72 (to make the 24/72 numbers directly comparable)) are 30.8% on
>> the 24-sample data, and 37.7% on the 72-sample data. I believe both
>> of these discrepancies result from join/leave churn, whether permanent
>> or over medium time periods (i.e. users who use Freenet for a couple
>> hours or days every few weeks).
>>
>> Evan Daniel
>>
>> (If you want the full spreadsheet or raw data, ask. The spreadsheet
>> was nearly 0.5 MiB, so I didn't attach it. The averaged counts are
>> below; this is enough to reproduce my calculations assuming samples
>> are independent.)
>>
> Some more analysis on this:
>
> [14:24:50] <evanbd> toad_: 5757 nodes online in an average sample. Taking
> high uptime as 23 or 24 samples, low uptime as 1-9 samples, and medium as
> 10-22...
> [14:25:52] <toad_> evanbd: the other question of course is how much
> redundancy can we get away with before it starts to be a problem ... that
> sort of depends on MHKs though
> [14:25:56] <evanbd> toad_: The high uptime group is 1505 nodes (1258 in
> 24/24). They have an average uptime of 99.3%.
> [14:26:23] <evanbd> toad_: The medium uptime group is 2478 nodes; they have
> an average uptime of 65%.
> [14:26:25] <toad_> if we don't have MHKs, the top block will always be
> grossly unreliable ...
> [14:26:38] <toad_> evanbd: this is by nodes typically online ?
> [14:26:47] <evanbd> toad_: And the low uptime group is 1774 nodes, with
> average uptime 22.9%.
> [14:27:51] <toad_> evanbd: okay, and this is by nodes online at an instant?
> [14:28:09] <evanbd> toad_: This is: Choose a random sample; choose a random
> node online in that sample. It will be a medium-uptime node with probability
> 2478/5757 (= 0.430). ?On average, its uptime will be 65%.
> [14:28:17] <evanbd> toad_: (In other words, yes)
> [14:28:31] <toad_> this is much better than i had expected
> [14:28:47] <evanbd> Well, by definition their uptime is > 40% :)
> [14:28:59] <toad_> so 26% have 99% uptime, 43% have 65% uptime, and 31% have
> 23% uptime
> [14:29:18] <toad_> right, but it means that nearly 70% of nodes online at any
> given time have 65%+ uptime
> [14:29:32] <toad_> i.e. we are *not* swamped with low uptime nodes
> [14:30:01] <toad_> at least if we consider a week ... this doesn't answer the
> question of try-it-and-leave
>
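The group figures in the log above can be checked the same way. A small
Python sketch, using only the node counts and average uptimes quoted in
the chat; the variable names are mine, and weighting each group's uptime
by its share of online nodes again assumes samples are independent.

```python
# (nodes in an average sample, average uptime) per group, as quoted in the chat log.
groups = {
    "high": (1505, 0.993),
    "medium": (2478, 0.65),
    "low": (1774, 0.229),
}

total = sum(nodes for nodes, _ in groups.values())

# Share of an average sample belonging to each group.
shares = {name: nodes / total for name, (nodes, _) in groups.items()}

# Average uptime of a randomly chosen online node (share-weighted mean).
mean_uptime = sum(shares[name] * uptime for name, (_, uptime) in groups.items())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"mean uptime of a random online node: {mean_uptime:.3f}")
```

The shares come out at roughly 26%, 43% and 31%, and the weighted mean
lands near 0.61, in line with the 0.610 figure from the original mail.
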
(09:31:32 AM) evanbd: toad_: No... 44% have uptime over 70% :)
(09:31:34 AM) toad_: evanbd: i've posted what you just said to devl
(09:31:52 AM) toad_: evanbd: ah :<
(09:32:09 AM) toad_: yeah, i see ...
(09:32:10 AM) evanbd: toad_: 26% have uptime between 40% and 70%
(09:32:32 AM) toad_: right, 70% have uptime >40% :|
Also, note that the above numbers are based on the same data set as
the original email: that is, they're not current.
Evan