On Tue, May 4, 2010 at 9:31 AM, Matthew Toseland <[email protected]> wrote:
> On Thursday 18 February 2010 20:44:57 Evan Daniel wrote:
>> I've followed up my previous crude estimates of node churn with some
>> more detailed numbers. (See my mail in re: "data persistence again"
>> on 20100122 for previous version and more detailed explanation.)
>>
>> Again, some brief caveats: the following basically assumes that all
>> samples are independent. This is quite incorrect, because of time of
>> day effects. Nonetheless, I think it's useful. Many of the obvious
>> uses for this data ("If an insert is stored on 3 nodes, how likely is
>> it one of them will be online later?") are strongly impacted by this.
>> Use appropriate caution in analysis. Also, I have a few missing
>> samples; for each sample, I looked at the previous set of 24 samples
>> that I did have, whether or not those were contiguous.
>>
>> What I did: for each of the probe request samples, I computed how many
>> nodes appeared in n of the previous 24 samples (24 samples at 5 hour
>> intervals is a 5 day window). I then averaged these counts across
>> samples. If an average sample has N_i nodes appearing in i of the
>> previous 24 samples, then the average sample size over those 24 is
>> sum(N_i*(i/24)). Over the 387 samples (ignoring the first 23 where
>> there aren't a "most recent 24 samples"), I have an average sample
>> size of 5757.1 nodes. If we assume that each node is online with
>> probability i/24, and all nodes are independent (see previous caveat
>> about this assumption being incorrect), then the number of nodes that
>> are online in both of two different sampling intervals is
>> sum(N_i*(i/24)^2). For this number, I get 3511.5 nodes. That is, if
>> you select a random online node at some time t_1, the odds that it
>> will be online at some later time t_2 are about 0.610.
>>
>> I then repeated the above using the most recent 72 samples (15 days).
>> The distributions were roughly similar. Average sample size was
>> 5824.1, expected nodes online in both of two samples is 3106.8, or a
>> probability of 0.533 that a randomly chosen node will be online later.
>>
>> Nodes online in 24 of 24 samples make up 21.9% of an average sample.
>> Nodes online in 70, 71, or 72 samples make up 13.6%. Low-uptime nodes
>> (< 40% according to sink logic; here taken as <= 9 samples of 24 or <=
>> 27 of 72 (to make the 24/72 numbers directly comparable)) are 30.8% on
>> the 24-sample data, and 37.7% on the 72-sample data. I believe both
>> of these discrepancies result from join/leave churn, whether permanent
>> or over medium time periods (ie users who use Freenet for a couple
>> hours or days every few weeks).
>>
>> Evan Daniel
>>
>> (If you want the full spreadsheet or raw data, ask. The spreadsheet
>> was nearly 0.5 MiB, so I didn't attach it. The averaged counts are
>> below; this is enough to reproduce my calculations assuming samples
>> are independent.)
>>
> Some more analysis on this:
>
> [14:24:50] <evanbd> toad_: 5757 nodes online in an average sample. Taking
> high uptime as 23 or 24 samples, low uptime as 1-9 samples, and medium as
> 10-22...
> [14:25:52] <toad_> evanbd: the other question of course is how much
> redundancy can we get away with before it starts to be a problem ... that
> sort of depends on MHKs though
> [14:25:56] <evanbd> toad_: The high uptime group is 1505 nodes (1258 in
> 24/24). They have an average uptime of 99.3%.
> [14:26:23] <evanbd> toad_: The medium uptime group is 2478 nodes; they have
> an average uptime of 65%.
> [14:26:25] <toad_> if we don't have MHKs, the top block will always be
> grossly unreliable ...
> [14:26:38] <toad_> evanbd: this is by nodes typically online ?
> [14:26:47] <evanbd> toad_: And the low uptime group is 1774 nodes, with
> average uptime 22.9%.
> [14:27:51] <toad_> evanbd: okay, and this is by nodes online at an instant?
> [14:28:09] <evanbd> toad_: This is: Choose a random sample; choose a random
> node online in that sample. It will be a medium-uptime node with probability
> 2478/5757 (= 0.430). On average, its uptime will be 65%.
> [14:28:17] <evanbd> toad_: (In other words, yes)
> [14:28:31] <toad_> this is much better than i had expected
> [14:28:47] <evanbd> Well, by definition their uptime is > 40% :)
> [14:28:59] <toad_> so 26% have 99% uptime, 43% have 65% uptime, and 31% have
> 23% uptime
> [14:29:18] <toad_> right, but it means that nearly 70% of nodes online at any
> given time have 65%+ uptime
> [14:29:32] <toad_> i.e. we are *not* swamped with low uptime nodes
> [14:30:01] <toad_> at least if we consider a week ... this doesn't answer the
> question of try-it-and-leave
>
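
For anyone who wants to check the group arithmetic in the quoted log, here is a rough sketch in Python. The group sizes and average uptimes are the ones quoted above; the variable and function names are my own and purely illustrative.

```python
# Group sizes and average uptimes as quoted in the chat log
# (24-sample window, average sample of 5757 nodes).
groups = {
    "high":   (1505, 0.993),  # seen in 23 or 24 of 24 samples
    "medium": (2478, 0.65),   # seen in 10-22 of 24 samples
    "low":    (1774, 0.229),  # seen in 1-9 of 24 samples
}

total = sum(n for n, _ in groups.values())  # nodes in an average sample


def share(name):
    """Fraction of an average sample that falls in the named group."""
    return groups[name][0] / total


# Uptime of a randomly chosen online node, averaged over the three groups.
mean_uptime = sum(n * u for n, u in groups.values()) / total

for name in groups:
    print(f"{name:6s} {share(name):.3f}")
print(f"medium+high {share('medium') + share('high'):.3f}")
print(f"mean uptime {mean_uptime:.3f}")
```

The shares come out at roughly 0.261, 0.430 and 0.308, and medium plus high together at roughly 0.692. The weighted mean uptime lands near 0.610, which agrees with the 3511.5/5757.1 figure in the quoted mail, as one would expect if both are the same weighted average.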
(09:31:32 AM) evanbd: toad_: No... 44% have uptime over 70% :)
(09:31:34 AM) toad_: evanbd: i've posted what you just said to devl
(09:31:52 AM) toad_: evanbd: ah :<
(09:32:09 AM) toad_: yeah, i see ...
(09:32:10 AM) evanbd: toad_: 26% have uptime between 40% and 70%
(09:32:32 AM) toad_: right, 70% have uptime >40% :|

Also, note that the above numbers are based on the same data set as
the original email: that is, they're not current.

Evan
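
The calculation described in the quoted February mail can be sketched roughly as follows. This is only an illustration under that mail's independence assumption (which it notes is incorrect because of time-of-day effects). The function names are hypothetical, and the counts in `example_counts` are made up for the example; they are NOT the real averaged counts.

```python
def churn_estimate(counts, window):
    """counts[i] = average number of nodes seen in i of the last `window` samples.

    Returns (expected sample size, expected nodes online in both of two
    samples, probability a random online node is online again later).
    """
    # Expected number of nodes online in one sample: sum(N_i * (i/window)).
    online = sum(n * (i / window) for i, n in counts.items())
    # Expected number online in both of two samples, assuming independence:
    # sum(N_i * (i/window)^2).
    both = sum(n * (i / window) ** 2 for i, n in counts.items())
    return online, both, both / online


def p_any_online(p, k):
    """Chance that at least one of k independently chosen nodes is online later."""
    return 1 - (1 - p) ** k


# HYPOTHETICAL counts for illustration only; not taken from the real data.
example_counts = {6: 400, 12: 200, 24: 100}
online, both, p = churn_estimate(example_counts, window=24)
print(online, both, p)

# The "insert stored on 3 nodes" question, using the 0.610 figure
# from the 24-sample data and assuming the nodes are independent.
print(p_any_online(0.610, 3))
```

With the 24-sample figure of 0.610, the three-node case works out to about 0.94, but that inherits the independence assumption and should be read with the same caution as the rest of the numbers.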
