On Sunday 06 September 2009 23:51:48 Evan Daniel wrote:
> I've been giving some thought to a plan for how to measure the performance of Freenet in a statistically valid fashion, with enough precision that we can assess whether a change helped or hurt, and by how much. Previous changes, even fairly significant ones like FOAF routing and variable peer counts, have proved difficult to assess. These are my current thoughts on measuring Freenet; comments, whether general or specific, would be much appreciated. The problem is hard, and my knowledge of statistics is far from perfect. I'll be writing another email asking for volunteers to collect data shortly, but I want to do a little more with my stats collection code first.
>
> Measuring Freenet is hard. The common complaint is that the data is too noisy. This isn't actually that problematic; extracting small signals from lots of noise just requires lots of data and an appropriate statistical test or two. What makes testing Freenet really hard is that the data is not only noisy, it is also difficult to collect well. For starters, we have good reason to believe that there are strong effects of both time of day and day of week. Node uptime may matter, both session uptime and past history. Local node usage is likely to vary, and probably causes variations in performance with respect to remote requests as well. Because of security concerns, we can't collect data from all nodes or even a statistically valid sample of nodes.
>
> At present, my plan is to collect HTL histograms of request counts and success rates, and log the histograms hourly, along with a few other stats like datastore size, some local usage info, and uptime. My theory is that although the data collection nodes do not represent a valid sample, the requests flowing into them should. Specifically, node locations and request locations are well distributed, in a manner that should be entirely uncorrelated with whether a node is a test node or whether a request gets routed to a test node. Higher-bandwidth nodes route more requests overall, and node bandwidth probably shows sampling bias, but that should affect requests equally, independent of what key is being requested. There may be some bias in usage patterns, and available bandwidth may create a bias among peers chosen that correlates with usage patterns and with being a test node. In order to reduce these effects, I currently plan to use only the data from HTL 16 and below; in my experiments so far, on my node, the HTL 18 and 17 data exhibits far more variation between sampling intervals.
>
> My current plan for data collection goes like this. Collect data from before a change, binned hourly. When a new build is released, first give the network a day or three to upgrade and stabilize, ignoring the data during the upgrade period. Then, collect some more data. For each participating node, take the data from the set of hours of the week during which the node was running both before and after the change, and ignore other hours. (If node A was running and gathering data on Monday for the 09:00 hour both before and after, but was only gathering data for Monday's 10:00 hour during one of the two periods, then we only look at the 09:00 hour data.)
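[A minimal Python sketch of that hour-of-week pairing, just to make the binning concrete. The record layout (node_id, hour_of_week, htl, requests, successes) is a hypothetical log format chosen for illustration, not the actual stats collection output.]

from collections import defaultdict

def pair_samples(before, after):
    """Keep only the (node, hour-of-week) bins that were logged both
    before and after the change; all other hours are discarded."""
    def index(samples):
        binned = defaultdict(list)
        for s in samples:  # each s: {'node_id', 'hour_of_week', 'htl', 'requests', 'successes'}
            binned[(s['node_id'], s['hour_of_week'])].append(s)
        return binned

    before_bins, after_bins = index(before), index(after)
    common = set(before_bins) & set(after_bins)  # bins present in both periods
    return {key: (before_bins[key], after_bins[key]) for key in common}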
>
> Then, I need to perform some sort of non-parametric test on the data to see whether the 'before' data is different from the 'after' data. Currently I'm looking at one of Kruskal-Wallis one-way ANOVA, Wilcoxon signed-rank, or MWW. I'm not yet sure which is best, and I may try several approaches. I'll probably apply the tests to each distinct HTL separately, with appropriate multiple-test corrections to the p-values.
>
> I also need to determine exactly what changes I expect to see. For example, if a change makes the network better at finding data, then we expect more of the requests that are sent to succeed. This may mean that success rates go up at all HTLs. Or, it may mean that requests succeed earlier, meaning that the low-HTL requests contain fewer requests for 'findable' data. So an improvement to the network might result in a decrease in low-HTL success rates. Roughly speaking, a change that reduces the number of hops required to find data should improve success rates at high HTL and decrease them at low HTL, but a change that means more data becomes findable should improve them at all HTLs. I expect that most changes would be a mix of the two. Furthermore, I have to decide on how to treat local vs remote success rates. The local success rate exhibits a strong bias with things like node age and datastore size. However, the bias carries over into remote success rates as well -- more local successes means that the requests that don't succeed locally will tend to be 'harder' requests. Taking the global success rate is probably still heavily biased.
>
> One approach would be to look only at the incoming request counts. Incoming request counts are only influenced by effects external to the node, and are therefore less subject to sampling bias. Averaged across the network, the decrease in incoming requests from one HTL to the next (for the non-probabilistic-drop HTLs, or with appropriate corrections) represents the number of requests that succeeded at the higher HTL. However, this does not account for rejected incoming requests, which decrement the HTL at the sending node without performing a useful function. (This will get even more complicated with the bug 3368 changes.)
>
> My current plan is to look at global success rates, as they combine whether the request has been routed to the right node (where it results in a local success) and whether it gets routed properly in the future (remote success). As we expect new nodes to become better at serving requests as their cache and store fill up, I plan to only make use of data from established nodes (for some undecided definition of established).
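[To illustrate the per-HTL test plus multiple-comparison correction described a few paragraphs up, here is a minimal Python sketch using SciPy. It picks the Mann-Whitney U test (MWW) with a Bonferroni correction, which is only one of the combinations mentioned, and it assumes the before/after data have already been reduced to lists of hourly success rates per HTL.]

from scipy.stats import mannwhitneyu

def compare_success_rates(before, after, alpha=0.05):
    # `before` and `after` map HTL -> list of hourly success rates
    # (one value per matched node/hour-of-week bin).
    htls = sorted(set(before) & set(after))
    if not htls:
        return {}
    corrected_alpha = alpha / len(htls)  # Bonferroni: divide by the number of tests
    results = {}
    for htl in htls:
        stat, p = mannwhitneyu(before[htl], after[htl], alternative='two-sided')
        results[htl] = (p, p < corrected_alpha)  # p-value, significant after correction
    return results

[If the before/after samples are matched bin for bin, scipy.stats.wilcoxon (the signed-rank test) would be the paired alternative; Holm or Benjamini-Hochberg corrections are less conservative options than Bonferroni.]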
Very interesting! We should have tried to tackle this problem a long time ago...

We have a few possible changes queued that might be directly detectable:
- A routing level change related to loop detection. Should hopefully increase success rates / reduce hops taken, but only fractionally. This may however be detectable given the above...
- Bloom filter sharing. Hopefully this will increase success rates at all levels, making more content available; however there is a small overhead.

There is also work to be done on the client level:
- MHKs (multiple top blocks).
- Various changes to splitfiles: add some extra check blocks for non-full segments, split them evenly, etc.
- Reinserting the top block on fetching a splitfile where it took some time to find it.

These should not have much of an effect on the routing level - if they have an effect it is probably negative. However they should have a significant positive effect on success rates.

So far the only tool we have for measuring this is the LongTermPushPullTest. This involves inserting a 64KB splitfile to an SSK and then pulling it after (2^n)-1 days for various n (the same key is never requested twice). This is currently giving significantly worse results for 3 days than for 7 days:

0 days (= push-pull test): 13 samples, all success (100% success)
1 day: 11 success, 1 Splitfile error (meaning RNFs/DNFs) (92% success)
3 days: 4 success, 3 Data not found, 3 Not enough data found (40% success)
7 days: 6 success, 3 Data not found (2 DNFs not eligible because nothing was inserted 7 days prior) (66.7% success)
15 days: 1 success, 1 Not enough data found (50% success)

There isn't really enough data here, but it is rather alarming. Any input? How much data is enough data? Other ways to measure?
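[One way to put a rough number on "how much data is enough" is to attach confidence intervals to those proportions. A small Python sketch computing 95% Wilson score intervals, using the eligible sample counts quoted above; the choice of interval is just one reasonable option.]

from math import sqrt

def wilson_interval(successes, n, z=1.96):
    # 95% Wilson score interval for a binomial proportion.
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Eligible sample counts taken from the results above.
samples = {'0 days': (13, 13), '1 day': (11, 12), '3 days': (4, 10),
           '7 days': (6, 9), '15 days': (1, 2)}
for age, (ok, n) in samples.items():
    low, high = wilson_interval(ok, n)
    print('%s: %d/%d success, 95%% CI %.2f-%.2f' % (age, ok, n, low, high))

[With samples this small the intervals are wide, and the 3-day and 7-day intervals overlap heavily, so the apparent dip at 3 days could still be noise; a rough target would be enough samples per age bin that the intervals stop overlapping, or that a two-proportion test reaches significance.]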
