I've been giving some thought to a plan for how to measure the performance of Freenet in a statistically valid fashion, with enough precision that we can assess whether a change helped or hurt, and by how much. Previous changes, even fairly significant ones like FOAF routing and variable peer counts, have proved difficult to assess. These are my current thoughts on measuring Freenet; comments, whether general or specific, would be much appreciated. The problem is hard, and my knowledge of statistics is far from perfect. I'll be writing another email asking for volunteers to collect data shortly, but I want to do a little more with my stats collection code first.
Measuring Freenet is hard. The common complaint is that the data is too noisy. That by itself isn't the real problem; extracting a weak signal from a lot of noise just requires a lot of data and an appropriate statistical test or two. What makes testing Freenet really hard is that the data is not only noisy but also difficult to collect well. For starters, we have good reason to believe there are strong time-of-day and day-of-week effects. Node uptime may matter, both session uptime and past history. Local node usage is likely to vary, and probably causes variations in performance on remote requests as well. And because of security concerns, we can't collect data from all nodes, or even from a statistically valid sample of nodes.

At present, my plan is to collect HTL histograms of request counts and success rates, and log the histograms hourly, along with a few other stats like datastore size, some local usage info, and uptime. My theory is that although the data collection nodes are not a valid sample, the requests flowing into them should be. Specifically, node locations and request locations are well distributed, in a manner that should be entirely uncorrelated with whether a node is a test node or whether a request gets routed to a test node. Higher-bandwidth nodes route more requests overall, and node bandwidth probably shows sampling bias, but that should affect requests equally, regardless of which key is being requested. There may be some bias in usage patterns, and available bandwidth may create a bias in peer selection that correlates with usage patterns and with being a test node. To reduce these effects, I currently plan to use only the data from HTL 16 and below; in my experiments so far, on my node, the HTL 18 and 17 data show far more variation between sampling intervals.

My current plan for data collection goes like this. Collect data from before a change, binned hourly. When a new build is released, first give the network a day or three to upgrade and stabilize, ignoring the data from the upgrade period. Then collect some more data. For each participating node, take the data from the set of hours of the week during which the node was running both before and after the change, and ignore the other hours. (If node A was gathering data during the Monday 09:00 hour both before and after the change, but gathered data for the Monday 10:00 hour in only one of the two periods, then we only look at the 09:00 data.)

Then I need to perform some sort of non-parametric test on the data to see whether the 'before' data differs from the 'after' data. Currently I'm looking at the Kruskal-Wallis one-way ANOVA, the Wilcoxon signed-rank test, or the Mann-Whitney-Wilcoxon (MWW) test. I'm not yet sure which is best, and I may try several approaches. I'll probably apply the tests to each distinct HTL separately, with appropriate multiple-comparison corrections to the p-values. (I sketch this step in code below.)

I also need to determine exactly what changes I expect to see. For example, if a change makes the network better at finding data, then we expect more of the requests that are sent to succeed. This may mean that success rates go up at all HTLs. Or it may mean that requests succeed earlier, so that the low-HTL requests contain fewer requests for 'findable' data. So an improvement to the network might actually result in a decrease in low-HTL success rates.
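To make that concrete, here is a minimal sketch of the hour-matching and per-HTL testing steps, written in Python against scipy. The record layout, the cutoff at HTL 16, and the choice of a two-sided Mann-Whitney U test with a plain Bonferroni correction are all illustrative assumptions on my part; the Wilcoxon signed-rank or Kruskal-Wallis tests could be substituted, and a less conservative correction may turn out to be more appropriate.

    from collections import defaultdict, namedtuple
    from scipy.stats import mannwhitneyu

    # One row per (node, hour-of-week, HTL) bin; this layout is an assumption.
    Record = namedtuple('Record', 'node_id hour_of_week htl success_rate')

    MAX_HTL = 16  # drop HTL 18 and 17, which show much more variation

    def matched(before, after):
        """Keep only (node, hour-of-week) pairs present in both periods."""
        keys = ({(r.node_id, r.hour_of_week) for r in before} &
                {(r.node_id, r.hour_of_week) for r in after})
        pick = lambda recs: [r for r in recs
                             if (r.node_id, r.hour_of_week) in keys
                             and r.htl <= MAX_HTL]
        return pick(before), pick(after)

    def by_htl(records):
        """Group the success-rate samples by HTL."""
        groups = defaultdict(list)
        for r in records:
            groups[r.htl].append(r.success_rate)
        return groups

    def compare(before, after):
        """Two-sided Mann-Whitney U at each HTL, Bonferroni-corrected."""
        b, a = (by_htl(recs) for recs in matched(before, after))
        htls = sorted(set(b) & set(a), reverse=True)
        results = {}
        for htl in htls:
            stat, p = mannwhitneyu(b[htl], a[htl], alternative='two-sided')
            results[htl] = min(1.0, p * len(htls))  # corrected p-value
        return results

The point of the correction step is simply that testing sixteen or so HTLs separately will throw up spurious 'significant' results at the usual thresholds unless the p-values are adjusted.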
Roughly speaking, a change that reduces the number of hops required to find data should improve success rates at high HTL and decrease them at low HTL, while a change that makes more data findable should improve them at all HTLs. I expect most changes would be a mix of the two.

Furthermore, I have to decide how to treat local vs remote success rates. The local success rate exhibits a strong bias from things like node age and datastore size. However, the bias carries over into remote success rates as well -- more local successes mean that the requests which don't succeed locally will tend to be 'harder' requests. Taking the global success rate is probably still heavily biased.

One approach would be to look only at the incoming request counts. Incoming request counts are influenced only by effects external to the node, and are therefore less subject to sampling bias. Averaged across the network, the decrease in incoming requests from one HTL to the next (for the non-probabilistic-drop HTLs, or with appropriate corrections) represents the number of requests that succeeded at the higher HTL; a rough sketch of that calculation is below. However, this does not account for rejected incoming requests, which decrement the HTL at the sending node without performing a useful function. (This will get even more complicated with the bug 3368 changes.)

My current plan is to look at global success rates, as they combine whether the request has been routed to the right node (where it results in a local success) and whether it gets routed properly from there onwards (a remote success). Since we expect new nodes to get better at serving requests as their cache and store fill up, I plan to use data only from established nodes (for some as-yet-undecided definition of established).
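Here is the incoming-request-count idea sketched as code, under the simplifying assumptions that the per-HTL incoming totals have already been summed across the sample and that rejections and the probabilistic-decrement HTLs are ignored (they would need the corrections mentioned above).

    def implied_success_rates(incoming):
        """incoming: dict mapping HTL -> total incoming request count.
        The drop from one HTL to the next lower one approximates the number
        of requests that succeeded (and stopped) at the higher HTL.  Rejected
        requests and the probabilistic-decrement HTLs are not corrected for."""
        rates = {}
        for htl in sorted(incoming, reverse=True):
            if htl - 1 in incoming and incoming[htl] > 0:
                succeeded = max(0, incoming[htl] - incoming[htl - 1])
                rates[htl] = succeeded / incoming[htl]
        return rates

With made-up numbers: if the sample sees 10000 incoming requests at HTL 12 and 8500 at HTL 11, the implied success rate at HTL 12 is 1500/10000, or 15%.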
Evan Daniel