I've been giving some thought to a plan for how to measure the
performance of Freenet in a statistically valid fashion, with enough
precision that we can assess whether a change helped or hurt, and by
how much.  Previous changes, even fairly significant ones like FOAF
routing and variable peer counts, have proved difficult to assess.
These are my current thoughts on measuring Freenet; comments, whether
general or specific, would be much appreciated.  The problem is hard,
and my knowledge of statistics is far from perfect.  I'll be writing
another email asking for volunteers to collect data shortly, but I
want to do a little more with my stats collection code first.

Measuring Freenet is hard.  The common complaint is that the data is
too noisy.  This isn't actually that problematic; extracting low-level
signals from lots of noise just requires lots of data and an
appropriate statistical test or two.  What makes testing Freenet
really hard is that the data is not only noisy but also difficult to
collect well.  For starters, we have good reason to believe that there
are strong effects of both time of day and day of week.  Node uptime
may matter, both session uptime and past history.  Local node usage is
likely to vary, and probably causes variations in performance with
respect to remote requests as well.  Because of security concerns, we
can't collect data from all nodes or even a statistically valid sample
of nodes.

At present, my plan is to collect HTL histograms of request counts and
success rates, and log the histograms hourly, along with a few other
stats like datastore size, some local usage info, and uptime.  My
theory is that although the data collection nodes do not represent a
valid sample, the requests flowing into them should.  Specifically,
node locations and request locations are well distributed, in a manner
that should be entirely uncorrelated with whether a node is a test
node or whether a request gets routed to a test node.  Higher
bandwidth nodes route more requests overall, and node bandwidth
probably shows sampling bias, but that should impact requests equally,
independent of what key is being requested.  There may be some bias in
usage patterns, and available bandwidth may create a bias in peer
selection that correlates with usage patterns and with being a test node.
In order to reduce these effects, I currently plan to use only the
data from HTL 16 and below; in my experiments so far, on my node, the
HTL 18 and 17 data exhibits far more variation between sampling
intervals.
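
To make this concrete, here is a rough sketch of what one hourly
record could look like.  The field names and the choice of Python are
purely illustrative -- this is not what my stats collection code
actually emits:

    # Hypothetical shape of one hourly sample from a participating node.
    # All field names here are made up for illustration.
    hourly_sample = {
        "hour_start":     "2009-01-05T09:00Z",  # start of the hour, UTC
        "session_uptime": 86400,                # seconds this session
        "datastore_size": 21474836480,          # bytes
        "local_requests": 1234,                 # local usage indicator
        # Per-HTL histograms, indexed from 18 (fresh) downward:
        "incoming":  {18: 510, 17: 495, 16: 440, 15: 390},
        "successes": {18:  40, 17:  45, 16:  50, 15:  35},
    }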

My current plan for data collection goes like this.  Collect data from
before a change, binned hourly.  When a new build is released, first
give the network a day or three to upgrade and stabilize, ignoring the
data during the upgrade period.  Then, collect some more data.  For
each participating node, take the data from the set of hours of the
week during which the node was running both before and after the
change, and ignore other hours.  (If node A was running and gathering
data on Monday for the 09:00 hour both before and after the change,
but only gathered data for Monday's 10:00 hour during one of the two
periods, then we only look at the 09:00 hour data.)
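
In code, that hour-of-week matching might look something like the
following sketch (Python, with made-up record shapes); the point is
just that each node contributes only the (weekday, hour) slots it
covered both before and after the change:

    from collections import defaultdict

    def hour_of_week(ts):
        # Key a sample by (weekday, hour); (0, 9) is Monday 09:00.
        return (ts.weekday(), ts.hour)

    def matched_slots(before, after):
        """before/after: lists of (timestamp, stats) pairs for one node."""
        slots_before = defaultdict(list)
        slots_after = defaultdict(list)
        for ts, stats in before:
            slots_before[hour_of_week(ts)].append(stats)
        for ts, stats in after:
            slots_after[hour_of_week(ts)].append(stats)
        # Keep only the slots observed in both periods; ignore the rest.
        common = slots_before.keys() & slots_after.keys()
        return {slot: (slots_before[slot], slots_after[slot])
                for slot in common}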

Then, I need to perform some sort of non-parametric test on the data
to see whether the 'before' data is different from the 'after' data.
Currently I'm looking at the Kruskal-Wallis one-way ANOVA, the
Wilcoxon signed-rank test, or the Mann-Whitney-Wilcoxon (MWW) test.
I'm not yet sure which is best, and I may try several approaches.
I'll probably apply the tests to each distinct HTL separately, with
appropriate multiple-test corrections to the
p-values.
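
As a sketch of how that might run, using SciPy's Mann-Whitney U as a
stand-in for whichever test wins out, and a simple Bonferroni
correction for the multiple comparisons:

    from scipy.stats import mannwhitneyu

    def compare_per_htl(before, after, alpha=0.05):
        """before/after: dicts mapping htl -> list of hourly success rates."""
        htls = sorted(set(before) & set(after))
        n_tests = len(htls)
        for htl in htls:
            stat, p = mannwhitneyu(before[htl], after[htl],
                                   alternative="two-sided")
            p_adj = min(1.0, p * n_tests)  # Bonferroni: crude but safe
            verdict = "significant" if p_adj < alpha else "not significant"
            print("htl %2d: U=%.1f, corrected p=%.4f (%s)"
                  % (htl, stat, p_adj, verdict))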

I also need to determine exactly what changes I expect to see.  For
example, if a change makes the network better at finding data, then we
expect more requests that are sent to succeed.  This may mean that
success rates go up at all HTLs.  Or, it may mean that requests
succeed earlier, meaning that the low-HTL requests contain fewer
requests for 'findable' data.  So an improvement to the network might
result in a decrease in low-HTL success rates.  Roughly speaking, a
change that reduces the number of hops required to find data should
improve success rates at high HTL and decrease them at low HTL, but a
change that means more data becomes findable should improve them at
all HTLs.  I expect that most changes would be a mix of the two.
Furthermore, I have to decide how to treat local vs. remote success
rates.  The local success rate is strongly biased by things like node
age and datastore size.  However, the bias carries over into remote
success rates as well -- more local successes mean that the requests
that don't succeed locally will tend to be 'harder' requests.  Even
the global success rate is probably still heavily biased.
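
For concreteness, here is how I read the three rates; the arithmetic
is trivial, but it shows why the local bias leaks into the remote
rate (the requests left over after local successes skew 'harder'):

    def success_rates(incoming, local_succ, remote_succ):
        """Counts for one node, one htl bucket, one hour (illustrative)."""
        local_rate = local_succ / incoming
        # Remote rate is conditioned on the requests the node could not
        # answer from its own store/cache -- the 'harder' leftovers.
        forwarded = incoming - local_succ
        remote_rate = remote_succ / forwarded if forwarded else 0.0
        global_rate = (local_succ + remote_succ) / incoming
        return local_rate, remote_rate, global_rate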

One approach would be to look only at the incoming request counts.
Incoming request counts are influenced only by effects external to the
node, and are therefore less subject to sampling bias.  Averaged across
the network, the decrease in incoming requests from one HTL to the
next (for the non-probabilistic drop HTLs, or with appropriate
corrections) represents the number of requests that succeeded at the
higher HTL.  However, this does not account for rejected incoming
requests, which decrement the HTL at the sending node without
performing any useful work.  (This will get even more complicated
with bug 3368 changes.)
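
Ignoring rejections for the moment, the differencing itself is
simple; something like this, where the HTL range to difference over
is an assumption on my part:

    def implied_successes(incoming):
        """incoming: dict htl -> incoming request count, averaged across
        the network.  Requests seen at htl h but not at h-1 are assumed
        to have succeeded at h.  Only valid where the htl decrement is
        deterministic, and rejections make it an overestimate."""
        return {h: incoming[h] - incoming[h - 1]
                for h in incoming
                if h - 1 in incoming and 2 <= h <= 16}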

My current plan is to look at global success rates, as they combine
whether the request has already been routed to the right node (where
it results in a local success) and whether it gets routed properly
onward (remote success).  As we expect new nodes to become better at
serving requests as their cache and store fill up, I plan to only make
use of data from established nodes (for some undecided definition of
established).
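
The filter itself is trivial once 'established' is pinned down;
something like the following, where the thresholds are entirely made
up and are exactly the part that remains to be decided:

    # Hypothetical definition of an "established" node.
    MIN_AGE_DAYS = 30        # node has existed at least this long
    MIN_STORE_FILL = 0.9     # datastore at least this fraction full

    def is_established(node):
        return (node.age_days >= MIN_AGE_DAYS and
                node.store_used / node.store_capacity >= MIN_STORE_FILL)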

Evan Daniel
