On Sunday 06 September 2009 23:51:48 Evan Daniel wrote:
> I've been giving some thought to a plan for how to measure the
> performance of Freenet in a statistically valid fashion, with enough
> precision that we can assess whether a change helped or hurt, and by
> how much.  Previous changes, even fairly significant ones like FOAF
> routing and variable peer counts, have proved difficult to assess.
> These are my current thoughts on measuring Freenet; comments, whether
> general or specific, would be much appreciated.  The problem is hard,
> and my knowledge of statistics is far from perfect.  I'll be writing
> another email asking for volunteers to collect data shortly, but I
> want to do a little more with my stats collection code first.
> 
> Measuring Freenet is hard.  The common complaint is that the data is
> too noisy.  This isn't actually that problematic; extracting low-level
> signals from lots of noise just requires lots of data and an
> appropriate statistical test or two.  What makes testing Freenet
> really hard is that not only is the data noisy, but collecting it
> well is also difficult.  For starters, we have good reason to
> believe that there
> are strong effects of both time of day and day of week.  Node uptime
> may matter, both session uptime and past history.  Local node usage is
> likely to vary, and probably causes variations in performance with
> respect to remote requests as well.  Because of security concerns, we
> can't collect data from all nodes or even a statistically valid sample
> of nodes.
> 
> At present, my plan is to collect HTL histograms of request counts and
> success rates, and log the histograms hourly, along with a few other
> stats like datastore size, some local usage info, and uptime.  My
> theory is that although the data collection nodes do not represent a
> valid sample, the requests flowing into them should.  Specifically,
> node locations and request locations are well distributed, in a manner
> that should be entirely uncorrelated with whether a node is a test
> node or whether a request gets routed to a test node.  Higher
> bandwidth nodes route more requests overall, and node bandwidth
> probably shows sampling bias, but that should impact requests equally,
> independent of what key is being requested.  There may be some bias in
> usage patterns, and available bandwidth may create a bias among peers
> chosen that correlates with usage patterns and with being a test node.
>  In order to reduce these effects, I currently plan to use only the
> data from HTL 16 and below; in my experiments so far, on my node, the
> htl 18 and 17 data exhibits far more variation between sampling
> intervals.
> 
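
A rough sketch of what I imagine one hourly log record looking like --
the field names and values below are made up for illustration, not the
actual stats collection format:

    # Illustrative shape of one hourly log record from a data-collecting
    # node; field names and values are dummies, not real measurements.
    example_record = {
        "node_id": "anon-1234",       # anonymised identifier for this node
        "hour_of_week": 33,           # day_of_week * 24 + hour_of_day (UTC)
        "session_uptime_s": 86400,
        "datastore_keys": 250000,
        "local_requests": 120,        # rough measure of local usage
        # per-HTL histograms, HTL -> count (analysis restricted to HTL <= 16)
        "incoming_requests": {16: 300, 15: 280, 14: 260},
        "successes":         {16:  30, 15:  25, 14:  20},
    }

    def success_rate(record, htl):
        sent = record["incoming_requests"].get(htl, 0)
        return record["successes"].get(htl, 0) / float(sent) if sent else None
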
> My current plan for data collection goes like this.  Collect data from
> before a change, binned hourly.  When a new build is released, first
> give the network a day or three to upgrade and stabilize, ignoring the
> data during the upgrade period.  Then, collect some more data.  For
> each participating node, take the data from the set of hours of the
> week during which the node was running both before and after the
> change, and ignore other hours.  (If node A was running and
> gathering data on Monday during the 09:00 hour both before and
> after the change, but only gathered data for the Monday 10:00 hour
> in one of the two periods, then we only look at the 09:00 hour
> data.)
> 
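
In other words, per node the usable sample is the intersection of the
hour-of-week slots that have data both before and after the change.  A
minimal sketch of that matching step (the data layout is hypothetical:
node -> hour-of-week -> list of hourly success rates):

    # Keep only the (node, hour-of-week) slots present in both periods.
    def matched_pairs(before, after):
        pairs = []
        for node_id in set(before) & set(after):
            common_hours = set(before[node_id]) & set(after[node_id])
            for hour in common_hours:
                # average within each slot so every (node, hour-of-week)
                # slot contributes exactly one paired observation
                b = sum(before[node_id][hour]) / float(len(before[node_id][hour]))
                a = sum(after[node_id][hour]) / float(len(after[node_id][hour]))
                pairs.append((b, a))
        return pairs
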
> Then, I need to perform some sort of non-parametric test on the data
> to see whether the 'before' data is different from the 'after' data.
> Currently I'm looking at one of Kruskal-Wallis one-way ANOVA, Wilcoxon
> signed-rank, or MWW.  I'm not yet sure which is best, and I may try
> several approaches.  I'll probably apply the tests to each distinct
> htl separately, with appropriate multiple-tests corrections to the
> p-values.
> 
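
All three of those tests are in scipy.stats; a sketch of the per-HTL
comparison might look like the following (function and argument names
are mine, and the Bonferroni correction is just the simplest choice):

    # Per-HTL before/after comparison with a non-parametric test and a
    # simple Bonferroni correction across HTLs.
    from scipy.stats import mannwhitneyu, wilcoxon

    def compare_per_htl(before_by_htl, after_by_htl, paired=False):
        """before_by_htl / after_by_htl: dict mapping HTL -> list of
        hourly success rates.  Returns HTL -> corrected p-value."""
        raw = {}
        for htl in sorted(set(before_by_htl) & set(after_by_htl)):
            b, a = before_by_htl[htl], after_by_htl[htl]
            if paired:
                # Wilcoxon signed-rank: b and a must be paired, equal length
                _, p = wilcoxon(b, a)
            else:
                # Mann-Whitney-Wilcoxon rank-sum for unpaired samples
                _, p = mannwhitneyu(b, a, alternative="two-sided")
            raw[htl] = p
        # testing each HTL separately, so scale the p-values accordingly
        m = len(raw)
        return dict((htl, min(1.0, p * m)) for htl, p in raw.items())
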
> I also need to determine exactly what changes I expect to see.  For
> example, if a change makes the network better at finding data, then we
> expect more requests that are sent to succeed.  This may mean that
> success rates go up at all htls.  Or, it may mean that requests
> succeed earlier, meaning that the low-htl requests contain fewer
> requests for 'findable' data.  So an improvement to the network might
> result in a decrease in low-htl success rates.  Roughly speaking, a
> change that reduces the number of hops required to find data should
> improve success rates at high htl and decrease them at low htl, but a
> change that means more data becomes findable should improve them at
> all htls.  I expect that most changes would be a mix of the two.
> Furthermore, I have to decide on how to treat local vs remote success
> rates.  The local success rate exhibits a strong bias with things like
> node age and datastore size.  However, the bias carries over into
> remote success rates as well -- more local successes means that
> requests that don't succeed will tend to be 'harder' requests.  Taking
> the global success rate is probably still heavily biased.
> 
> One approach would be to look only at the incoming request counts.
> Incoming request counts are only influenced by effects external to the
> node, and therefore less subject to sampling bias.  Averaged across
> the network, the decrease in incoming requests from one htl to the
> next (for the non-probabilistic drop htls, or with appropriate
> corrections) represents the number of requests that succeeded at the
> higher htl.  However, this does not account for rejected incoming
> requests, which decrement the htl at the sending node without
> performing a useful function.  (This will get even more complicated
> with bug 3368 changes.)
> 
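
A back-of-the-envelope version of that calculation (deliberately
ignoring rejections, which is exactly the weakness described above):

    # Estimate network-wide success rates per HTL from the drop in
    # incoming request counts between adjacent HTLs.  Rejected requests
    # also decrement HTL without doing useful work, so this will
    # overestimate; only meaningful where the decrement is deterministic.
    def success_rates_from_incoming(incoming):
        """incoming: dict mapping HTL -> total incoming requests at that
        HTL, summed across reporting nodes."""
        rates = {}
        for htl in sorted(incoming, reverse=True):
            if htl - 1 in incoming and incoming[htl] > 0:
                # requests seen at `htl` but never at `htl - 1` are taken
                # to have succeeded at `htl`
                rates[htl] = (incoming[htl] - incoming[htl - 1]) / float(incoming[htl])
        return rates
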
> My current plan is to look at global success rates, as they combine
> whether the request has been routed to the right node (where it
> results in a local success) and whether it gets routed properly
> further downstream (remote success).  As we expect new nodes to
> become better at
> serving requests as their cache and store fill up, I plan to only make
> use of data from established nodes (for some undecided definition of
> established).

Very interesting! We should have tried to tackle this problem a long time ago...

We have a few possible changes queued that might be directly detectable:
- A routing level change related to loop detection. Should hopefully increase 
success rates / reduce hops taken, but only fractionally. This may however be 
detectable given the above...
- Bloom filter sharing. Hopefully this will increase success rates at all
levels by making more content available; however, there is a small overhead.

There is also work to be done on the client level:
- MHKs (multiple top blocks).
- Various changes to splitfiles: add some extra check blocks for non-full 
segments, split them evenly, etc.
- Reinserting the top block when fetching a splitfile, if the top block took
some time to find.

These should not have much effect at the routing level - if they have any
effect there it is probably negative. However, they should have a significant
positive effect on overall fetch success rates.

So far the only tool we have for measuring this is the LongTermPushPullTest.
This involves inserting a 64KB splitfile under an SSK and then pulling it
after (2^n)-1 days for various n (the same key is never requested twice).

This is currently giving significantly worse results for 3 days than for 7 days.

0 days (= push-pull test): 13 samples, all success. (100% success)
1 days: 11 success, 1 Splitfile error (meaning RNFs/DNFs). (92% success)
3 days: 4 success, 3 Data not found, 3 Not enough data found (40% success)
7 days: 6 success, 3 Data not found (2 DNFs not eligible because nothing was 
inserted 7 days prior) (66.7% success)
15 days: 1 success, 1 Not enough data found (50% success)

There isn't really enough data here, but it is rather alarming.
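
For what it's worth, the obvious sanity check as more samples come in
is Fisher's exact test on the counts above; with this few samples it is
unlikely to show a significant 3-day vs 7-day difference, which is part
of the problem:

    # Fisher's exact test on the 3-day vs 7-day counts quoted above.
    from scipy.stats import fisher_exact

    table = [[4, 6],   # 3 days: 4 successes, 6 failures (3 DNF + 3 NEDF)
             [6, 3]]   # 7 days: 6 successes, 3 failures (3 DNF)
    odds_ratio, p_value = fisher_exact(table)
    print("two-sided p =", p_value)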

Any input? How much data is enough data? Other ways to measure?