On Mon, Sep 7, 2009 at 9:17 AM, Matthew
Toseland <toad at amphibian.dyndns.org> wrote:
> So far the only tool we have for measuring this is the LongTermPushPullTest. 
> This involves inserting a 64KB splitfile to an SSK and then pulling after
> (2^n)-1 days for various n (the same key is never requested twice).
>
> This is currently giving significantly worse results for 3 days than for 7 
> days.
>
> 0 days (= push-pull test): 13 samples, all success. (100% success)
> 1 days: 11 success, 1 Splitfile error (meaning RNFs/DNFs). (92% success)
> 3 days: 4 success, 3 Data not found, 3 Not enough data found (40% success)
> 7 days: 6 success, 3 Data not found (2 DNFs not eligible because nothing was 
> inserted 7 days prior) (66.7% success)
> 15 days: 1 success 1 Not enough data found (50% success)
>
> There isn't really enough data here, but it is rather alarming.
>
> Any input? How much data is enough data? Other ways to measure?
>

The statistical test to use on this sort of data is a chi-squared test.
http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test

To calculate the p-value, you need to evaluate the chi-squared
distribution; gnumeric provides a chidist function that does that, as
do other spreadsheets.
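
For anyone reproducing this in Python rather than a spreadsheet, scipy's
chi2.sf does the same job as chidist, i.e. the upper tail of the
chi-squared distribution.  A minimal sketch (assuming scipy is installed):

from scipy.stats import chi2

def p_value(statistic, dof):
    # Upper-tail probability of the chi-squared distribution,
    # i.e. the same thing as gnumeric's chidist(statistic, dof).
    return chi2.sf(statistic, dof)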

For 5 rows and 2 columns, we have (2-1)*(5-1)=4 degrees of freedom.
The test statistic on all the data is 14.03, which gives a p-value of
0.0072.  That is, there's about a 0.72% chance of getting the above
results (or more extreme ones) by luck, if there is actually no
difference between the different time periods.  However, in this case
we have a small sample size; we should perhaps be applying Yates'
continuity correction:
http://en.wikipedia.org/wiki/Yates%27_correction_for_continuity
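
As a cross-check, the whole table can be fed to scipy's chi2_contingency.
This is only a sketch (the counts below are just the success/failure
numbers from the results quoted above), but it should reproduce the same
figures:

from scipy.stats import chi2_contingency

# Success / failure counts per delay (0, 1, 3, 7, 15 days).
observed = [
    [13, 0],
    [11, 1],
    [ 4, 6],
    [ 6, 3],
    [ 1, 1],
]

# correction=False gives the plain Pearson statistic; scipy only applies
# Yates' correction to 2x2 tables anyway.
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(stat, dof, p)  # roughly 14.03, 4, 0.0072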

Re-calculating, I now get a chi-squared value of 9.12, and a p-value
of 0.0581.  That is, the data is significant only at the 5.8% level.
Normally this is not considered strong enough to conclude
significance.  However, Yates' correction is overly conservative; we
have enough samples overall to not apply it, but having so few samples
in some groups (the 15-day group) is slightly worrisome.  I'm not sure
precisely what approach to take here, but I believe the true p-value
lies somewhere between the two numbers calculated.  I conclude that the
data is significant, but not highly significant.
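
Since scipy's built-in correction only kicks in for 2x2 tables,
reproducing the corrected figure above means applying the per-cell
correction by hand.  A sketch, using the same counts as before:

import numpy as np
from scipy.stats import chi2

observed = np.array([[13, 0], [11, 1], [4, 6], [6, 3], [1, 1]], dtype=float)

# Expected counts under the null hypothesis of no difference between delays.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Yates-style continuity correction: shrink each |O - E| by 0.5
# (clipped at zero) before squaring.
stat = (np.maximum(np.abs(observed - expected) - 0.5, 0) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(stat, chi2.sf(stat, dof))  # roughly 9.12, 0.058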

Comparing days 3 to 7 is harder, because we have to correct for the
fact that you looked at the data before deciding which rows to compare
(had the data been different, you might instead be saying that the 7-day
results were worse than the 1-day results, for example).  We start with
the same chi-squared
test.  We get a test statistic of 1.35 with 1 degree of freedom, for a
p-value of 0.245.  At this point, we can stop without making the
correction to the p-value; we've failed to reject the null hypothesis
of no difference between 3-day and 7-day results.  More data
collection is in order.
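
The same sketch for the 3-day vs 7-day comparison (again without the
continuity correction; with correction=True scipy would apply Yates'
here, since this table is 2x2):

from scipy.stats import chi2_contingency

# 3-day vs 7-day success/failure counts from the results above.
observed = [
    [4, 6],  # 3 days
    [6, 3],  # 7 days
]

stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(stat, dof, p)  # roughly 1.35, 1, 0.245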


Now, as for what to measure...

Do we think there's a meaningful difference between testing a 2+2
splitfile and testing 4 blocks independently?  I'd like to see the
block-level success rates not including healing; we can always
calculate the splitfile success odds from that, assuming we think the
blocks are independent.  I suggest testing the splitfile, and also
testing several individual blocks.
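
If the blocks are independent, and any 2 of the 4 blocks of a 2+2
splitfile are enough to reconstruct it (my understanding of how FEC
splitfiles behave; treat that as an assumption), then converting a
block-level success rate into splitfile success odds is just a binomial
tail.  A sketch:

from scipy.stats import binom

def splitfile_success(p_block, blocks_needed=2, blocks_total=4):
    # Probability that at least blocks_needed of blocks_total independent
    # blocks are retrieved, each succeeding with probability p_block.
    return binom.sf(blocks_needed - 1, blocks_total, p_block)

print(splitfile_success(0.8))  # ~0.97 for an 80% per-block success rate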

I'd like to see more than one such test per day (2-3, say) assuming
they don't take too long to run.

I think we should consider a test where we insert a splitfile or
collection of blocks every day, and then try to download the same
collection 2^n days after it was inserted.  I think this gives a
better model for a file that is inserted and gets downloaded, but
isn't particularly popular.  A file that never gets downloaded for
weeks after insert, and then is downloaded, is a less useful case,
imho.  (It's a worst-case; those are useful, but so are the merely bad
cases.)

I'd also like to see the same tests run between a fixed pair of
long-established nodes, rather than freshly bootstrapped nodes.
Again, fresh nodes are a worst case, but not the only interesting case.
I think for the repeated-downloads test you need one node per download
of the file, which is a little harder.

Evan Daniel
