Nick's example is an excellent one and easy to understand. If you run a million coin tosses, it's not very likely that the final result will be outside the 5% level, but it's VERY likely that you can find one or more stopping points along the way that would put it outside that level.
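Here is a minimal Python sketch of that effect (the 10,000-toss cap, the 1.96-sigma normal approximation, and the trial count are just illustrative assumptions, not numbers from the thread). Even though the simulated coin is perfectly fair, checking for significance after every toss makes it far more likely than 5% that SOME stopping point looks "significant":

import random

def ever_looks_significant(n_tosses=10000):
    heads = 0
    for i in range(1, n_tosses + 1):
        heads += random.random() < 0.5            # fair coin
        mean = i * 0.5
        sd = (i * 0.25) ** 0.5
        # two-sided 5% check via the normal approximation (illustrative)
        if i >= 30 and abs(heads - mean) > 1.96 * sd:
            return True   # we would have stopped here and announced "bias"
    return False

trials = 200
hits = sum(ever_looks_significant() for _ in range(trials))
print(f"{hits}/{trials} runs of a fair coin crossed the 5% line at some point")

With a fixed-length test only about 5% of runs should cross the line; with peeking after every toss, far more of them do.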
So one very bad thing you might be tempted to do with Go testing is to stop a test as soon as the result is reported to be outside the 95% curve and conclude that you have "most likely" improved the program. In reality your real confidence is much lower than the 95% you think it is.

Also, Dave Fotland pointed out that the error margin is two-sided: even if you run a fixed number of games between two players and get 95%, it's not really that high, because both programs generate this uncertainty. Imagine that you are comparing two coins to see which one is "superior" in its ability to generate heads. As you flip each and compare, you don't have just one coin to deal with but two, and even if one estimate comes out right the other may not. You have error margins for both coins, which means you need more math and more coin flips (see the sketch after the quoted message below).

Don

On Fri, Aug 12, 2011 at 5:27 AM, Nick Wedd <[email protected]> wrote:

> On 12/08/2011 08:24, Petr Baudis wrote:
>
>> On Thu, Aug 04, 2011 at 08:21:27PM +0200, Kahn Jonas wrote:
>>
>>> And I'm not even taking into account the fact that you want to continue
>>> testing till you reach significance. That would again require you take
>>> a lower level.
>>
>> I have seen this claim multiple times and I would be interested in some
>> more detailed argument - could someone elaborate, please? I tried
>> looking over the wikipedia and other pages but couldn't find anything
>> and I'm not sure how this could break things and more importantly, how
>> much it would break things in practice.
>
> Suppose I have a coin which I believe is biased in favour of heads. I
> decide "I will toss it 100 times, count the number of heads, and look it
> up in a statistical table. If it is significant at the 5% level, I will
> assume it is biased and tell people about it; otherwise, I will shut up
> and forget about it." That might be a reasonable decision.
>
> But suppose instead I decided "I will toss it many times, counting the
> number of heads. After each observation of heads, I will look the results
> so far up in a statistical table. If it is significant at the 5% level, I
> will assume it is biased and tell people about it; otherwise, I will
> carry on until I get bored." Eventually, you announce "I have tossed my
> coin 1573 times, and I find that the excess of heads is significant at
> the 5% level." People should be sceptical. The statistical test you used
> is based on the assumption that you decided at the start to toss your
> coin 1573 times, but that is not the procedure you actually used. Maybe
> your coin is biased, maybe it isn't; but the "5% level" you are claiming
> is based on a false assumption.
>
> If you don't see why it is false, consider this more extreme example:
> "I will toss it 1000 times, look for the run of 100 tosses that has the
> most heads, look up the results for that run in a statistical table, and
> announce its significance level."
>
> Nick
>
>> (I admit that my probability/statistics background is very very sketchy.
>> One of the things I want to fix during my PhD studies. ;-)
>
> --
> Nick Wedd
> [email protected]
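And here is a minimal Python sketch of the two-coin comparison mentioned above (the 1,000 flips per coin, the 1.96 factor, and the normal approximation are illustrative assumptions only). Because both estimates carry their own error margin, the margin on the difference combines the two and is wider than either one alone:

import math
import random

def heads_rate(p, n):
    # flip a coin with heads-probability p, n times; return the observed rate
    return sum(random.random() < p for _ in range(n)) / n

n = 1000
p1 = heads_rate(0.5, n)                   # coin (or program) A
p2 = heads_rate(0.5, n)                   # coin (or program) B, equally strong

se1 = math.sqrt(p1 * (1 - p1) / n)        # error margin of A's estimate alone
se2 = math.sqrt(p2 * (1 - p2) / n)        # error margin of B's estimate alone
se_diff = math.sqrt(se1**2 + se2**2)      # margin of the comparison A - B

print(f"A: {p1:.3f} +/- {1.96 * se1:.3f}")
print(f"B: {p2:.3f} +/- {1.96 * se2:.3f}")
print(f"A - B: {p1 - p2:+.3f} +/- {1.96 * se_diff:.3f}  (wider than either alone)")

To squeeze the margin on the difference back down to what a single coin would give you, you have to flip both coins more.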
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
