Thank you Rémi! So the 85.5% +/- 2.5 reported by GoGui would be 85.5% +/- 5 at 95% confidence, and 85.5% +/- 7.5 at three standard deviations (about 99.7%). Correct?
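[Editor's note: the arithmetic in the thread is easy to reproduce. Below is a minimal Python sketch using the normal approximation to the binomial, sqrt(p(1-p)/n), for the two 200-game runs discussed; the helper name `win_rate_interval` is ours, not from gogui-twogtp.]

```python
import math

def win_rate_interval(wins, games, sigmas=2):
    """Normal approximation to the binomial: p +/- sigmas * sqrt(p(1-p)/n).

    sigmas=1 gives the interval gogui-twogtp reports; sigmas=2 gives
    roughly a 95% confidence interval.
    """
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)
    return p - sigmas * se, p + sigmas * se

# The two runs from the thread: 171/200 and 158/200 wins against GnuGo.
lo1, hi1 = win_rate_interval(171, 200)
lo2, hi2 = win_rate_interval(158, 200)
print(f"run 1, 95%: [{lo1:.3f}, {hi1:.3f}]")  # roughly [0.805, 0.905]
print(f"run 2, 95%: [{lo2:.3f}, {hi2:.3f}]")  # roughly [0.732, 0.848]
print("overlap at 2 sigma:", lo1 <= hi2)      # True: not clearly different
```

At one sigma the intervals are +/- 2.5 and +/- 2.9, matching gogui-twogtp's output; at two sigma they overlap, which is Rémi's point that the two runs are statistically compatible.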
And thanks for the table. I think that's good enough for now. I've now figured out how to calculate the std. deviation myself (it is easy), and with those two tools together I can now see that 200 games is a bit on the low end. :) I had expected as much, but it's good to know for sure.

Urban

On Tue, Nov 3, 2015 at 9:46 AM, Rémi Coulom <[email protected]> wrote:

> The intervals given by gogui are the standard deviation, not the usual
> 95% confidence intervals.
>
> For 95% confidence intervals, you have to multiply the standard
> deviation by two.
>
> And you still have a 5% chance of not being inside the interval, so you
> can still get the occasional non-overlapping intervals.
>
> Likelihood of superiority is an interesting statistical tool:
> https://chessprogramming.wikispaces.com/LOS+Table
>
> For more advanced tools for deciding when to stop testing, there is SPRT:
> http://www.open-chess.org/viewtopic.php?f=5&t=2477
> https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
>
> Rémi
>
> On 11/03/2015 09:38 AM, Urban Hafner wrote:
>
>> So,
>>
>> I'm currently running 200 games against GnuGo to see if a change to my
>> program made a difference. But I now wonder whether that's enough games,
>> as I ran the same benchmark with the same code (but a different compiler
>> version) and got different results:
>>
>> 85.5% wins (171 games of 200) the first time (+/- 2.5 according to
>> gogui-twogtp)
>> 79.0% wins (158 games of 200) the second time (+/- 2.9 according to
>> gogui-twogtp)
>>
>> Looking at these results, I would believe the difference is significant
>> (the intervals don't overlap), but then the real difference is only
>> 13 wins …
>>
>> My statistics knowledge is sketchy at best, but assuming that what
>> gogui-twogtp calculates is the 95% confidence interval (I'm pretty sure
>> I'm mixing up terms here), it could well be that the difference between
>> the two runs above is just random.
>>
>> So, this leads me to two questions:
>>
>> 1. How many games do you normally run to test whether a change is
>> significant "enough"?
>> 2. Any good resources on how to calculate these statistics (e.g. if I
>> wanted to find the error margin for a 99% confidence interval)?
>>
>> Urban
>> --
>> Blog: http://bettong.net/
>> Twitter: https://twitter.com/ujh
>> Homepage: http://www.urbanhafner.com/

--
Blog: http://bettong.net/
Twitter: https://twitter.com/ujh
Homepage: http://www.urbanhafner.com/
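[Editor's note: the likelihood-of-superiority table Rémi links can also be computed directly. A minimal sketch of the usual normal-approximation LOS formula, LOS = 1/2 * (1 + erf((w - l) / sqrt(2(w + l)))), for a head-to-head match with w wins and l losses, draws ignored; the function name is ours.]

```python
import math

def likelihood_of_superiority(wins, losses):
    """P(player A is genuinely stronger than B), given a head-to-head
    record of `wins` and `losses` (draws ignored), via the normal
    approximation used in the LOS tables."""
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

# A 105-95 result in 200 decisive games is far from conclusive:
print(likelihood_of_superiority(105, 95))   # about 0.76
# An even score gives exactly 0.5, as expected:
print(likelihood_of_superiority(100, 100))  # 0.5
```

Note that LOS answers a different question than the confidence intervals above: it gives the probability that one player is stronger at all, not an estimate of by how much. For deciding when to stop a match early, the SPRT links in the quoted message are the next step.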
_______________________________________________
Computer-go mailing list
[email protected]
http://computer-go.org/mailman/listinfo/computer-go
