Hi,

On Thu, Aug 4, 2011 at 19:29, David Fotland <[email protected]> wrote:
> Did each fuego play the same number of games vs gnugo, and did each play
> half its games on each color?
Yes, I set up an all-play-all competition with gomill.

On Thu, Aug 4, 2011 at 19:55, Erik van der Werf <[email protected]> wrote:
> On Thu, Aug 4, 2011 at 6:57 PM, Vlad Dumitrescu <[email protected]> wrote:
>> The scores towards gnugo are almost
>> identical, but the two fuegos score 449-415, which is 52% and the 95%
>> confidence is ~3%, i.e. ~10 ELO.
>
> That 3% is not a 95% confidence interval, more like 1 standard
> deviation... (so nothing with high confidence yet)

I took the easy way out and used a formula that David Fotland mentioned
on this list a while ago:

>There is a simple formula to estimate the confidence interval of a result.
>I use it to see if a new version is likely better than a reference version
>(but I use 95% confidence intervals, so over hundred of experiments it gives
>me the wrong answer too often).
>1.96 * sqrt(wr * (1 - wr) / trials)
>Where wr is the win rate of one version vs the reference, and trials is the
>number of test games.

On Thu, Aug 4, 2011 at 20:21, Kahn Jonas <[email protected]> wrote:
> All the more since you're testing the same idea on two bots
> simultaneously. So if you want to be wrong at most five percent of the
> time, and consider you are better as soon as one of the bots gets
> better, you have to make individual tests at the 2.5% level.

For now I am running the bots without any modification, to check that
everything works. So I think the results between the two identical bots
should have been closer to 50%, or should at least sometimes swing to the
other side of 50%. Right now it's 625-566, which is 52.5%, with a 95%
confidence interval of +/-2.83% according to the formula above.

The results are:

fuego-1.1 v fuego-new (1199/2000 games)
unknown results: 1 0.08%
board size: 9   komi: 6.5
             wins            black          white        avg cpu
fuego-1.1     569 47.46%     386 64.33%     183 30.55%      2.69
fuego-new     629 52.46%     415 69.28%     214 35.67%      2.67
                             801 66.81%     397 33.11%

I realize that statistical results don't always match what one would
expect, but this should be a straightforward case...
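
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the quoted formula applied to the 625-566 score. The function name `win_rate_interval` and the `alpha` parameter are my own for illustration (they are not part of gomill or fuego), and the z value is taken from the normal distribution instead of being hard-coded as 1.96, which is an assumption about how one would generalise the formula:

```python
import math
from statistics import NormalDist


def win_rate_interval(wins, losses, alpha=0.05):
    """Return (win rate, half-width) of a two-sided normal-approximation
    interval: z * sqrt(wr * (1 - wr) / trials).

    alpha=0.05 gives z of about 1.96, i.e. the formula quoted above.
    The name and the alpha parameter are illustrative, not from any library.
    """
    trials = wins + losses
    wr = wins / trials
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(wr * (1 - wr) / trials)
    return wr, half_width


# Score between the two unmodified fuegos at the time of writing.
wr, hw = win_rate_interval(625, 566)
print(f"win rate {wr:.2%} +/- {hw:.2%}")
print("50% inside the interval:", wr - hw <= 0.5 <= wr + hw)

# Stricter per-test level, as suggested when testing two bots at once.
wr2, hw2 = win_rate_interval(625, 566, alpha=0.025)
print(f"at the 2.5% level: +/- {hw2:.2%}")
```

With these numbers the interval comes out at roughly 52.5% +/- 2.8%, so 50% is still (just) inside it, and the interval gets wider at the 2.5% level.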
Thanks a lot for all the answers!

regards,
/Vlad
