Re: [Computer-go] testing improvements

Vlad Dumitrescu Thu, 04 Aug 2011 13:15:58 -0700

Hi,

On Thu, Aug 4, 2011 at 19:29, David Fotland <[email protected]> wrote:
> Did each fuego play the same number of games vs gnugo, and did each play
> half its games on each color?


Yes, I set up an all-play-all competition with gomill.

On Thu, Aug 4, 2011 at 19:55, Erik van der Werf
<[email protected]> wrote:
> On Thu, Aug 4, 2011 at 6:57 PM, Vlad Dumitrescu <[email protected]> wrote:
>> The scores towards gnugo are almost
>> identical, but the two fuegos score 449-415, which is 52% and the 95%
>> confidence is ~3%, i.e. ~10 ELO.
>
> That 3% is not a 95% confidence interval, more like 1 standard
> deviation... (so nothing with high confidence yet)

I took the easy way out and used a formula that David Fotland mentioned
on this list a while ago:

>There is a simple formula to estimate the confidence interval of a result.
>I use it to see if a new version is likely better than a reference version
>(but I use 95% confidence intervals, so over hundreds of experiments it gives
>me the wrong answer too often).
>1.96 * sqrt(wr * (1 - wr) / trials)
>Where wr is the win rate of one version vs the reference, and trials is the
>number of test games.

On Thu, Aug 4, 2011 at 20:21, Kahn Jonas <[email protected]> wrote:
> All the more since you're testing the same idea on two bots
> simultaneously. So if you want to be wrong at most five percent of the
> time, and consider you are better as soon as one of the bots gets
> better, you have to make individual tests at the 2.5% level.

For now I am running the bots without any modifications, to check that
everything works. So I think the result between the two identical bots
should be closer to 50%, or should at least swing to the other side of
50% sometimes. Right now it's 625-566, which is 52.5% with a 95%
interval of about +/- 2.8% according to the formula above.
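
For reference, the formula quoted above can be applied to the 625-566
score with a few lines of Python. This is only a minimal sketch: the
function name is made up for illustration, and it assumes the games are
independent and that the normal approximation is adequate at this sample
size.

```python
import math

def confidence_half_width(wins, losses, z=1.96):
    """Half-width of the normal-approximation interval for a win rate.

    Implements z * sqrt(wr * (1 - wr) / trials) from the quoted formula.
    z=1.96 corresponds to a two-sided 95% interval.
    """
    trials = wins + losses
    wr = wins / trials
    return z * math.sqrt(wr * (1 - wr) / trials)

wins, losses = 625, 566  # current score between the two identical bots
wr = wins / (wins + losses)
half = confidence_half_width(wins, losses)

print(f"win rate {wr:.2%} +/- {half:.2%}")
# If 50% lies inside the interval, the result is still compatible
# with two equally strong bots.
print("interval contains 50%:", wr - half <= 0.5 <= wr + half)
```

With these numbers the lower end of the interval is still just below
50%, so the formula by itself does not yet say the two bots differ.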

The results are
fuego-1.1 v fuego-new (1199/2000 games)
unknown results: 1 0.08%
board size: 9   komi: 6.5
            wins              black          white        avg cpu
fuego-1.1    569 47.46%       386 64.33%     183 30.55%      2.69
fuego-new    629 52.46%       415 69.28%     214 35.67%      2.67
                              801 66.81%     397 33.11%

I realize that statistical results don't always match what one would
expect, but this should be a straightforward case...
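
One way to put a number on how surprising the table above is: a
two-sided test of the 629-569 split against a fair 50% coin, using the
normal approximation. This is a rough sketch, not a definitive analysis.
The helper name is hypothetical, the single unknown result is left out,
and the games are assumed to be independent.

```python
import math

def two_sided_p_value(wins, losses):
    """Two-sided p-value for wins vs. losses under a 50% win rate.

    Uses the normal approximation to the binomial distribution:
    z = (wins - n/2) / sqrt(n/4), p = erfc(|z| / sqrt(2)).
    """
    n = wins + losses
    z = (wins - n / 2) / math.sqrt(n * 0.25)
    return math.erfc(abs(z) / math.sqrt(2))

# Wins of fuego-new and fuego-1.1 from the table (unknown result excluded).
p = two_sided_p_value(629, 569)
print(f"two-sided p-value: {p:.3f}")
```

Under these assumptions the p-value comes out somewhat above 0.05, so a
split like this between identical bots is unusual but not out of the
question.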

Thanks a lot for all the answers!

regards,
/Vlad
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
