On Tue, 3 Nov 2015, Urban Hafner wrote:

Thank you Remi!
So the 85.5% +/- 2.5 reported by GoGui would be 85.5% +/- 5 for 95% and
85.5% +/- 7.5.
Correct?

Correct.
But the intervals do not have to be disjoint for the difference to be
significant. You may divide those intervals by $\sqrt{2}$ before testing
whether they overlap (this only holds asymptotically, of course, but so
does the whole discussion so far).
The factor $\sqrt{2}$ applies when both runs have the same number of
games, as in your example.
More precisely, you are computing a confidence interval on the
difference of expectations. You would need a few corrections to be
perfectly rigorous, but that should be enough for your needs.
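
As a rough illustration of the point above, here is a minimal sketch that
tests the difference between the two runs quoted in this thread (171/200
and 158/200). It assumes the normal approximation and independent runs;
the function name difference_z is made up for this example.

```python
import math

def difference_z(wins_a, games_a, wins_b, games_b):
    """z-score of the difference between two win rates (normal approximation)."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    # Standard error of each win rate.
    se_a = math.sqrt(p_a * (1.0 - p_a) / games_a)
    se_b = math.sqrt(p_b * (1.0 - p_b) / games_b)
    # Variances add for the difference of two independent estimates.
    # With se_a == se_b this is exactly se * sqrt(2).
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return (p_a - p_b) / se_diff

z = difference_z(171, 200, 158, 200)
print(f"z = {z:.2f}")
# |z| above roughly 2 would indicate significance at the 95% level.
print("significant at 95%" if abs(z) > 1.96 else "not significant at 95%")
```

With these numbers z comes out at about 1.7, so under these assumptions
the two runs are not significantly different at the 95% level.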

Jonas


And thanks for the table. I think that’s good enough for now. I’ve now
figured out how to calculate the std. deviation myself (it is easy) and
with those two tools together I can now see that 200 games is a bit on
the low end. :) I had expected as much but it’s good to know for sure.

Urban

On Tue, Nov 3, 2015 at 9:46 AM, Rémi Coulom <[email protected]> wrote:
      The intervals given by gogui are one standard deviation, not the usual
      95% confidence intervals.

      For 95% confidence intervals, you have to multiply the standard deviation
      by two (more precisely, by about 1.96).
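
A minimal sketch of that calculation, assuming the reported value is the
usual binomial standard error sqrt(p * (1 - p) / n). The function name
win_rate_interval is made up for this example and is not part of GoGui.

```python
import math

def win_rate_interval(wins, games, k=2.0):
    """Return (rate, standard_error, low, high) using the normal approximation.

    k is the number of standard deviations; k=2 gives roughly a 95% interval.
    """
    p = wins / games
    se = math.sqrt(p * (1.0 - p) / games)
    return p, se, p - k * se, p + k * se

for wins in (171, 158):
    p, se, low, high = win_rate_interval(wins, 200)
    print(f"{wins}/200: {100 * p:.1f}% +/- {100 * se:.1f} (1 sd), "
          f"roughly {100 * low:.1f}% to {100 * high:.1f}% at 95%")
```

For the two runs in this thread the standard errors round to 2.5 and 2.9
percentage points, which matches the values reported by gogui-twogtp.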

      And you still have the 5% chance of not being inside the interval, so you
      can still get the occasional non-overlapping intervals.

      Likelihood of superiority is an interesting statistical tool:
      https://chessprogramming.wikispaces.com/LOS+Table

      For more advanced tools for deciding when to stop testing, there is SPRT:
      http://www.open-chess.org/viewtopic.php?f=5&t=2477
      https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
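
To give an idea of how such a stopping rule works, here is a minimal
sketch of Wald's SPRT for plain win/loss results. It assumes no draws
and two fixed hypotheses for the win probability (p0 and p1); the
function name and the default error rates are made up for this example.
The chess tools discussed at the links above use Elo-based models that
also handle draws, so this is only the simplest variant.

```python
import math

def sprt(results, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on an iterable of booleans.

    results: True for a win, False for a loss, in the order played.
    H0: win probability is p0.  H1: win probability is p1 (with p1 > p0).
    Returns (decision, games_used).
    """
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    llr = 0.0                                # running log-likelihood ratio
    games = 0
    for games, won in enumerate(results, start=1):
        if won:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "accept H1", games
        if llr <= lower:
            return "accept H0", games
    return "continue", games

# Hypothetical usage: is the win rate against GnuGo 79% (H0) or 85% (H1)?
print(sprt([True, True, False, True], p0=0.79, p1=0.85))
```

The test stops as soon as the evidence crosses one of the two bounds, so
clear-cut changes need far fewer games than a fixed-size match.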

      Rémi

      On 11/03/2015 09:38 AM, Urban Hafner wrote:
      So,

      I’m currently running 200 games against GnuGo to see if a change to
      my program made a difference. But I now wonder if that’s enough
      games as I ran the same benchmark with the same code (but a
      different compiler version) and received different results:

      85.5% wins (171 games of 200) the first time (+/- 2.5 according to
      gogui-twogtp)
      79.0% wins (158 games of 200) the second time (+/- 2.9 according to
      gogui-twogtp)

      Looking at these results would make me believe that the difference
      is significant (the intervals don’t overlap) but then the real
      difference is only 13 wins …

      My statistics knowledge is sketchy at best but assuming that what
      gogui-twogtp calculates is the 95% confidence interval (I’m pretty
      sure I’m mixing terms here) it could well be that the difference
      between the two runs above is just random.

      So, this leads me to two questions:

      1. How many games do you normally run to test if a change is
      significant “enough”?
      2. Any good resources on how to calculate these statistics (i.e. if
      I wanted to find the error margin for a 99% confidence interval)?
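
Both questions can be approached with the normal approximation. The
sketch below is only an illustration: the function names are made up,
and the example win rate of 85% and target margin of 2.5 points are
assumptions picked to resemble the numbers above.

```python
import math
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z multiplier for a confidence level such as 0.95 or 0.99."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

def error_margin(wins, games, level):
    """Half-width of the confidence interval on the win rate."""
    p = wins / games
    return z_for_confidence(level) * math.sqrt(p * (1.0 - p) / games)

def games_needed(p, half_width, level):
    """Games required so that the interval half-width is at most half_width."""
    z = z_for_confidence(level)
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)

print(f"z for 99%: {z_for_confidence(0.99):.3f}")
print(f"99% margin for 171/200: +/- {100 * error_margin(171, 200, 0.99):.1f}")
print("games for +/- 2.5 at 95%, p = 0.85:", games_needed(0.85, 0.025, 0.95))
```

Because the margin shrinks with the square root of the number of games,
halving the margin requires about four times as many games.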

      Urban
      --
      Blog: http://bettong.net/
      Twitter: https://twitter.com/ujh
      Homepage: http://www.urbanhafner.com/


_______________________________________________
Computer-go mailing list
[email protected]
http://computer-go.org/mailman/listinfo/computer-go

