Hi Timothy,

Here is a stats question I encounter from time to time.

Suppose I run N BG games and collect the average win rates and gammon
rates: four estimates, which are dependent since they sum to 1.
How do I determine the confidence intervals for each? This is a
4-dimensional vector, and it seems like a non-trivial question, but I
assume this crops up a lot and must have a standard answer.
What is your take?
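
For concreteness, here is a rough sketch of the kind of thing I have in
mind (Python, with made-up counts; it just puts a separate
normal-approximation interval on each rate and ignores the sum-to-1
dependence, which is the part I am unsure how to handle properly):

    import numpy as np
    from scipy import stats

    # Hypothetical counts out of N games:
    # [single wins, gammon wins, single losses, gammon losses]
    counts = np.array([430, 80, 410, 80])
    N = counts.sum()
    p = counts / N

    # Marginal 95% Wald intervals for each component of the multinomial
    # vector; these ignore the dependence between the components.
    z = stats.norm.ppf(0.975)
    half_width = z * np.sqrt(p * (1 - p) / N)
    for name, pi, hw in zip(
            ["win", "win gammon", "lose", "lose gammon"], p, half_width):
        print(f"{name:12s}: {pi:.3f} +/- {hw:.3f}")

Is a set of per-component intervals like that defensible, or is there a
standard joint treatment (simultaneous intervals, a Dirichlet posterior,
or a bootstrap)?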

Thanks, Joseph


On Tue, 12 Nov 2019 at 15:17, Timothy Y. Chow <[email protected]>
wrote:

> Ian,
>
> Thanks for putting all this effort into a new MET!
>
> I don't know too much about the innards of GNU Backgammon, but I do know
> something about math and statistics.
>
> In terms of how many matches you would have to play between GNU-old-MET
> and GNU-new-MET, that depends on how much stronger GNU-new-MET is.
> Suppose that GNU-new-MET has a 51%/49% edge over GNU-old-MET.  That means
> that if you played 1000 matches, then you would expect a score of 510 to
> 490.  The problem is that if GNU-old-MET were playing against itself, the
> standard deviation would be about 15.8.  So a 510 to 490 result would be
> far from statistically significant.  You'd need about 10000 trials to
> barely reach statistical significance: The expected score would be 5100 to
> 4900 and the standard deviation would be 50, so 5100 would be two standard
> deviations away.  In general the formula for the standard deviation is
> sqrt(n)/2 where n is the number of matches.
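>
> As a quick sanity check, here is a short back-of-the-envelope sketch in
> Python that reproduces those numbers (the win counts assume the
> hypothetical 51%/49% edge for the new MET):
>
>     import math
>
>     def match_sd(n):
>         # SD of one side's win count over n matches between equal
>         # opponents: sqrt(n * 0.5 * 0.5) = sqrt(n)/2
>         return math.sqrt(n) / 2
>
>     for n in (1000, 10000):
>         expected_wins = 0.51 * n           # the assumed 51%/49% edge
>         margin = expected_wins - n / 2     # wins above a tied score
>         print(n, round(match_sd(n), 1), round(margin / match_sd(n), 2))
>
> For n = 1000 the 10-match margin is only about 0.6 standard deviations,
> while for n = 10000 the 100-match margin is the two standard deviations
> mentioned above.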
>
> There's another point to be cognizant of, which is that there is a
> distinction between statistically significant evidence of the bare-bones
> claim that "the new MET is better," and a good estimate of *how* much
> stronger GNU-new-MET is than GNU-old-MET.  Let's say you played 10000
> matches and the score was 5100 to 4900.  You could then claim that the new
> MET is better, and say that this claim is significant at the two standard
> deviation level.  But you *couldn't* claim that you are 95% confident that
> the new MET gives you a 51%/49% edge over the old MET.  To get a good
> estimate of the edge requires more trials.  How many trials you need would
> depend on how sharp an estimate you want.
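>
> To make that last point concrete, here is a rough rule of thumb (my own
> sketch, using the usual normal approximation for a win rate near 50%)
> for how many matches it takes to pin the edge down to a given precision:
>
>     def trials_for_half_width(half_width, z=1.96):
>         # A 95% interval for a win probability near 1/2 has half-width
>         # about z * sqrt(0.25 / n), so n is about (z / (2 * half_width))**2.
>         return (z / (2 * half_width)) ** 2
>
>     print(trials_for_half_width(0.01))    # about 9600 matches for +/- 1 point
>     print(trials_for_half_width(0.005))   # about 38400 for +/- 0.5 points
>
> So pinning the edge down to within half a percentage point would take
> roughly four times as many matches as the 10000 needed to barely reach
> significance.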
>
> I don't have as much insight into what might be going wrong with the
> cubeful calculations.  It does sound to me as though there might be a
> problem with floating-point precision, but someone with knowledge of the
> code will have to comment on that.
>
> Tim
>
>
