Hi Timothy,

Here is a stats question I encounter from time to time.
Suppose I run N BG games and collect the average win rates and gammon rates. That gives 4 estimates, which are dependent because they sum to 1. How do I determine the confidence intervals for each? This is a 4-d vector and it seems like a non-trivial question, but I assume this crops up a lot and must have a standard answer. What is your take?

Thanks,
Joseph

On Tue, 12 Nov 2019 at 15:17, Timothy Y. Chow <[email protected]> wrote:

> Ian,
>
> Thanks for putting all this effort into a new MET!
>
> I don't know too much about the innards of GNU Backgammon, but I do know
> something about math and statistics.
>
> In terms of how many matches you would have to play between GNU-old-MET
> and GNU-new-MET, that depends on how much stronger GNU-new-MET is.
> Suppose that GNU-new-MET has a 51%/49% edge over GNU-old-MET. That means
> that if you played 1000 matches, then you would expect a score of 510 to
> 490. The problem is that if GNU-old-MET were playing against itself, the
> standard deviation would be about 15.8. So a 510 to 490 result would be
> far from statistically significant. You'd need about 10000 trials to
> barely reach statistical significance: The expected score would be 5100 to
> 4900 and the standard deviation would be 50, so 5100 would be two standard
> deviations away. In general the formula for the standard deviation is
> sqrt(n)/2 where n is the number of matches.
>
> There's another point to be cognizant of, which is that there is a
> distinction between statistically significant evidence of the bare-bones
> claim that "the new MET is better," and a good estimate of *how* much
> stronger GNU-new-MET is than GNU-old-MET. Let's say you played 10000
> matches and the score was 5100 to 4900. You could then claim that the new
> MET is better, and say that this claim is significant at the two standard
> deviation level. But you *couldn't* claim that you are 95% confident that
> the new MET gives you a 51%/49% edge over the old MET. To get a good
> estimate of the edge requires more trials. How many trials you need would
> depend on how sharp an estimate you want.
>
> I don't have as much insight into what might be going wrong with the
> cubeful calculations. It does sound to me that there might be a problem
> with floating-point precision, but someone with knowledge of the code will
> have to comment on that.
>
> Tim
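
The arithmetic in the quoted message can be checked with a short sketch. It assumes each match is an independent trial with a 50% win chance under the null case (GNU-old-MET playing itself), and it uses the usual normal-approximation interval for the estimated win rate. The function names are illustrative only and do not come from GNU Backgammon.

```python
import math


def null_std_dev(n_matches):
    """Standard deviation of the win count when every match is a fair coin flip."""
    # Binomial(n, 0.5) has variance n * 0.5 * 0.5, so the sd is sqrt(n) / 2.
    return math.sqrt(n_matches) / 2


def z_score(wins, n_matches):
    """Number of null standard deviations between the win count and an even split."""
    return (wins - n_matches / 2) / null_std_dev(n_matches)


def win_rate_half_width(wins, n_matches, z=1.96):
    """Half-width of a normal-approximation interval for the true win rate.

    Assumption: z=1.96 corresponds to roughly 95% coverage, and the normal
    approximation is adequate at these sample sizes.
    """
    p_hat = wins / n_matches
    return z * math.sqrt(p_hat * (1 - p_hat) / n_matches)


if __name__ == "__main__":
    for n, wins in [(1000, 510), (10000, 5100)]:
        print(
            f"n={n}: sd={null_std_dev(n):.1f}, "
            f"z={z_score(wins, n):.2f}, "
            f"win rate={wins / n:.3f} +/- {win_rate_half_width(wins, n):.4f}"
        )
```

With 1000 matches the 510 to 490 score is well under one standard deviation from an even split, while with 10000 matches the 5100 to 4900 score is two standard deviations away. In the second case the interval for the win rate is still roughly one percentage point wide on either side of 51%, which is why a significant result does not by itself pin down the size of the edge.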
