> the Inference Player won 51.4% against Strong Player.

This result is comparable to what I observed in Maven vs Maven games 
where one side used inferences.

In the Toronto Open, Maven played the first day using inferences, but 
then I disabled them for the subsequent games. I actually do not know 
whether it was better off using inferences or not, but I have always 
had some doubts about inferences when playing against humans, and the 
first day's games did little to encourage me.

The first problem with inferences against humans is that humans don't 
play "well" according to the computer model. Maven's static evaluation 
produces an ordering that has very high correlation to a simulator's 
ordering. The ordering by static evaluation is good enough that we can 
make sound inferences that have very high degrees of confidence. My 
tests show that this is not the case against humans, even when the 
humans are clearly championship caliber. When humans are weaker, then 
it is hard to draw any conclusions at all.

Just to be clear: I do not know whether using inferences results in 
stronger play when computers play humans. I do know that the posterior 
probability distributions become more dispersed, which reduces the 
value of inferences.

The second problem with inferences is that they cause your program to 
close the board. This is one of those annoying unexpected side-effects 
that gets you just when you think you have made a big improvement. The 
reason is that the opponent's inferred racks are better than random, so 
the simulation tries to counteract that.

Now, against a computer program this is probably good strategy. Quackle 
may even be particularly vulnerable to inferences (unsupported, but 
plausible, speculation here) because of its proclivity for fishing. But 
against a human any policy that closes the board will have a downside 
that probably cannot be made up by gaining a piddly 5 points.

Technical notes:

The 630-game match that resulted in an 18-game edge to the inference 
player is not statistically significant. The inference player won only 
9 games more than an even split, and the standard deviation of wins 
over a 630-game match is 12.5 games. So the edge was not even a single 
standard deviation (a one-sided p-value of roughly 0.24).

On the points-per-game calculation: they don't say what the standard 
deviation of the difference in points per game is, but I have usually 
seen a standard deviation of about 100 points in computer games, so 
let's assume that. The s.d. of the mean over 630 games would be around 
4 points. So the 5.1 ppg edge is significant at p = 0.10.
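Both calculations can be reproduced in a few lines of Python with a 
normal approximation (a sketch; the 100-point per-game s.d. is the 
assumed figure from above, not something the study reported):

```python
import math

def one_sided_p(z):
    """One-sided p-value of a z-score under a normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2))

games = 630

# Win-count edge: 9 games above an even 315-315 split.
sd_wins = math.sqrt(games * 0.5 * 0.5)   # ~12.5 games
p_wins = one_sided_p(9 / sd_wins)        # ~0.24: not significant

# Points-per-game edge: 5.1 ppg, assuming a 100-point s.d. per game.
sd_ppg = 100 / math.sqrt(games)          # ~4.0 points
p_ppg = one_sided_p(5.1 / sd_ppg)        # ~0.10

print(round(sd_wins, 1), round(p_wins, 2))
print(round(sd_ppg, 1), round(p_ppg, 2))
```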

All in all, this is "suggestive evidence." I accept the conclusion as 
valid because I got consistent results with Maven.

That being said, there is a much more direct way to calculate the 
effect of inferences. Just look for cases where the move decision 
changes, and then measure the actual effect of the change as the tiles 
actually lie.

Concretely: play simulated games without inferences. Then annotate the 
same positions with inferences enabled. If the inference engine prefers 
a different move, then play it out one turn for each side. The 
difference between what happened in the inference game and what 
happened in the actual game is the "edge" from inferences in that 
situation.
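A minimal sketch of the bookkeeping for such a paired experiment. The 
engine calls themselves are out of scope here; the input is just a 
hypothetical list of (points with inference, points without inference) 
outcomes, one pair per position where enabling inferences changed the 
chosen move:

```python
import math

def inference_edge(paired_outcomes):
    """Mean per-situation edge from inferences, plus its standard error.

    paired_outcomes: list of (points_with_inference,
    points_without_inference) tuples, one per position where the
    inference engine preferred a different move.
    """
    diffs = [w - wo for (w, wo) in paired_outcomes]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean, math.sqrt(var / n)

# Toy data: three positions where the move changed.
edge, se = inference_edge([(412, 405), (388, 391), (420, 410)])
```

Because each pair shares the same position and tile placement, the 
noise from everything except the changed move cancels out of the 
difference.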

This methodology results in more precise measurement of the effect of 
inferences because:

   1) You can actually see how often (or how rarely) the move changes.

   2) You can see the situations that cause a move to change, gaining 
insight that will help you to debug or extend the inference engine.

   3) You can see a numerical score for each inference: what did you 
gain (or lose) by following the inference.

   4) You have statistical isolation of the inference situation without 
noise from other moves. To get a sense for the advantage here: the 
standard deviation of two moves is about 30 points, versus 100 for a 
game. So you can reach conclusions with less than 1/9 as many games. 
(You pay for this by playing each game twice, but it is still a 
bargain.)
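The sample-size claim in point 4 follows from variances rather than 
standard deviations; a quick check, using the s.d. figures assumed 
above:

```python
sd_game = 100.0   # assumed s.d. of a full game's point swing
sd_pair = 30.0    # assumed s.d. of a two-move playout

# Required sample size scales with variance, so the savings factor is
# the squared ratio of standard deviations.
savings = (sd_game / sd_pair) ** 2   # ~11.1x fewer trials

# Paying twice per position (one playout per condition) still leaves
# a healthy net advantage.
net_savings = savings / 2            # ~5.6x
```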

Sapphire Brand

