> the Inference Player won 51.4% against Strong Player.

This result is comparable to what I observed in Maven vs Maven games where one side used inferences.
In the Toronto Open, Maven played the first day using inferences, but I disabled them for the subsequent games. I actually do not know whether it was better off using inferences or not, but I have always had doubts about inferences when playing against humans, and the first day's games did little to encourage me.

The first problem with inferences against humans is that humans don't play "well" according to the computer model. Maven's static evaluation produces an ordering that correlates very highly with a simulator's ordering. That ordering is good enough that we can make sound inferences with very high degrees of confidence. My tests show that this is not the case against humans, even when the humans are clearly championship caliber. When the humans are weaker, it is hard to draw any conclusions at all. Just to be clear: I do not know whether using inferences results in stronger play when computers play humans. I do know that the posterior probability distributions become more dispersed, which reduces the value of inferences.

The second problem with inferences is that they cause your program to close the board. This is one of those annoying, unexpected side effects that gets you just when you think you have made a big improvement. The reason is that the opponent's racks are better than random, so the simulation tries to counteract that. Against a computer program this is probably good strategy. Quackle may even be particularly vulnerable to inferences (unsupported, but plausible, speculation here) because of its proclivity for fishing. But against a human, any policy that closes the board will have a downside that probably cannot be made up by gaining a piddly 5 points.

Technical notes: the 630-game match that resulted in an 18-game edge for the inference player is not statistically significant at the conventional p < 0.05 level. The inference player won only 9 games more than 50%, and the standard deviation of a 630-game match is 12.5 games.
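The standard-deviation argument is easy to check. A minimal sketch using only the numbers quoted (630 games, a 51.4% win rate, i.e. roughly 324 wins to 306):

```python
import math

# Numbers quoted above: 630 games, inference player won 51.4% (324-306).
games = 630
wins = 324
edge = wins - games / 2  # 9 games above an even split

# Standard deviation of a fair coin-flip match: sqrt(n * p * (1 - p))
sd = math.sqrt(games * 0.5 * 0.5)  # about 12.5 games

z = edge / sd  # about 0.72 -- less than one standard deviation
```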
So the edge was not even a single standard deviation.

On the points-per-game calculation: they don't say what the standard deviation of the per-game point difference is, but I have usually seen an s.d. of about 100 points in computer games, so let's assume that. The s.d. of the mean over 630 games would then be around 4 points, so the 5.1 ppg edge is significant at p = 0.10. All in all, this is "suggestive evidence." I accept the conclusion as valid because I got consistent results with Maven.

That said, there is a much more direct way to measure the effect of inferences: look for cases where the move decision changes, and then measure the actual effect of the change as the tiles actually lie. Concretely: play simulated games without inferences. Then annotate the same positions with inferences enabled. If the inference engine prefers a different move, play it out one turn for each side. The difference between what happened in the inference game and what happened in the actual game is the "edge" from inferences in that situation.

This methodology measures the effect of inferences more precisely because:

1) You can actually see how often (or how rarely) the move changes.

2) You can see the situations that cause a move to change, gaining insight that will help you debug or extend the inference engine.

3) You get a numerical score for each inference: what you gained (or lost) by following it.

4) You have statistical isolation of the inference situation, without noise from other moves.

To get a sense of the advantage here: the standard deviation of a two-move playout is about 30 points, versus 100 for a whole game, so you can reach conclusions with less than 1/9 as many games. (You pay for this by playing each game twice, but it is still a bargain.)
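The points-per-game arithmetic and the sample-size claim can be checked the same way. A sketch, under the stated assumptions of a 100-point per-game s.d. and a 30-point s.d. for a two-move playout:

```python
import math

games = 630
sd_game = 100.0  # assumed s.d. of the per-game point difference

# S.d. of the mean point difference over the whole match: about 4 points.
sd_mean = sd_game / math.sqrt(games)

# Observed edge of 5.1 points per game, as a one-sided normal tail test.
z = 5.1 / sd_mean
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # about 0.10

# Paired move-level measurement: a two-move playout has an s.d. of about
# 30 points, so the number of games needed shrinks by (100 / 30)^2,
# roughly a factor of 11 -- consistent with "less than 1/9 as many games".
reduction = (sd_game / 30.0) ** 2
```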
