The UCT heuristic of trying every child of a node once before trying any child twice is reasonable when the payoff distribution is unknown. Why try the lever that paid $5 a second time if there might be another lever that pays $1,000,000? But when the set of possible payoffs is known to be {1, -1} -- you either win or you lose -- it makes no sense to abandon a move with a perfect winning record. Suppose there are 1000 candidate moves -- if the first move you try happens to win, are you going to waste the next 999 playouts on moves that (because you only try them once) can't possibly yield candidates with better records than the one you just found? The repeat-winners heuristic (keep trying a move until it loses) allows you to find exceptionally promising moves without having to identify 500 mildly promising ones first.
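The repeat-winners rule can be sketched in a few lines; this is hypothetical Python of my own (the `Node` structure and `select_child` names are not from any actual engine), just to pin down "keep trying a move until it loses":

```python
class Node:
    """Hypothetical stats record for one candidate move."""
    def __init__(self, move):
        self.move = move
        self.wins = 0
        self.visits = 0

def select_child(children, untried_moves):
    """Repeat-winners selection: replay any child with a perfect
    winning record before expanding a fresh move."""
    perfect = [c for c in children if c.visits > 0 and c.wins == c.visits]
    if perfect:
        # A move that has never lost is retried until it loses.
        return max(perfect, key=lambda c: c.wins)
    if untried_moves:
        return None  # signal the caller to expand a new move instead
    # Every child has lost at least once; fall back to best win rate.
    return max(children, key=lambda c: c.wins / c.visits)
```

So after a single win the same move is replayed immediately, rather than burning the next 999 playouts on one-visit trials of the other candidates.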
It may make sense to revisit promising children that have both won and lost at least once before visiting some other children at all. In a game with an infinite branching factor this is obviously the case -- otherwise the tree would never grow past the first two levels, so you'd never advance past naive play -- and the same may hold if the branching factor is merely very high, even if you have no a priori reason to prefer one child to another. (I *think* that's a difference from iterative widening and the progressive widening/unpruning schemes. Heuristics are good; it's no surprise that a good pattern database leads to a much stronger MC player -- but using a progressive approach without heuristics requires an expansion strategy that, when the children's prior evaluations are all equal, won't just end up trying new children all the time the way the repeat-winners heuristic does, or, even worse, retry random old children without regard to their win/loss record.)

If you have already examined 100 child nodes, you'd expect to have to examine about 100 more before you find a more promising candidate (a fresh child beats the best of n examined so far with probability about 1/(n+1) when the children are exchangeable). Before doing that, it might make sense to expend some effort on reexamining the most promising candidates so far, both to sort the really good candidates from the ones that started out on a lucky streak and to sort the robust moves from the ones that only work against inferior replies. As a dead-simple heuristic, one might, while some children remain unexpanded, allocate half of the effort to reexamining the most promising previously expanded children (not counting children with perfect winning records, which would be reexamined immediately anyhow).
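One way to read "allocate half of the effort" is a simple coin flip per playout while unexpanded moves remain; the sketch below is my own interpretation (names and the dict layout are hypothetical), not a description of any real program:

```python
import random

def allocate_playout(children, untried_moves, rng=random.random):
    """Dead-simple split: while unexpanded moves remain, spend about
    half the playouts reexamining the best already-expanded child.
    `children` is a list of dicts: {"move", "wins", "visits"}."""
    # Children with perfect winning records are reexamined at once,
    # exactly as under the repeat-winners heuristic.
    perfect = [c for c in children if c["visits"] and c["wins"] == c["visits"]]
    if perfect:
        return ("reexamine", max(perfect, key=lambda c: c["wins"]))
    explored = [c for c in children if c["visits"]]
    # Coin flip: half the effort expands new moves, half reexamines.
    if untried_moves and (not explored or rng() < 0.5):
        return ("expand", untried_moves[0])
    if explored:
        return ("reexamine", max(explored, key=lambda c: c["wins"] / c["visits"]))
    return ("expand", untried_moves[0])
```

Any schedule that splits effort roughly evenly would do; the coin flip is just the least stateful version.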
This approach only doubles the number of playouts needed before all children are explored (to repeat myself, I don't consider the children properly explored until each child has lost at least once), but it can lead to a less trivial and hopefully smarter tree topology -- in short, it is more resistant than the repeat-winners heuristic to pathological behavior such as never expanding any part of the tree past depth 2 because you're too busy trying new children. Nor is the extra effort wasted once all children are explored, because the extra playouts are mostly spent on nodes that would probably be reexamined soon anyhow.

It is also easy to imagine why MC programmers in games with {1, -1} payoff sets would discover that, when selecting the best move to play, the choose-best-average-payoff heuristic is not as good as the choose-the-move-with-the-most-wins heuristic. The older Mogo paper mentioned some plausible reasons, and it's easy to imagine others. For typical distributions, a 10-2 record is better than a 6-1 record, which is better than a 1-0 record.

Finally, with the repeat-winners or progressive-widening-type heuristics (not sure what to call the latter; a quick glance at the papers suggested to me -- perhaps wrongly -- that it's not precisely either of the heuristics called progressive widening or progressive unpruning, but a more obvious and naive cousin of both), the odd/even ply bias that Jason brought up is not likely to be much of an issue (whether or not it is ever a big issue for the try-every-child-once heuristic), because the search tree becomes imbalanced so quickly. My guess is that a roughly 50/50 mix of tree leaves at even and odd depths will develop very rapidly if wins and losses are roughly equally frequent even after minimaxing.
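To make the 10-2 / 6-1 / 1-0 comparison concrete: on the {1, -1} payoff set, the average-payoff ordering of those three records is exactly the reverse of the most-wins ordering, which is the disagreement between the two selection heuristics in miniature. A quick check in plain Python:

```python
def mean_payoff(wins, losses):
    # Average payoff when a win pays 1 and a loss pays -1.
    return (wins - losses) / (wins + losses)

records = [(10, 2), (6, 1), (1, 0)]  # (wins, losses)

# Choose-best-average puts the 1-0 lucky streak on top:
by_mean = sorted(records, key=lambda r: mean_payoff(*r), reverse=True)
# -> [(1, 0), (6, 1), (10, 2)]   (means 1.0, 0.714..., 0.666...)

# Choose-most-wins ranks them the other way around:
by_wins = sorted(records, key=lambda r: r[0], reverse=True)
# -> [(10, 2), (6, 1), (1, 0)]
```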
So the normal UCT protocol makes sense with unknown sets of possible payoff values, but it also makes sense to use slightly different choice heuristics when the payoff value set is {1, -1}. Of course most of you have probably done that already, along with possibly more radical changes -- I know some of these heuristic suggestions are not new (because I copied them from this mailing list), and I would bet that none of them are.

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/