The UCT heuristic of trying every child of a node once before trying
any child twice is reasonable when the payoff distribution is unknown.
Why try the lever that paid $5 a second time if there might be another
lever that pays $1,000,000? But when the set of possible payoffs is
known to be {1, -1} -- you either win or you lose -- it makes no sense
to abandon a move with a perfect winning record. Suppose there are
1000 candidate moves -- if the first move you try happens to win, are
you going to waste the next 999 playouts on moves that (because you
only try them once) can't possibly yield candidates with better
records than the one you just found? The repeat-winners heuristic
(keep trying a move until it loses) allows you to find exceptionally
promising moves without having to identify 500 mildly promising ones
first.
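A minimal sketch of that selection rule (all names here are hypothetical; child statistics are just win/loss playout counts):

```python
def select_child(children):
    """Repeat-winners selection: keep replaying a child until it loses.

    children: list of dicts with 'wins' and 'losses' playout counts.
    Returns the index of the child to play out next.
    """
    # A tried child with a perfect winning record is replayed immediately.
    for i, c in enumerate(children):
        if c['wins'] > 0 and c['losses'] == 0:
            return i
    # Otherwise move on to an untried child, if any remain.
    untried = [i for i, c in enumerate(children)
               if c['wins'] == 0 and c['losses'] == 0]
    if untried:
        return untried[0]
    # All children have lost at least once: fall back to, say, most wins.
    return max(range(len(children)), key=lambda i: children[i]['wins'])
```

So with children recorded as [{'wins': 1, 'losses': 1}, {'wins': 3, 'losses': 0}], the second child's perfect 3-0 record means it gets replayed immediately rather than spending the playout elsewhere.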

It may make sense to revisit promising children that have both won and
lost at least once before visiting some other children at all. In a
game with an infinite branching factor, this is obviously the case --
otherwise the tree would never grow past the first two levels, so
you'd never advance past naive play -- and the same may hold if the
branching factor is just very high, even if you have no a priori
reason to prefer one child to another. (I *think* that's a difference
from iterative/progressive widening/unpruning. Heuristics are
good; it's no surprise that a good pattern database leads to a much
stronger MC player -- but using a progressive approach without
heuristics requires an expansion strategy that, when the children's
pre-evaluations are all equal, won't just end up trying new children
all the time like the repeat-winners heuristic, or, even worse,
retrying random old children without regard to their win/loss
record.) If
you have already examined 100 child nodes, then you'd expect to have
to examine about 100 more before you find a more promising candidate.
Before doing that, it might make sense to expend some effort on
reexamining the most promising candidates so far, to sort the really
good candidates from the ones that started out on a lucky streak and
to sort the robust moves from the ones that only work against inferior
replies. As a dead-simple heuristic, one might, while some
children remain unexpanded, allocate half of the effort to
reexamining the most promising previously expanded children (not
counting children with perfect winning records, which would be
reexamined immediately anyhow). This approach only doubles the number
of playouts needed before all children are explored (to repeat myself,
I don't consider the children to be properly explored until each child
has lost at least once), but it can lead to a less trivial and
hopefully smarter tree topology -- in short, it is more resistant to
pathological behavior (such as never expanding any part of the tree
past depth 2 because you're too busy trying new children) than the
repeat-winners heuristic. The extra effort is not wasted once all
children are explored, either, because the extra playouts are mostly
performed on nodes that probably would soon be reexamined anyhow.
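A sketch of that half-effort allocation (again with hypothetical names: expanded children are tracked as (wins, losses) pairs, and a coin flip stands in for "half the effort"):

```python
import random

def pick_action(children, rng=random.random):
    """Half-effort allocation while unexpanded children remain (a sketch).

    children: list of (wins, losses) pairs for already-expanded children.
    Returns 'expand' to try a new child, or an index to reexamine.
    """
    # Perfect records are replayed immediately (repeat-winners rule).
    for i, (w, l) in enumerate(children):
        if w > 0 and l == 0:
            return i
    # Half the effort goes to reexamining the best mixed-record child...
    mixed = [i for i, (w, l) in enumerate(children) if w > 0 and l > 0]
    if mixed and rng() < 0.5:
        return max(mixed, key=lambda i: children[i][0])  # e.g. most wins
    # ...and the other half to expanding a new child.
    return 'expand'
```

Ranking mixed-record children by raw win count is just one plausible stand-in for "most promising"; any of the usual UCT-style scores would slot in there.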

It is also easy to imagine why MC programmers in games with {1, -1}
payoff sets would discover that the choose-best-average-payoff
heuristic is not as good as the choose-the-move-with-the-most-wins
heuristic when selecting the best move to play. The older Mogo paper
mentioned some plausible reasons, and it's easy to imagine others. For
typical distributions, a 10-2 record is better than a 6-1 record,
which is better than a 1-0 record.
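To make the contrast concrete with the {1, -1} payoff set, where a w-l record has average payoff (w - l)/(w + l), the two heuristics rank those three records in exactly opposite orders:

```python
def avg_payoff(wins, losses):
    # Average payoff over all playouts, with the {1, -1} payoff set.
    return (wins - losses) / (wins + losses)

records = [(10, 2), (6, 1), (1, 0)]

# Most-wins ranking: 10-2 beats 6-1 beats 1-0.
by_wins = sorted(records, key=lambda r: r[0], reverse=True)

# Average-payoff ranking is the exact reverse:
# 1-0 scores 1.0, 6-1 scores 5/7, 10-2 scores 8/12.
by_avg = sorted(records, key=lambda r: avg_payoff(*r), reverse=True)
```

The two sorted orders come out as [(10, 2), (6, 1), (1, 0)] and [(1, 0), (6, 1), (10, 2)] respectively, so average payoff favors exactly the thin records that most-wins distrusts.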

Finally, with the repeat-winners or progressive-widening-type (not
sure what to call it; a quick glance at the papers suggested to me --
perhaps wrongly -- that it's not precisely either of the heuristics
called progressive widening or progressive unpruning, but a more
obvious and naive cousin of both) heuristics, the odd/even ply bias
that Jason brought up is not likely to be much of an issue (whether or
not it is ever a big issue for the try-every-child-once heuristic),
because the search tree becomes imbalanced so quickly. My guess is
that a roughly 50/50 mix of tree leaves at even and odd depths is
likely to develop very rapidly, provided wins and losses remain
roughly equally frequent even after minimaxing.

So the normal UCT protocol makes sense with unknown sets of possible
payout values, but it also makes sense to use slightly different
choice heuristics when the payout value set is {1, -1}. Of course most
of you have probably done that already, along with possibly more
radical changes -- I know some of these heuristic suggestions are not
new (because I copied them from this mailing list), and I would bet
that none of them are.
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
