The general Monte Carlo approach is:
Repeat until golden brown:
    Perform a playout, guided by the current policy
    Determine the winner
    Adjust the policy
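The loop above can be sketched in a few lines of Python. This is only an illustration on a toy one-move "game"; the Policy class, the 80%/20% payoff, and all other names here are invented for the sketch, not taken from any real Go program.

```python
import random

class Policy:
    """Tracks a win rate per move and plays apparent winners more often."""
    def __init__(self, moves):
        self.wins = {m: 1 for m in moves}     # optimistic priors
        self.plays = {m: 2 for m in moves}

    def select_move(self, epsilon=0.1):
        if random.random() < epsilon:         # some exploration thrown in
            return random.choice(list(self.wins))
        return max(self.wins, key=lambda m: self.wins[m] / self.plays[m])

    def adjust(self, move, won):
        self.plays[move] += 1
        if won:
            self.wins[move] += 1

def playout(move):
    # Toy payoff: move "a" wins 80% of the time, "b" only 20%.
    return random.random() < (0.8 if move == "a" else 0.2)

random.seed(0)
policy = Policy(["a", "b"])
for _ in range(500):                          # repeat until golden brown
    move = policy.select_move()               # playout guided by the policy
    won = playout(move)                       # determine the winner
    policy.adjust(move, won)                  # adjust the policy
```

After enough playouts the estimated win rate for "a" dominates, so the (mostly greedy) policy plays it far more often than "b".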
The policy is adjusted so that winning moves are played more often and
losing moves less often (with some exploration thrown in). To make the
most of each playout, the policy should generalize, so that a move
that has done well in one situation is also considered good in
"similar" situations. As discussed at length in our Power of
Forgetting paper, much hinges on the definition of "similar". At one
extreme, all situations are considered similar; the earliest Monte
Carlo Go work did this. At the other extreme, every situation
(complete game history) is considered unique; this is pure MCTS. In
between lie more successful approaches such as transposition tables,
RAVE, and the Last Good Reply policy.
We would also like the policy to converge to an optimal one given
sufficient time (which might be feasible at the end of a close game),
and we would like to be able to pre-initialize the policy with
domain-specific knowledge (ideally learned automatically from
self-play or recorded games).
Here's our latest thought:
Begin at the "all situations are the same" extreme: gather a win rate
for each point on the board, regardless of when it is played. This
will generate data very quickly, because each playout generates
(roughly) one datum for each point. As these piles of data expand,
split them by context. For example, maybe (from some particular real
board position) A1 is a bad move in general, but an excellent one if
B1 was the immediately preceding move. We would therefore split the
data for A1 into two piles: "B1 was the previous move" and "B1 was not
the previous move". This would continue, growing a tree in the fashion
of traditional decision-tree induction algorithms (ID3, C4.5, CART, etc.).
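A minimal sketch of the split-by-context idea, again with invented names (Pile, record, the context dictionary): each point starts as one pile of win/play counts, and a pile can later be split on a boolean context feature such as "was B1 the previous move?", routing future data down the appropriate branch in the style of decision-tree induction.

```python
class Pile:
    """Win/play counts for a move, optionally split on one boolean feature."""
    def __init__(self):
        self.wins = 0
        self.plays = 0
        self.feature = None        # e.g. "was B1 the previous move?"
        self.children = None       # {True: Pile, False: Pile} once split

    def record(self, won, context):
        """Add one playout datum, routing it through any splits."""
        if self.children is not None:
            self.children[self.feature(context)].record(won, context)
        self.plays += 1            # the parent keeps aggregate counts
        self.wins += won

    def split(self, feature):
        """Grow the tree: future data are separated by `feature`."""
        self.feature = feature
        self.children = {True: Pile(), False: Pile()}

    def win_rate(self, context):
        """Win rate at the leaf matching this context."""
        node = self
        while node.children is not None:
            node = node.children[node.feature(context)]
        return node.wins / max(node.plays, 1)

# Usage, mirroring the A1/B1 example: A1 may look bad overall but
# excellent immediately after B1.
a1 = Pile()
a1.split(lambda ctx: ctx["prev"] == "B1")
a1.record(1, {"prev": "B1"})   # a win when B1 preceded A1
a1.record(0, {"prev": "C3"})   # a loss otherwise
```

A real implementation would also need a splitting criterion (information gain, as in ID3, or Gini impurity, as in CART) to decide which feature to split on and when; that choice is left open here.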
Advantages of this (as yet untried) approach:
1) Underexplored moves get plenty of data from which to estimate their
value. These data, drawn from deep in the playouts, will fluctuate
considerably, providing useful exploration noise in much the same way
that RAVE does.
2) Given a sufficiently rich set of features on which to split, this
would converge on a perfect policy. The entire history of the game is
certainly a sufficiently rich set of features, but we could also add
patterns, atari, etc.
Disadvantages:
1) It's not immediately clear how to pre-initialize the tree to take
advantage of domain knowledge. Ideally this could be done through self-
play or examining recorded games.
Comments?
Peter Drake
http://www.lclark.edu/~drake/
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go