The general Monte Carlo approach is:

Repeat until golden brown:
        Perform a playout, guided by the current policy
        Determine the winner
        Adjust the policy
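
To make that loop concrete, here is a hedged Python sketch on a toy one-move "game" (the move names, win probabilities, step size, and weight floor are all made up for illustration; this is not any real Go engine's policy update):

```python
import random

def monte_carlo_train(moves, win_prob, iterations=2000, step=0.1, rng=None):
    """Toy version of the playout/adjust loop.

    moves: the legal moves; win_prob: a hypothetical map from move to the
    chance that playing it wins. The policy is just a weight per move:
    winning moves get their weight increased, losing moves decreased.
    """
    rng = rng or random.Random(0)
    weights = {m: 1.0 for m in moves}
    for _ in range(iterations):
        # Perform a playout, guided by the current policy
        # (here: a single weighted random move).
        move = rng.choices(moves, weights=[weights[m] for m in moves])[0]
        # Determine the winner (simulated by a coin flip here).
        won = rng.random() < win_prob[move]
        # Adjust the policy, with a floor so every move keeps some
        # exploration probability.
        weights[move] = max(0.05, weights[move] + (step if won else -step))
    return weights
```

With a 90%-winning move and a 10%-winning move, the weights quickly come to favor the former while the floor preserves some exploration.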

The policy is adjusted so that winning moves are played more often, losing moves less often (with some exploration thrown in). To make the most of each playout, the policy should generalize, so that a move that has done well in one situation should be considered good in "similar" situations. As discussed at length in our Power of Forgetting paper, much hinges on the definition of similar. At one extreme, all situations are considered similar; the earliest MC Go work did this. At the other extreme, every situation (complete game history) is considered unique; this is pure MCTS. In between lie more successful approaches like transposition tables, RAVE, and the Last Good Reply policy.
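For concreteness, the Last Good Reply idea (with the forgetting variant) fits in a few lines. The class and method names below are my own, and for simplicity this sketch treats every consecutive pair of moves as (previous move, reply) without tracking which player made which:

```python
class LastGoodReply:
    """Remember, for each previous move, the reply that last appeared
    in a winning playout, and suggest it the next time that move occurs."""

    def __init__(self):
        self.reply = {}  # previous move -> stored reply

    def suggest(self, previous_move):
        # Returns None if we have no stored reply for this move.
        return self.reply.get(previous_move)

    def update(self, move_sequence, we_won):
        # After a playout: store replies from winning playouts, and
        # (the "forgetting" variant) erase a stored reply that just
        # appeared in a losing playout.
        for prev, ours in zip(move_sequence, move_sequence[1:]):
            if we_won:
                self.reply[prev] = ours
            elif self.reply.get(prev) == ours:
                del self.reply[prev]
```

This sits between the two extremes above: the context used to judge "similarity" is exactly one previous move.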

We would also like the policy to converge on an optimal policy given sufficient time (which might be possible at the end of a close game), and for it to be possible to pre-initialize the policy with domain-specific knowledge (ideally learned automatically from self-play or recorded games).

Here's our latest thought:

Begin at the "all situations are the same" extreme: gather a win rate for each point on the board, regardless of when it is played. This will generate data very quickly, because each playout generates (roughly) one datum for each point. As these piles of data expand, split them by context. For example, maybe (from some particular real board position) A1 is a bad move, but is an excellent move if B1 was the immediately preceding move. We would therefore split the data for A1 into two piles: "B1 was the previous move" and "B1 was not the previous move". This would continue, growing a tree in the fashion of traditional decision tree induction algorithms (ID3, C4.5, CART, etc.).
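A minimal sketch of one such "pile" of data that can split on a context feature might look like the following. The feature strings are illustrative placeholders, and a real implementation would split on board features using a proper criterion (information gain, as in ID3) rather than by hand:

```python
class MoveStats:
    """Win/loss data for one point, splittable by context,
    in the style of decision tree induction."""

    def __init__(self):
        self.wins = 0
        self.plays = 0
        self.split_feature = None  # e.g. "previous move was B1"
        self.children = None       # {True: MoveStats, False: MoveStats}

    def _leaf(self, context):
        # Descend to the leaf pile matching this context.
        node = self
        while node.split_feature is not None:
            node = node.children[bool(context.get(node.split_feature))]
        return node

    def record(self, context, won):
        leaf = self._leaf(context)
        leaf.plays += 1
        leaf.wins += won

    def win_rate(self, context):
        leaf = self._leaf(context)
        # No data yet: fall back to an uninformative prior.
        return leaf.wins / leaf.plays if leaf.plays else 0.5

    def split(self, feature):
        # Start two fresh piles; a fuller version would re-partition
        # or down-weight the data gathered before the split.
        self.split_feature = feature
        self.children = {True: MoveStats(), False: MoveStats()}
```

In the A1/B1 example above, the pile for A1 would call `split("previous move was B1")` once the aggregate data justified it, after which playouts update and query the two contexts separately.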

Advantages of this (as yet untried) approach:

1) Underexplored moves get many data points on which to estimate their value. These data, being drawn from deep in playouts, will fluctuate considerably, providing useful exploration noise in much the same way that RAVE does.

2) Given a sufficiently rich set of features on which to split, this would converge on a perfect policy. The entire history of the game is certainly a sufficiently rich set of features, but we could also add patterns, atari, etc.

Disadvantages:

1) It's not immediately clear how to pre-initialize the tree to take advantage of domain knowledge. Ideally this could be done through self-play or by examining recorded games.

Comments?

Peter Drake
http://www.lclark.edu/~drake/



_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
