The general Monte Carlo approach is:
Repeat until golden brown:
    Perform a playout, guided by the current policy
    Determine the winner
    Adjust the policy
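The loop above can be sketched in a few lines of Python. This is only an illustration on a toy one-move "game"; the Policy class, the 80%/20% payoff, and all other names here are invented for the sketch, not taken from any real Go program.

```python
import random

class Policy:
    """Tracks a win rate per move and plays apparent winners more often."""
    def __init__(self, moves):
        self.wins = {m: 1 for m in moves}     # optimistic priors
        self.plays = {m: 2 for m in moves}

    def select_move(self, epsilon=0.1):
        if random.random() < epsilon:         # some exploration thrown in
            return random.choice(list(self.wins))
        return max(self.wins, key=lambda m: self.wins[m] / self.plays[m])

    def adjust(self, move, won):
        self.plays[move] += 1
        if won:
            self.wins[move] += 1

def playout(move):
    # Toy payoff: move "a" wins 80% of the time, "b" only 20%.
    return random.random() < (0.8 if move == "a" else 0.2)

random.seed(0)
policy = Policy(["a", "b"])
for _ in range(500):                          # repeat until golden brown
    move = policy.select_move()               # playout guided by the policy
    won = playout(move)                       # determine the winner
    policy.adjust(move, won)                  # adjust the policy
```

After enough playouts the estimated win rate for "a" dominates, so the (mostly greedy) policy plays it far more often than "b".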
The policy is adjusted so that winning moves are played more often and
losing moves less often (with some exploration thrown in). To make the
most of each playout, the policy should generalize, so that a move
that has done well in one situation is also considered good in
"similar" situations. As discussed at length in our Power of
Forgetting paper, much hinges on the definition of "similar". At one
extreme, all situations are considered similar; the earliest Monte
Carlo Go work did this. At the other extreme, every situation
(complete game history) is considered unique; this is pure MCTS. In
between lie more successful approaches such as transposition tables,
RAVE, and the Last Good Reply policy.
We would also like the policy to converge to an optimal one given
sufficient time (which might be feasible at the end of a close game),
and we would like to be able to pre-initialize the policy with
domain-specific knowledge (ideally learned automatically from
self-play or recorded games).
Here's our latest thought:
Begin at the "all situations are the same" extreme: gather a win rate
for each point on the board, regardless of when it is played. This
will generate data very quickly, because each playout generates
(roughly) one datum for each point. As these piles of data expand,
split them by context. For example, maybe (from some particular real
board position) A1 is a bad move in general, but an excellent one if
B1 was the immediately preceding move. We would therefore split the
data for A1 into two piles: "B1 was the previous move" and "B1 was not
the previous move". This would continue, growing a tree in the fashion
of traditional decision-tree induction algorithms (ID3, C4.5, CART, etc.).
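A minimal sketch of the split-by-context idea, again with invented names (Pile, record, the context dictionary): each point starts as one pile of win/play counts, and a pile can later be split on a boolean context feature such as "was B1 the previous move?", routing future data down the appropriate branch in the style of decision-tree induction.

```python
class Pile:
    """Win/play counts for a move, optionally split on one boolean feature."""
    def __init__(self):
        self.wins = 0
        self.plays = 0
        self.feature = None        # e.g. "was B1 the previous move?"
        self.children = None       # {True: Pile, False: Pile} once split

    def record(self, won, context):
        """Add one playout datum, routing it through any splits."""
        if self.children is not None:
            self.children[self.feature(context)].record(won, context)
        self.plays += 1            # the parent keeps aggregate counts
        self.wins += won

    def split(self, feature):
        """Grow the tree: future data are separated by `feature`."""
        self.feature = feature
        self.children = {True: Pile(), False: Pile()}

    def win_rate(self, context):
        """Win rate at the leaf matching this context."""
        node = self
        while node.children is not None:
            node = node.children[node.feature(context)]
        return node.wins / max(node.plays, 1)

# Usage, mirroring the A1/B1 example: A1 may look bad overall but
# excellent immediately after B1.
a1 = Pile()
a1.split(lambda ctx: ctx["prev"] == "B1")
a1.record(1, {"prev": "B1"})   # a win when B1 preceded A1
a1.record(0, {"prev": "C3"})   # a loss otherwise
```

A real implementation would also need a splitting criterion (information gain, as in ID3, or Gini impurity, as in CART) to decide which feature to split on and when; that choice is left open here.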
Advantages of this (as yet untried) approach:
1) Underexplored moves get plenty of data from which to estimate their
value. These data, drawn from deep in the playouts, will fluctuate
considerably, providing useful exploration noise in much the same way
that RAVE does.
2) Given a sufficiently rich set of features on which to split, this
would converge on a perfect policy. The entire history of the game is
certainly a sufficiently rich set of features, but we could also add
patterns, atari, etc.
Disadvantages:
1) It's not immediately clear how to pre-initialize the tree to take
advantage of domain knowledge. Ideally this could be done through self-
play or examining recorded games.
Comments?
Peter Drake
http://www.lclark.edu/~drake/
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go