I attempted that many years ago. It was ~2004, IIRC, before UCT/RAVE. I started from the simulated annealing model in Bruegmann's paper. Using data from that search, I tried to create decision trees along the lines you describe.
The problem was that performance was terrible. The cost of building the trees was maybe 100x the cost of simulation. I could see how I might reduce that to 10x, but overall it didn't seem promising. Then I heard about UCT and MoGo and figured I was on the wrong path. Maybe I just had the wrong implementation, or made poor decisions about the feature set used by the decision tree.

I have since done studies where I collect data from Pebbles searches and build decision trees that predict winning percentage. The trees seemed to capture some skills, but it was unclear how to use that in the search process. Note that Pebbles is a much stronger player than my simulated annealing player was, so its conclusions have less variance, and from that you get better decision tree induction.

I still think something should work in this area. I will try again after Pebbles catches up to the state of the art. (That will take a while; I have quite a backlog.)

Brian

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Peter Drake
Sent: Thursday, June 30, 2011 12:30 PM
To: [email protected]
Subject: [Computer-go] Decision trees: making the most of each playout

The general Monte Carlo approach is:

    Repeat until golden brown:
        Perform a playout, guided by the current policy
        Determine the winner
        Adjust the policy

The policy is adjusted so that winning moves are played more often, losing moves less often (with some exploration thrown in). To make the most of each playout, the policy should generalize, so that a move that has done well in one situation is also considered good in "similar" situations.

As discussed at length in our Power of Forgetting paper, much hinges on the definition of "similar". At one extreme, all situations are considered similar; the earliest MC Go work did this. At the other extreme, every situation (complete game history) is considered unique; this is pure MCTS.
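[For concreteness, here is a minimal runnable sketch of the generic playout/adjust loop described above. The game is a toy subtraction game (take 1 or 2 stones; whoever takes the last stone wins), chosen only so the example is self-contained; it is not Go, and none of the names come from any real engine.]

```python
import random

def monte_carlo(pile, playouts=5000, epsilon=0.2, seed=0):
    """Repeat until golden brown: play out, find the winner, adjust."""
    rng = random.Random(seed)
    wins, plays = {}, {}  # policy statistics, keyed by (pile, move)

    def policy_move(p):
        moves = [m for m in (1, 2) if m <= p]
        if rng.random() < epsilon:            # exploration thrown in
            return rng.choice(moves)
        # Exploit: best observed win rate so far (optimistic default).
        return max(moves,
                   key=lambda m: wins.get((p, m), 1) / (plays.get((p, m), 0) + 1))

    for _ in range(playouts):
        p, player, history = pile, 0, []
        while p > 0:                          # perform a playout
            m = policy_move(p)
            history.append((p, m, player))
            p -= m
            player = 1 - player
        winner = 1 - player                   # determine the winner
        for q, m, who in history:             # adjust the policy
            plays[(q, m)] = plays.get((q, m), 0) + 1
            if who == winner:
                wins[(q, m)] = wins.get((q, m), 0) + 1

    moves = [m for m in (1, 2) if m <= pile]
    return max(moves, key=lambda m: wins.get((pile, m), 0) / plays.get((pile, m), 1))

print(monte_carlo(4))  # -> 1 (taking 1 leaves a losing pile of 3)
```

[Note that the statistics here are keyed by (pile, move), i.e. by full state; collapsing the key to just the move would give the "all situations are similar" extreme discussed below.]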
In between lie more successful approaches like transposition tables, RAVE, and the Last Good Reply policy. We would also like the policy to converge on an optimal policy given sufficient time (which might be possible at the end of a close game), and for it to be possible to pre-initialize the policy with domain-specific knowledge (ideally learned automatically from self-play or recorded games).

Here's our latest thought: Begin at the "all situations are the same" extreme: gather a win rate for each point on the board, regardless of when it is played. This will generate data very quickly, because each playout generates (roughly) one datum for each point.

As these piles of data expand, split them by context. For example, maybe (from some particular real board position) A1 is a bad move in general, but an excellent move if B1 was the immediately preceding move. We would therefore split the data for A1 into two piles: "B1 was the previous move" and "B1 was not the previous move". This would continue, growing a tree in the fashion of traditional decision tree induction algorithms (ID3, C4.5, CART, etc.).

Advantages of this (as yet untried) approach:

1) Underexplored moves get many data on which to estimate their value. These data, being drawn from deep in playouts, will fluctuate considerably, providing useful exploration noise in the same way that RAVE does.

2) Given a sufficiently rich set of features on which to split, this would converge on a perfect policy. The entire history of the game is certainly a sufficiently rich set of features, but we could also add patterns, atari, etc.

Disadvantages:

1) It's not immediately clear how to pre-initialize the tree to take advantage of domain knowledge. Ideally this could be done through self-play or examining recorded games.

Comments?
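[A hypothetical sketch of the splitting step described above, in the style of ID3: given win/loss records for one move, each tagged with boolean context features such as "B1 was the previous move", pick the feature whose split yields the largest information gain. The feature names and the toy pile of data are illustrative, not taken from a real engine.]

```python
import math

def entropy(records):
    """Binary entropy of the win rate over (features, won) records."""
    if not records:
        return 0.0
    p = sum(won for _, won in records) / len(records)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(records, features):
    """Return the (feature, information gain) that best splits the pile."""
    base = entropy(records)
    best = None
    for f in features:
        yes = [r for r in records if r[0][f]]
        no = [r for r in records if not r[0][f]]
        remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(records)
        gain = base - remainder
        if best is None or gain > best[1]:
            best = (f, gain)
    return best

# Toy pile for move A1: it wins mostly when B1 was the previous move.
pile = [({"prev_is_B1": True,  "in_atari": False}, 1)] * 8 \
     + [({"prev_is_B1": True,  "in_atari": True},  0)] * 2 \
     + [({"prev_is_B1": False, "in_atari": False}, 0)] * 7 \
     + [({"prev_is_B1": False, "in_atari": True},  1)] * 3

feature, gain = best_split(pile, ["prev_is_B1", "in_atari"])
print(feature)  # -> prev_is_B1
```

[Applying this recursively to each resulting pile grows the tree; whether the induction cost stays below the cost of the playouts themselves is exactly the performance question raised at the top of this thread.]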
Peter Drake
http://www.lclark.edu/~drake/

_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
