If your database is composed of self-play games, then the likelihood-maximization 
policy should gain strength rapidly, and there should be a way to achieve 
asymptotic optimality. (That is, the patterns alone will play a perfect game in 
the limit.)

 

Specifically: play self-play games using an asymptotically strong search 
policy, such as UCT, and then add all of the moves played by the winning player 
to the training set. (Never train on the moves played by the loser if your goal 
is asymptotic optimality.)
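
A minimal sketch of that loop (Python; Game, uct_search, and the policy 
object's methods are placeholders rather than any particular library):

    # Rough sketch of one self-play iteration; only the winner's moves are kept.
    def self_play_iteration(policy, num_games, playouts_per_move):
        winner_moves = []                        # training set: winner's moves only
        for _ in range(num_games):
            game = Game()
            history = []                         # (position, move, side_to_move)
            while not game.is_over():
                move = uct_search(game, policy, playouts_per_move)
                history.append((game.position(), move, game.to_play()))
                game.play(move)
            winner = game.winner()
            winner_moves.extend((pos, mv) for pos, mv, side in history
                                if side == winner)   # never the loser's moves
        policy.fit(winner_moves)                 # ordinary likelihood maximization
        return policy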

 

The intuition for why this results in strong play: at any moment in self-play 
training, the search engine is working on situations that it believes are most 
equal for both sides. Then it turns out that one side is better and wins, and 
that data is added to the training set. Reinforcement learning then biases the 
pattern base to make the winning pathway more likely in future searches, so the 
winning alternatives will be easier to see in future self-play games. 
Inevitably the losing player will prefer a different path, and the learning 
process continues.
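
A toy illustration of that bias (notation made up here, not from any paper): 
one log-likelihood gradient step on a move the winner played raises that move's 
probability under a softmax over per-move scores.

    import math

    def softmax(scores):
        m = max(scores.values())
        exps = {mv: math.exp(s - m) for mv, s in scores.items()}
        z = sum(exps.values())
        return {mv: e / z for mv, e in exps.items()}

    def update_on_winner_move(scores, winner_move, lr=0.1):
        probs = softmax(scores)
        # gradient of log p(winner_move) w.r.t. score[mv] is 1[mv == winner_move] - p(mv)
        return {mv: s + lr * ((mv == winner_move) - probs[mv])
                for mv, s in scores.items()}

    scores = {'A': 0.0, 'B': 0.0, 'C': 0.0}
    print(softmax(scores)['B'])                               # 1/3 before the step
    print(softmax(update_on_winner_move(scores, 'B'))['B'])   # larger than 1/3 after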

 

Regarding asymptotic optimality: obviously you need a pattern set that could 
potentially be asymptotically optimal, for example a general decision tree 
rather than any fixed field of view. If you have such a thing, then the proof 
of asymptotic optimality goes along these lines: first show that the 
combination of tree search and pattern base will eventually memorize the 
winning moves at terminal nodes, and then apply the same reasoning recursively.

 

Regarding trying this: I was just starting to code this when AlphaGo debuted, 
and then I stopped working on Go. So: no, I have not tried this.

 

 

From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Álvaro Begué
Sent: Saturday, February 11, 2017 11:44 PM
To: computer-go <computer-go@computer-go.org>
Subject: [Computer-go] Playout policy optimization

 

Hi,

 

I remember an old paper by Rémi Coulom ("Computing Elo Ratings of Move Patterns 
in the Game of Go") where he computed "gammas" (exponentials of scores that you 
could feed to a softmax) for different move features, which he fit to best 
explain the move probabilities from real games.
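
If I remember the model correctly (notation made up here): each move's strength 
is the product of the gammas of its features, with gamma_f = exp(score_f), 
normalized over the legal moves, so fitting the scores is likelihood 
maximization under a softmax.

    import math

    def move_probs(move_features, scores):
        # move_features: dict move -> list of feature names present for that move
        # scores: dict feature name -> real-valued score (gamma = exp(score))
        strengths = {mv: math.exp(sum(scores[f] for f in feats))
                     for mv, feats in move_features.items()}
        z = sum(strengths.values())
        return {mv: s / z for mv, s in strengths.items()}

    # Example with made-up features: the move whose features have larger gammas
    # gets the larger share of the probability.
    scores = {'capture': 2.0, 'atari_escape': 1.2, 'edge': -0.5}
    candidates = {'A': ['capture'], 'B': ['atari_escape', 'edge']}
    print(move_probs(candidates, scores))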

 

Similarly, AlphaGo's paper describes how their rollout policy's weights are 
trained to maximize the log likelihood of moves from a database.

 

However, there is no a priori reason why imitating the probabilities observed 
in reference games should be optimal for this particular purpose.

 

I thought about this for about an hour this morning, and this is what I came up 
with. You could make a database of positions with a label indicating the result 
(perhaps from real games, perhaps similarly to how AlphaGo trained their value 
network). Loop over the positions, run a few playouts, and tweak the move 
probabilities with some sort of reinforcement learning: promote the move 
choices from playouts whose outcome matches the label, and discourage the move 
choices from playouts whose outcome does not.
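
A rough sketch of that procedure (run_playout and adjust are made-up method 
names, just to fix ideas):

    def train_playout_policy(positions, policy, playouts_per_position=8, lr=0.01):
        # positions: list of (position, label), where label is the known result
        for position, label in positions:
            for _ in range(playouts_per_position):
                moves, outcome = policy.run_playout(position)   # hypothetical API
                sign = +1.0 if outcome == label else -1.0
                for pos, move in moves:
                    # promote or discourage this choice with a small score update
                    policy.adjust(pos, move, sign * lr)
        return policy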

 

The point is that we would be pushing our playout policy to produce good 
estimates of the result of the game, which in the end is what playout policies 
are for.

 

Any thoughts? Did anyone actually try something like this?

 

Cheers,

Álvaro.

 

 

 

_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
