DeepMind's paper on AlphaGo Zero (AGZ) gave an unexpected and beautiful answer 
to the question I posted here, somewhat confusedly, a while ago: 


http://computer-go.org/pipermail/computer-go/2017-January/009828.html 
http://computer-go.org/pipermail/computer-go/2017-February/009894.html 

  

==> MCTS is indeed a powerful policy improvement operator (for go at least). 
It sounds like you can lift yourself up by pulling on your shoelaces, as we say 
in France. 


What strikes me, however, is one aspect of the "from scratch" feature of this 
MCTS policy improvement by self-play, for the following reasons: 

  

If I understood it correctly, the workhorse of this is: 

  

    Pull (p,v) towards (P,V) = MCTS(p,v,N), 

  

where (P,V) are the refined policy and value estimations resulting from an MCTS 
search done with some exploration budget N (e.g. 1600 nodes, as in the DeepMind 
paper). If p and v already contain some go knowledge, it is conceivable that 
MCTS(p,v,N) will provide a better target for improving (p,v). We know that MCTS, 
even with pure random roll-outs, improves over random play. 
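
To make the target concrete, here is a minimal NumPy sketch of how I read this 
update: the root visit counts of the search are turned into an improved policy 
target P, and (p,v) is pulled towards it with a cross-entropy plus squared-error 
loss. Note that in the actual paper the value target is the final game outcome z 
rather than the MCTS root value; the visit counts and temperature tau below are 
made-up illustrative numbers, not anything reported by DeepMind.

    import numpy as np

    def mcts_targets(visit_counts, tau=1.0):
        # Turn root visit counts N(s,a) into the improved policy target P,
        # proportional to N(s,a)^(1/tau).
        n = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / tau)
        return n / n.sum()

    def pull_towards(p, v, P, z):
        # "Pull (p,v) towards (P,V)": cross-entropy between the network policy
        # p and the search policy P, plus squared error between the network
        # value v and the value target z.
        policy_loss = -np.sum(P * np.log(p + 1e-12))
        value_loss = (z - v) ** 2
        return policy_loss + value_loss

    # Toy usage with 4 legal moves and made-up visit counts 10/600/30/960:
    P = mcts_targets([10, 600, 30, 960])
    print(P)                                        # improved policy target
    print(pull_towards(np.full(4, 0.25), 0.0, P, z=+1.0))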

  

But, with a randomly initialized (p,v), in the initial phase of learning, 
(P,V)=MCTS(p,v,N) stays completely agnostic of the goal of the game until at 
least one leaf node reaches a terminal position. In this starting phase, the 
learning 'signal' can only come from terminal leaf positions, for which the 
evaluation by the value network is replaced by ground truth = the game outcome. 

  

Without terminal positions among the leaf nodes of the tree, the MCTS operator 
can definitely improve nothing from scratch. 
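
In code, the asymmetry I have in mind looks like this (a sketch with 
hypothetical names, not DeepMind's implementation): only a terminal leaf injects 
ground truth; every other leaf is scored by a value network that is pure noise 
at the start of training.

    def evaluate_leaf(state, value_net):
        # A terminal leaf is scored by the actual game result under
        # Tromp-Taylor rules: the only ground truth available.
        if state.is_terminal():
            return state.outcome()      # +1 or -1
        # Any other leaf is scored by the value network, which with random
        # initialization carries no information about the goal of the game.
        return value_net(state)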

  

1) Getting some signal for the value network: 

  

Now, consider the 1600-node exploration budget reported by DeepMind. An 
SL-trained policy network will distribute probability mass sparsely over a 
restricted number of moves, and a 1600-node MCTS tree might be quite deep. 
Conversely, a randomly initialized policy will (a priori, in my intuition) 
distribute probability over a much larger number of moves at each level of the 
tree. Consequently, a 1600-node tree will probably be rather wide and not so 
deep. Let's assume it is on average no deeper than 40 (maybe I'm wrong here, but 
you could replace that with 80 and it wouldn't change my point). 

  

Since "AlphaGo Zero uses Tromp–Taylor scoring during MCTS simulations and 
self-play training", game lentgh might be quite long, much longer than the 
standard 200-300 moves of human. The paper indeed mentions: "Games terminate 
when both players pass or after 19 × 19 × 2 = 722 moves" (i.e. allowing many 
captures and recaptures). Let's assume a conservative 400 moves average game 
length. 

  

Assuming a uniform sampling of positions (as in the paper), only positions 
within the last ~40 moves of a game can put a terminal position inside their 
search tree, i.e. roughly 10% of a 400-move game. This means that the trees 
developed from 90% of the sampled positions will contain no terminal positions 
and will thus be completely useless for bootstrapping (just pseudo-random noise 
with respect to the goal of the game). And for (P,V)=MCTS(p,v,N) to be an 
enhanced target over (p,v), the search tree should preferably contain not just 
one but a substantial number of terminal positions, which may restrict to a few 
percent the proportion of positions with a significant 'signal' on which to 
bootstrap (p,v). 
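
For the record, the back-of-the-envelope arithmetic behind that 90% figure, 
with my assumed numbers (these are my guesses, not values from the paper):

    avg_game_length = 400   # assumed average self-play game length, in moves
    max_tree_depth = 40     # assumed maximum depth of a 1600-node MCTS tree

    # Only positions within max_tree_depth moves of the end of the game can
    # possibly have a terminal position among their leaf nodes.
    useful = max_tree_depth / avg_game_length
    print(f"positions that can reach a terminal leaf: {useful:.0%}")      # 10%
    print(f"positions with no terminal leaf at all:   {1 - useful:.0%}")  # 90%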

  

2) Getting some contrast to learn a policy: 

  

Bootstrapping a policy network seems challenging, even with an embryo of a 
value network. Between strong players, if no major mistake happens, the game 
can stay balanced all the way through and the final score might be close to 
zero (e.g. -3.5 or +5 points), so that moves matter until the very end and may 
drastically influence the outcome (+1/-1). 

Conversely, in AGZ's early stage, since the games are self-played by a randomly 
initialized policy, they amount more or less to a random walk and should result 
in high variance in the scores. Most of the positions close to the endgame (the 
ones with "high signal content") will be one-sided in terms of outcome, i.e. 
from such a position, most terminal positions reached will have the same value, 
either +1 or -1. From the root position, most moves will then appear to have 
the same action value. There will be little "contrast" between possible moves 
until the value network improves substantially. 
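
A crude toy model of why I expect the scores to be one-sided (this is purely my 
own assumption, not anything from the paper): treat each of the 361 points as 
ending up black or white with probability 1/2, independently, under random 
play. The Tromp-Taylor margin is then a sum of 361 +/-1 terms, so its typical 
magnitude is around sqrt(361) = 19 points: most final scores are far from zero, 
which is consistent with positions near the end being one-sided.

    import numpy as np

    rng = np.random.default_rng(0)
    n_games, n_points = 20_000, 361

    # Toy model: each point independently ends up black (+1) or white (-1).
    margins = rng.choice([-1, 1], size=(n_games, n_points)).sum(axis=1)

    print("mean |margin|:", np.abs(margins).mean())        # about 15 points
    print("share of games decided by 10+ points:",
          (np.abs(margins) >= 10).mean())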

  

In brief, only a small fraction of the sampled positions will find some endgame 
signal through a 1600-node budget MCTS search, and even that fraction will be 
quite noisy.  E pur si muove (and yet it moves)! 

  

I had thought before about the problem of learning go from scratch and had 
imagined, for bootstrapping the (p,v) policy and value NNs from scratch, 
something along these lines (a rough code sketch follows the list): 

1) Generate a batch of self-play games played to the end (with a random player; 
good enough for a start) and score them with Tromp-Taylor rules; determine the 
outcome z = +1/-1; 
2) Train the value network v on these terminal positions s to predict the 
outcome of the game, up to some accuracy. This would produce a value network 
specialized in final endgame positions, but it would distill some first go 
knowledge into that network; 
3) Generate new games by repeating 1); 
4) Train (p,v) using the AGZ-like MCTS learning process, but restrict sampling 
to positions near the endgame (say, fewer than 20 moves before the end of the 
game), so that a significant proportion of leaf nodes are terminal positions 
and thus receive ground-truth evaluation. This would teach the value network to 
generalize a bit, to almost-finished positions, and would allow deriving an 
embryo of a policy network, also specialized in the endgame. 

Then you can consider starting to pull on your shoelaces: 
5) Repeat 3) & 4) while gradually widening the window of endgame positions 
towards earlier positions in the game; 
6) Once some criterion is met (e.g. prediction accuracy), replace random play 
in 3) with the learned policy p and repeat 3) & 4) until you plateau. 
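
Schematically, as a Python sketch of the steps above (all the helpers passed in 
as arguments are hypothetical placeholders for my own scheme; none of this is 
from the AGZ paper):

    def bootstrap_from_endgame(p, v,
                               random_selfplay,  # steps 1/3: random games, Tromp-Taylor scored
                               selfplay,         # step 6: games played with the learned policy p
                               fit_value,        # step 2: supervised fit of v on (position, outcome) pairs
                               sample_endgame,   # yields (position, outcome) within `window` moves of the end
                               agz_mcts_update,  # step 4: one AGZ-like MCTS update of (p, v)
                               value_accuracy,   # step 6: held-out prediction accuracy of v
                               n_rounds=10, window=20, widen=20):
        # Steps 1-2: random self-play, then fit v on terminal positions only.
        games = random_selfplay()
        fit_value(v, sample_endgame(games, window=0))  # window=0 -> terminal positions

        use_policy = False
        for _ in range(n_rounds):
            # Step 3: fresh games (random play at first, learned policy later).
            games = selfplay(p) if use_policy else random_selfplay()

            # Step 4: AGZ-like updates, restricted to near-endgame positions so
            # that many leaf nodes are terminal and receive ground-truth values.
            for pos, z in sample_endgame(games, window=window):
                agz_mcts_update(p, v, pos, z, budget=1600)

            # Step 5: gradually widen the window towards earlier positions.
            window += widen
            # Step 6: switch to the learned policy once v predicts well enough.
            if value_accuracy(v) > 0.9:
                use_policy = True
        return p, v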

  

Of course, the foregoing is just wrong, or at least not mandatory, as 
demonstrated by DeepMind's result. The initial focus on endgame positions can 
simply be dispensed with. At least, DeepMind's paper does not mention applying 
any sampling bias towards endgame positions in the early stage of the learning 
process. 

  

And their graph of Elo vs. training time (Fig. 3) shows absolutely no ignition 
delay: playing strength increases from the very first hours. 

  

Still mysterious to me. But what a wonderful result! 

  

Regards, 
Patrick 