Hi, 

  

Generating self-play games represents most of the computational burden in the LZ 
project. With the current setting, games are generated with a budget of 1000 
nodes/move. As a rough guide, assuming a 250-move game length and ignoring 
resignation and possible tree reuse, generating a game, and thus a training 
sample, costs ~250,000 nodes of computation. 
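
As a back-of-the-envelope check (the figures below are the assumptions stated 
above, not measurements):

    # Rough cost of one training sample under the current LZ setting.
    # Assumptions from above: 250 moves/game, 1000 nodes/move, no resign,
    # no tree reuse, one training sample per game.
    moves_per_game = 250
    nodes_per_move = 1000                    # Np, the Player's per-move budget
    nodes_per_sample = moves_per_game * nodes_per_move
    print(nodes_per_sample)                  # 250000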

I am wondering if one can get more juice from this computation budget. 

* The Apprentice (the policy head of the network) has some current playing 
strength (e.g. if used as a greedy player). 
* MCTS guided by the Apprentice (policy and value heads) makes a stronger 
Player, which is used to generate the games from which training positions 
will be sampled. In LZ, games are played by the Player with a budget of 
Np = 1000 nodes (Nd) per move. 
* For each training sample, the target for imitation learning of the Apprentice 
(policy head) is estimated by an Expert, stronger than the Apprentice. The 
Expert also evaluates a position with a guided MCTS search, using a budget of 
Ne nodes. In the current AGZ/AZ/LZ setting, this Expert is just the Player 
itself, i.e. training targets derive from the game-playing data, namely the 
root visit counts at the sampled position. Ne = Np thus comes for free (see 
the sketch after this list). 
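
To make the split concrete, here is a minimal sketch of how targets could be 
built from a finished game; the types and the `search` callable are 
placeholders of mine, not actual LZ code:

    from typing import Callable, Dict, List, Tuple

    # Schematic types: Position and Visits stand for whatever the engine uses.
    Position = object
    Visits = Dict[str, int]   # move -> visit count at the root

    def make_targets(game: List[Tuple[Position, Visits]],
                     outcome: float,
                     search: Callable[[Position, int], Visits],
                     Np: int, Ne: int,
                     sample_indices: List[int]):
        """Turn one finished self-play game into (position, policy target, z) tuples.

        `game` holds the positions and the Player's root visit counts (searched
        at Np nodes/move). If Ne > Np, each sampled position is re-searched with
        the larger budget, so the target comes from an Expert stronger than the
        Player; if Ne == Np, the Player's own visits are reused (current LZ).
        """
        samples = []
        for i in sample_indices:
            pos, player_visits = game[i]
            visits = player_visits if Ne == Np else search(pos, Ne)
            total = sum(visits.values())
            target = {move: n / total for move, n in visits.items()}
            samples.append((pos, target, outcome))
        return samples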

At some point in the training, when the strength curve is plateauing, and 
leaving aside increasing the network size once more, one might consider: 
(1) sampling training examples from games self-played by a stronger Player, by 
increasing the node budget per move Np (e.g. 1000 Nd --> 2000 Nd); this would 
slow down sample generation proportionally. 
(2) keeping Np unchanged and only increasing the strength of the Expert. This 
could be done by pushing the MCTS search of each sampled position to a higher 
node budget Ne > Np. For example, still self-playing games at 1000 Nd/move but 
evaluating sampled positions with a 25000 Nd budget. That is an extra cost of 
+10% on overall computation time, for (presumably) better targets, resulting in 
(possibly, yet to be proved) a higher ELO gain per sample; moving from 1000 Nd 
to 5000 Nd targets would cost 'only' +2% overall (see the cost sketch below). 
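
A quick check of those percentages, under the same assumptions as above 
(250 moves/game, Np = 1000, one sampled position per game); the function name 
is mine, purely for illustration:

    # Relative overhead of option (2): extra Expert nodes vs. nodes spent playing.
    def expert_overhead(Ne, Np=1000, moves=250, samples_per_game=1):
        game_cost = moves * Np                # nodes spent self-playing the game
        expert_cost = samples_per_game * Ne   # nodes spent re-searching the samples
        return expert_cost / game_cost

    print(expert_overhead(25000))   # 0.10 -> +10% overall cost
    print(expert_overhead(5000))    # 0.02 -> +2% overall cost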

The bottom line is of course the return on investment: would the speedup in 
ELO gain per sample, if any, outweigh the ~10% slowdown in the game generation 
rate? What matters more: generating games at a higher rate, increasing Player 
strength, or increasing Expert strength? Is there a sweet spot? This may vary 
a lot depending on where we stand on the learning curve. 

I am fully aware that the AGZ/AZ papers conclusively show that superhuman level 
can be reached with a (quite) low, constant node/move budget and the 
Expert = Player setting now used by LZ, but with tremendous computation power 
on DeepMind's side. To reach ~5200 ELO, 75% of the AGZ training run (30 of 40 
days) occurred in an asymptotic region above 4500 ELO: 3 days to reach AlphaGo 
Lee level from scratch, but 27 extra days to reach AlphaGo Master level. If the 
goal of the project is to replicate AGZ/AZ up to Master level, plateauing will 
be our daily bread for a very long time. Hence my question. 

I didn't find any indication or attempt in the planning-based RL literature of 
a split between Apprentice / Player (or Actor) / Expert (or Critic), only 
extreme cases such as: 

- In "Thinking Fast and Slow with Deep Learning and Tree Search" ( 
https://arxiv.org/pdf/1705.08439.pdf ), Np=1 (greedy policy or stochastic 
policy), Player = Apprentice 
- In AlphaGo Zero or Alpha Zero papers, Ne=Np, tree search data from played 
games are re-used as Expert's evaluation (i.e. Expert = Player). 

A major drawback of this split, with the current LZ distributed scheme (if I 
got it right), is that the sampling of positions would have to be decided at 
the worker level once the game has ended, and an extra MCTS search would have 
to be run locally on the worker with the increased Ne node budget before 
pushing the data to the training server, roughly as sketched below. 
Alternatively, the improved targets would have to be generated a posteriori, in 
a centralized manner on the training server, which might represent an 
unacceptable burden. 
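
For illustration, a minimal sketch of what one worker iteration could look like 
under that split; play_selfplay_game, mcts_search and upload are hypothetical 
placeholders, not the actual autogtp/LZ client API:

    import random

    def worker_iteration(net, Np=1000, Ne=5000, samples_per_game=1):
        # 1. Normal self-play at Np nodes/move (unchanged from current LZ).
        game, outcome = play_selfplay_game(net, nodes_per_move=Np)

        # 2. Once the game has ended, sampling is decided locally on the worker.
        sampled = random.sample(range(len(game)), samples_per_game)

        # 3. Extra MCTS searches at the larger Ne budget, still on the worker,
        #    so only the improved targets need to be pushed upstream.
        records = []
        for i in sampled:
            pos, _player_visits = game[i]
            expert_visits = mcts_search(net, pos, nodes=Ne)
            records.append((pos, expert_visits, outcome))

        # 4. Upload the (position, Expert target, outcome) tuples to the server.
        upload(records)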

Regards, 
Patrick 