Hi, I might be having trouble understanding the self-play policy for AlphaGo Zero. Can someone let me know if I'm on the right track here?
The paper states: "In each position s, an MCTS search is executed, guided by the neural network f_θ. The MCTS search outputs probabilities π of playing each move."

This wasn't clear at first, since MCTS outputs win statistics and visit counts rather than probabilities, but the paper explains further: "MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter."

That makes sense, but when I looked for the schedule for decaying the temperature, all I found was the following in the Self-play section of Methods: "For the first 30 moves of each game, the temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal temperature is used, τ → 0."

This sounds like they sample moves proportionally to visit counts for the first 30 moves, since τ = 1 makes the exponent disappear, and after that they play the most-visited move: in

    π(a | s_0) = N(s_0, a)^(1/τ) / Σ_b N(s_0, b)^(1/τ),

the probability of the move with the most visits goes to 1 and the probability of every other move goes to 0 as τ goes to 0 from the right.

Am I understanding this correctly? I am confused because it seems a little convoluted to define such a simple policy in terms of a temperature. When they mentioned temperature, I was expecting something that slowly decays over the course of the game rather than taking only two trivial values. Thanks!
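For what it's worth, here is a minimal sketch of the conversion I think the paper describes, with the τ → 0 limit handled as a one-hot argmax rather than by actually raising counts to a huge power. The function name `search_probabilities` and the toy visit counts are mine, not from the paper:

```python
import numpy as np

def search_probabilities(visit_counts, tau):
    """Turn root visit counts N(s, a) into search probabilities pi.

    pi_a is proportional to N(s, a)**(1/tau); tau == 0 is treated as
    the limit tau -> 0+, i.e. all mass on the most-visited move.
    (Sketch only; names and example numbers are my own.)
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0  # greedy: most-visited move gets probability 1
        return pi
    exponentiated = counts ** (1.0 / tau)
    return exponentiated / exponentiated.sum()

counts = [10, 30, 60]
print(search_probabilities(counts, tau=1.0))  # proportional to visits: [0.1 0.3 0.6]
print(search_probabilities(counts, tau=0))    # greedy limit: [0. 0. 1.]
```

With τ = 1 the exponent is 1, so this is exactly "sample proportionally to visits"; with τ = 0 it is exactly "play the most-visited move", matching the two regimes in the Methods section.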
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go