Hi, I might be having trouble understanding the self-play policy for
AlphaGo Zero. Can someone let me know if I'm on the right track here?

The paper states:

In each position s, an MCTS search is executed, guided by the neural
network f_θ. The MCTS search outputs probabilities π of playing each move.


This wasn't clear to me at first, since MCTS outputs win counts and visit
counts rather than probabilities, but the paper explains further later on:

MCTS may be viewed as a self-play algorithm that, given neural network
parameters θ and a root position s, computes a vector of search
probabilities recommending moves to play, π = α_θ(s), proportional to the
exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is a
temperature parameter.
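
In code, I read α_θ(s) as running the MCTS search, collecting the root
visit counts N(s, a), and then doing something like the sketch below
(purely my own illustration; the function and variable names are mine, not
from the paper):

    import numpy as np

    def search_probabilities(visit_counts, tau):
        # pi_a proportional to N(s, a)^(1/tau), normalised over the moves.
        # Both names here are made up for this sketch.
        counts = np.asarray(visit_counts, dtype=np.float64)
        exponentiated = counts ** (1.0 / tau)
        return exponentiated / exponentiated.sum()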


So this makes sense, but when I looked for the schedule for decaying the
temperature all I found was the following in the Self-play section of
Methods:


For the first 30 moves of each game, the temperature is set to τ = 1; this
selects moves proportionally to their visit count in MCTS, and ensures a
diverse set of positions are encountered. For the remainder of the game, an
infinitesimal temperature is used, τ → 0.

This sounds like they are sampling proportionally to visit counts for the
first 30 moves, since τ = 1 makes the exponent go away, and after that they
are playing the move with the most visits: in the expression
π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ), the probability of the
most-visited move goes to 1 and the probability of every other move goes to
0 as τ goes to 0 from the right.
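
To convince myself, I did a quick numerical check with made-up visit
counts, using a small τ to stand in for the τ → 0 limit:

    import numpy as np

    visits = np.array([10.0, 25.0, 5.0, 60.0])  # made-up root visit counts N(s_0, b)

    for tau in (1.0, 0.05):                     # 0.05 standing in for tau -> 0
        pi = visits ** (1.0 / tau)
        pi /= pi.sum()
        print(tau, pi)

    # At tau = 1 this is exactly visits / visits.sum(), so sampling from pi
    # picks moves proportionally to their visit counts.
    # At tau = 0.05 nearly all the probability mass already sits on the
    # most-visited move, i.e. in the limit it is just argmax over N(s_0, b).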

Am I understanding this correctly? I am confused because it seems a little
convoluted to define this simple policy in terms of a temperature. When
they mentioned a temperature, I was expecting something that decays slowly
over time rather than a parameter that only ever takes two trivial values.

Thanks!
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
