Your understanding matches mine. My guess is that they had a temperature
parameter in the code that would allow for things like slowly transitioning
from random sampling to deterministically picking the maximum, but they
ended up using only those particular values.
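
To make that concrete, here is a minimal sketch of what such a temperature
parameter might look like, assuming the formula from the paper,
π_a ∝ N(s, a)^(1/τ). This is only my guess at the shape of it, not their
code: the names search_probabilities, select_move and exploration_moves are
made up for illustration, tau = 0 is used as a stand-in for the limit τ → 0,
and I am assuming move numbers start at 0.

```python
import random


def search_probabilities(visit_counts, tau):
    """Return pi with pi_a proportional to N(s, a)^(1/tau).

    tau == 0 is treated as the limit tau -> 0 from the right: all the
    probability goes to the most-visited move (split evenly on ties).
    """
    if not visit_counts or max(visit_counts) <= 0:
        raise ValueError("need at least one move with a positive visit count")
    if tau < 0:
        raise ValueError("temperature must be non-negative")

    best = max(visit_counts)
    if tau == 0:
        # Limiting case: deterministic choice of the maximum.
        winners = [n == best for n in visit_counts]
        share = 1.0 / sum(winners)
        return [share if w else 0.0 for w in winners]

    # Divide by the largest count first so that a small tau (a large
    # exponent) cannot overflow; the normalization cancels the factor.
    weights = [(n / best) ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]


def select_move(visit_counts, move_number, rng=random, exploration_moves=30):
    """Pick a move index using the two-value schedule quoted from the paper.

    Assumption: move_number is 0-based, so moves 0..29 use tau = 1 and
    every later move uses the tau -> 0 limit.
    """
    tau = 1.0 if move_number < exploration_moves else 0.0
    pi = search_probabilities(visit_counts, tau)
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]
```

With visit counts [10, 30, 60], tau = 1 gives probabilities proportional to
the counts, and tau = 0 always picks the third move. A gradual decay would
just mean passing intermediate values of tau, which they apparently never do.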

Álvaro.




On Tue, Nov 7, 2017 at 1:07 PM, Imran Hendley <imran.hend...@gmail.com>
wrote:

> Hi, I might be having trouble understanding the self-play policy for
> AlphaGo Zero. Can someone let me know if I'm on the right track here?
>
> The paper states:
>
> In each position s, an MCTS search is executed, guided by the neural
> network f_θ. The MCTS search outputs probabilities π of playing each move.
>
>
> This wasn't clear at first since MCTS outputs wins and visits, but later
> the paper explains further:
>
> MCTS may be viewed as a self-play algorithm that, given neural
> network parameters θ and a root position s, computes a vector of search
> probabilities recommending moves to play, π = α_θ(s), proportional to
> the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where
> τ is a temperature parameter.
>
>
> So this makes sense, but when I looked for the schedule for decaying the
> temperature all I found was the following in the Self-play section of
> Methods:
>
>
> For the first 30 moves of each game, the temperature is set to τ = 1; this
> selects moves proportionally to their visit count in MCTS, and ensures a
> diverse set of positions are encountered. For the remainder of the game, an
> infinitesimal temperature is used, τ → 0.
>
> This sounds like they are sampling proportional to visits for the first 30
> moves since τ = 1 makes the exponent go away, and after that they are
> playing the move with the most visits, since the probability of the move
> with the most visits goes to 1 and the probability of all other moves goes
> to zero in the expression
> π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ)
> as τ goes to 0 from the right.
>
> Am I understanding this correctly? I am confused because it seems a little
> convoluted to define this simple policy in terms of a temperature. When
> they mentioned temperature I was expecting something that slowly decays
> over time rather than only taking two trivial values.
>
> Thanks!
>
>
> _______________________________________________
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>