Re: [Computer-go] AlphaGo Zero self-play temperature
It's interesting that they left unused parameters or unnecessary parameterizations in the paper. It telegraphs what was being tried, as opposed to simply writing something more concise and leaving the reader to wonder why and how those decisions were made.

s.

On Nov 7, 2017 10:54 PM, "Imran Hendley" wrote:
> Great, thanks guys!

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
Re: [Computer-go] AlphaGo Zero self-play temperature
Great, thanks guys!

On Tue, Nov 7, 2017 at 1:51 PM, Gian-Carlo Pascutto wrote:
> On 7/11/2017 19:07, Imran Hendley wrote:
> > Am I understanding this correctly?
>
> Yes.
>
> It's possible they had in-betweens or experimented with variations at
> some point, then settled on the simplest case. You can vary the
> randomness if you define it as a softmax with varying temperature,
> that's harder if you only define the policy as select best or select
> proportionally.
>
> --
> GCP
Re: [Computer-go] AlphaGo Zero self-play temperature
On 7/11/2017 19:07, Imran Hendley wrote:
> Am I understanding this correctly?

Yes.

It's possible they had in-betweens or experimented with variations at some point, then settled on the simplest case. You can vary the randomness if you define it as a softmax with varying temperature; that's harder if you only define the policy as "select best" or "select proportionally".

--
GCP
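To connect the softmax remark with the formula quoted from the paper elsewhere in this thread: the exponentiated visit counts can be rewritten as a softmax over log visit counts with temperature τ. This is a short worked identity, not something the paper spells out in this form, and it assumes τ > 0 and at least one move with a positive visit count (a move with N = 0 simply gets probability 0).

```latex
\pi(a \mid s_0)
  = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}
  = \frac{\exp\bigl(\log N(s_0, a) / \tau\bigr)}{\sum_b \exp\bigl(\log N(s_0, b) / \tau\bigr)}

% tau = 1: sampling proportional to visit counts
\pi(a \mid s_0)\Big|_{\tau = 1} = \frac{N(s_0, a)}{\sum_b N(s_0, b)}

% tau -> 0+: all mass moves to the most-visited move (assuming it is unique)
\lim_{\tau \to 0^+} \pi(a \mid s_0) =
  \begin{cases}
    1 & \text{if } a = \arg\max_b N(s_0, b) \\
    0 & \text{otherwise}
  \end{cases}
```

So "select proportionally" and "select best" are the τ = 1 and τ → 0 ends of the same family, and any intermediate τ gives an in-between amount of randomness.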
Re: [Computer-go] AlphaGo Zero self-play temperature
If I understand your question correctly, "goes to 1" can happen as quickly or slowly as you'd like. Yes?

On Nov 7, 2017 7:26 PM, "Imran Hendley" wrote:
> Hi, I might be having trouble understanding the self-play policy for AlphaGo Zero. Can someone let me know if I'm on the right track here?
>
> The paper states:
>
>     In each position s, an MCTS search is executed, guided by the neural network f_θ. The MCTS search outputs probabilities π of playing each move.
>
> This wasn't clear at first since MCTS outputs wins and visits, but later the paper explains further:
>
>     MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
>
> So this makes sense, but when I looked for the schedule for decaying the temperature all I found was the following in the Self-play section of Methods:
>
>     For the first 30 moves of each game, the temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal temperature is used, τ→0.
>
> This sounds like they are sampling proportional to visits for the first 30 moves, since τ = 1 makes the exponent go away, and after that they are playing the move with the most visits, since the probability of the move with the most visits goes to 1 and the probability of all other moves goes to zero in the expression
>
>     π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ)
>
> as τ goes to 0 from the right.
>
> Am I understanding this correctly? I am confused because it seems a little convoluted to define this simple policy in terms of a temperature. When they mentioned temperature I was expecting something that slowly decays over time rather than only taking two trivial values.
>
> Thanks!
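The reading described in the question above can be sketched in a few lines of Python. This is only an illustration of the quoted formula and the quoted two-value schedule, not DeepMind's code: the function names, the zero-based move numbering, and the choice to represent τ→0 as `tau == 0` (with ties split evenly) are all assumptions made for this example.

```python
import random


def search_probabilities(visits, tau):
    """Return pi_a proportional to N_a^(1/tau).

    tau == 0 is treated as the limit tau -> 0+ (an assumption of this sketch).
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if not visits or max(visits) <= 0:
        raise ValueError("need at least one move with a positive visit count")
    top = max(visits)
    if tau == 0:
        # Limit case: all mass on the most-visited move(s), split evenly on ties.
        winners = [1.0 if n == top else 0.0 for n in visits]
        count = sum(winners)
        return [w / count for w in winners]
    # Divide by the largest count first so the power cannot overflow for small tau.
    weights = [(n / top) ** (1.0 / tau) for n in visits]
    total = sum(weights)
    return [w / total for w in weights]


def temperature_for_move(move_number, cutoff=30):
    """Schedule as quoted from the paper: tau = 1 for the first 30 moves, then tau -> 0.

    move_number is zero-based here, which is an assumption of this sketch.
    """
    return 1.0 if move_number < cutoff else 0.0


def select_move(visits, move_number, rng=random):
    """Sample a move index from the search probabilities for this move number."""
    pi = search_probabilities(visits, temperature_for_move(move_number))
    return rng.choices(range(len(visits)), weights=pi, k=1)[0]
```

With visit counts `[10, 30, 60]`, `search_probabilities(visits, 1.0)` is proportional to the counts, while `search_probabilities(visits, 0)` returns `[0.0, 0.0, 1.0]`, so `select_move` after the cutoff always picks the most-visited move.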
Re: [Computer-go] AlphaGo Zero self-play temperature
Your understanding matches mine. My guess is that they had a temperature parameter in the code that would allow for things like slowly transitioning from random sampling to deterministically picking the maximum, but they ended up using only those particular values.

Álvaro.

On Tue, Nov 7, 2017 at 1:07 PM, Imran Hendley wrote:
> Am I understanding this correctly? I am confused because it seems a little
> convoluted to define this simple policy in terms of a temperature. When
> they mentioned temperature I was expecting something that slowly decays
> over time rather than only taking two trivial values.
>
> Thanks!
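For contrast with the two-value schedule in the paper, here is what a slowly decaying temperature of the kind guessed at above might look like. Everything in this sketch is hypothetical: the linear shape, the names, and the default values (`floor=0.05`, `decay_moves=60`) are invented for illustration and do not come from the paper or from any AlphaGo Zero code.

```python
def decayed_temperature(move_number, start=1.0, floor=0.05, decay_moves=60):
    """Hypothetical linear schedule: tau falls from `start` to `floor` over `decay_moves` moves."""
    if decay_moves <= 0:
        raise ValueError("decay_moves must be positive")
    # Clamp progress to [0, 1] so tau stays at `floor` once the decay is finished.
    fraction = min(max(move_number / decay_moves, 0.0), 1.0)
    return start + (floor - start) * fraction


def sharpen(visits, tau):
    """pi_a proportional to N_a^(1/tau), for tau > 0 and at least one positive count."""
    top = max(visits)
    # Normalising by the largest count keeps the power from overflowing.
    weights = [(n / top) ** (1.0 / tau) for n in visits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    visits = [10, 30, 60]
    for move in (0, 30, 60):
        tau = decayed_temperature(move)
        print(move, round(tau, 3), [round(p, 3) for p in sharpen(visits, tau)])
```

As the move number grows, τ shrinks and the distribution concentrates on the most-visited move, which is the gradual transition from proportional sampling to picking the maximum that a single temperature parameter would make possible.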