Re: [Computer-go] AlphaGo Zero self-play temperature

2017-11-07 Thread uurtamo .
It's interesting to leave unused parameters or unnecessary
parameterizations in the paper. It telegraphs what was being tried as
opposed to simply writing something more concise and leaving the reader to
wonder why and how those decisions were made.

s.

On Nov 7, 2017 10:54 PM, "Imran Hendley"  wrote:

> Great, thanks guys!
>
> On Tue, Nov 7, 2017 at 1:51 PM, Gian-Carlo Pascutto  wrote:
>
>> On 7/11/2017 19:07, Imran Hendley wrote:
>> > Am I understanding this correctly?
>>
>> Yes.
>>
>> It's possible they had in-betweens or experimented with variations at
>> some point, then settled on the simplest case. You can vary the
>> randomness if you define it as a softmax with varying temperature;
>> that's harder if you only define the policy as select best or select
>> proportionally.
>>
>> --
>> GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] AlphaGo Zero self-play temperature

2017-11-07 Thread Imran Hendley
Great, thanks guys!

On Tue, Nov 7, 2017 at 1:51 PM, Gian-Carlo Pascutto  wrote:

> On 7/11/2017 19:07, Imran Hendley wrote:
> > Am I understanding this correctly?
>
> Yes.
>
> It's possible they had in-betweens or experimented with variations at
> some point, then settled on the simplest case. You can vary the
> randomness if you define it as a softmax with varying temperature;
> that's harder if you only define the policy as select best or select
> proportionally.
>
> --
> GCP

Re: [Computer-go] AlphaGo Zero self-play temperature

2017-11-07 Thread Gian-Carlo Pascutto
On 7/11/2017 19:07, Imran Hendley wrote:
> Am I understanding this correctly?

Yes.

It's possible they had in-betweens or experimented with variations at
some point, then settled on the simplest case. You can vary the
randomness if you define it as a softmax with varying temperature;
that's harder if you only define the policy as select best or select
proportionally.

-- 
GCP

Re: [Computer-go] AlphaGo Zero self-play temperature

2017-11-07 Thread uurtamo .
If I understand your question correctly, "goes to 1" can happen as quickly
or slowly as you'd like. Yes?

On Nov 7, 2017 7:26 PM, "Imran Hendley"  wrote:

Hi, I might be having trouble understanding the self-play policy for
AlphaGo Zero. Can someone let me know if I'm on the right track here?

The paper states:

In each position s, an MCTS search is executed, guided by the neural
network f_θ. The MCTS search outputs probabilities π of playing each move.


This wasn't clear at first since MCTS outputs wins and visits, but later
the paper explains further:

MCTS may be viewed as a self-play algorithm that, given neural
network parameters θ and a root position s, computes a vector of search
probabilities recommending moves to play, π = α_θ(s), proportional to
the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where
τ is a temperature parameter.


So this makes sense, but when I looked for the schedule for decaying the
temperature all I found was the following in the Self-play section of
Methods:


For the first 30 moves of each game, the temperature is set to τ = 1; this
selects moves proportionally to their visit count in MCTS, and ensures a
diverse set of positions are encountered. For the remainder of the game, an
infinitesimal temperature is used, τ → 0.

This sounds like they are sampling proportional to visits for the first 30
moves since τ = 1 makes the exponent go away, and after that they are
playing the move with the most visits, since the probability of the move
with the most visits goes to 1 and the probability of all other moves goes
to zero in the expression π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ)
as τ goes to 0 from the right.
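
To make the two regimes concrete, here is a minimal Python sketch of the expression above. It is only an illustration, not code from the paper: the function name search_probabilities, the 1e-6 cutoff used to stand in for "τ → 0", and the example visit counts are all assumptions, and it assumes at least one move has a nonzero visit count.

```python
import numpy as np

def search_probabilities(visit_counts, tau):
    """Return pi_a proportional to N(s_0, a)^(1/tau) over the root moves.

    Hypothetical helper for illustration; assumes some count is > 0.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau <= 1e-6:
        # Limit tau -> 0 from the right: all probability mass moves to the
        # most-visited move (split evenly if several moves tie).
        best = counts == counts.max()
        return best / best.sum()
    # Work in log space so a large exponent 1/tau cannot overflow.
    with np.errstate(divide="ignore"):  # log(0) = -inf is fine here
        logits = np.log(counts) / tau
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

counts = [10, 30, 60]                       # made-up visit counts
pi_one = search_probabilities(counts, 1.0)  # proportional to visits
pi_zero = search_probabilities(counts, 0.0) # one-hot on the argmax

# Sampling a move index from the tau = 1 distribution:
rng = np.random.default_rng(0)
move = rng.choice(len(counts), p=pi_one)
```

With τ = 1 the result is just the normalized visit counts (0.1, 0.3, 0.6 here), and with τ = 0 it is (0, 0, 1). Values in between, such as τ = 0.5, sharpen the distribution without making it deterministic.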

Am I understanding this correctly? I am confused because it seems a little
convoluted to define this simple policy in terms of a temperature. When
they mentioned temperature I was expecting something that slowly decays
over time rather than only taking two trivial values.

Thanks!



Re: [Computer-go] AlphaGo Zero self-play temperature

2017-11-07 Thread Álvaro Begué
Your understanding matches mine. My guess is that they had a temperature
parameter in the code that would allow for things like slowly transitioning
from random sampling to deterministically picking the maximum, but they
ended up using only those particular values.
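
As a rough sketch of that guess, the two-value schedule quoted from the paper and a gradual alternative could sit behind the same kind of function. Everything named here is hypothetical: step_temperature, linear_decay_temperature, the 0-based move_number, and the 30-to-60 decay window are assumptions for illustration, and the linear decay in particular is not something the paper describes.

```python
def step_temperature(move_number, cutoff=30):
    """Schedule as quoted from the paper: tau = 1 for the first `cutoff`
    moves, then tau -> 0 (represented here as exactly 0.0).

    Assumes move_number counts from 0.
    """
    return 1.0 if move_number < cutoff else 0.0

def linear_decay_temperature(move_number, start=30, end=60):
    """Hypothetical alternative, NOT from the paper: hold tau = 1 until
    `start`, then decay linearly to 0 by move `end`.
    """
    if move_number < start:
        return 1.0
    if move_number >= end:
        return 0.0
    return 1.0 - (move_number - start) / (end - start)

# Compare the two schedules at a few move numbers.
for n in (0, 29, 30, 45, 60):
    print(n, step_temperature(n), linear_decay_temperature(n))
```

Either function would feed τ into whatever turns visit counts into move probabilities; the point is only that a single temperature parameter covers both the paper's schedule and smoother ones.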

Álvaro.




On Tue, Nov 7, 2017 at 1:07 PM, Imran Hendley wrote:

> Hi, I might be having trouble understanding the self-play policy for
> AlphaGo Zero. Can someone let me know if I'm on the right track here?
>
> The paper states:
>
> In each position s, an MCTS search is executed, guided by the neural
> network f_θ. The MCTS search outputs probabilities π of playing each move.
>
>
> This wasn't clear at first since MCTS outputs wins and visits, but later
> the paper explains further:
>
> MCTS may be viewed as a self-play algorithm that, given neural
> network parameters θ and a root position s, computes a vector of search
> probabilities recommending moves to play, π = α_θ(s), proportional to
> the exponentiated visit count for each move, π_a ∝ N(s, a)^(1/τ), where
> τ is a temperature parameter.
>
>
> So this makes sense, but when I looked for the schedule for decaying the
> temperature all I found was the following in the Self-play section of
> Methods:
>
>
> For the first 30 moves of each game, the temperature is set to τ = 1; this
> selects moves proportionally to their visit count in MCTS, and ensures a
> diverse set of positions are encountered. For the remainder of the game, an
> infinitesimal temperature is used, τ → 0.
>
> This sounds like they are sampling proportional to visits for the first 30
> moves since τ = 1 makes the exponent go away, and after that they are
> playing the move with the most visits, since the probability of the move
> with the most visits goes to 1 and the probability of all other moves goes
> to zero in the expression π(a | s_0) = N(s_0, a)^(1/τ) / ∑_b N(s_0, b)^(1/τ)
> as τ goes to 0 from the right.
>
> Am I understanding this correctly? I am confused because it seems a little
> convoluted to define this simple policy in terms of a temperature. When
> they mentioned temperature I was expecting something that slowly decays
> over time rather than only taking two trivial values.
>
> Thanks!