Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-07 Thread Gian-Carlo Pascutto
On 03-12-17 21:39, Brian Lee wrote:
> It should default to the Q of the parent node. Otherwise, let's say that
> the root node is a losing position. Upon choosing a followup move, the Q
> will be updated to a very negative value, and that node won't get
> explored again - at least until all 362 top-level children have been
> explored and revealed to have negative values. So without initializing Q
> to the parent's Q, you would end up wasting 362 MCTS iterations.

Note that the same argument could be made for making it 0, which some
people think the AGZ paper implies, so the above can't be the entire
explanation.

That said, empirical testing indicates that initializing Q(s, a) to the
parent is indeed a well-performing setting for both strong and weak
policy networks.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Aja Huang
2017-12-06 13:52 GMT+00:00 Gian-Carlo Pascutto :

> On 06-12-17 11:47, Aja Huang wrote:
> > All I can say is that first-play-urgency is not a significant
> > technical detail, and that's why we didn't specify it in the paper.
>
> I will have to disagree here. Of course, it's always possible I'm
> misunderstanding something, or I have a program bug that I'm mixing up
> with this.
>

Whether I agree with you or not, unfortunately it's not up to me to
decide whether I can answer the question, even if I am personally happy to
(in fact, this post may already be pushing that boundary a bit). I hope
you understand, and good luck with making it work.

I'm very happy the two Go papers we published have helped the Go community.
My dream was fulfilled and I've switched to pursue other challenges. :)

Aja


> Or maybe you mean that you expect the program to improve regardless of
> this setting. In any case, I've now seen people state here twice that
> this is a detail that doesn't matter. But practical results suggest
> otherwise.
>
> For a strong supervised network, FPU=0 (i.e. not exploring all successor
> nodes for a longer time, relying strongly on policy priors) is much
> stronger. I've seen this in Leela Zero after we tested it, and I've
> known it to be true from regular Leela for a long time. IIRC, the strong
> open source Go bots also use some form of progressive widening, which
> produces the same effect.
>
> For a weak RL network without much useful policy priors, FPU>1 is much
> stronger than FPU=0.
>
> Now these are relative scores of course, so one could argue they don't
> affect the learning process. But they actually do that as well!
>
> The new AZ paper uses MCTS playouts = 800, and plays proportionally
> according to MCTS output. (Previous AGZ had playouts = 1600,
> proportional for first 30 moves).
>
> Consider what this means for the search probability outputs, exactly the
> thing the policy network has to learn. With FPU=1, the move
> probabilities are much more uniform, and the moves played are
> consequently much more likely to be bad or even blunders, because
> there are fewer playouts that can be spent on the best move, even if it
> was found.
>
> > The initial value of Q is not very important because Q+U is
> > dominated by the U piece when the number of visits is small.
>
> a = Q(s, a) + coeff * P(s, a) * sqrt(parent->visits) / (1.0f +
> child->visits());
>
> Assume parent->visits = 100, sqrt = 10
> Assume child->visits = 0
> Assume P(s, a) = 0.0027 (near uniform prior for "weak" network)
>
> The rightmost part of this (the U term) is ~1. This clearly does not
> dominate the Q term. If Q > 1 (classic FPU) then every child node will
> get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
> (largest policy prior) will get something like 10 expansions before
> another child gets picked. That's a massive difference in search tree
> shape, *especially* with only 800 total playouts.
>
> --
> GCP
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Gian-Carlo Pascutto
On 06-12-17 11:47, Aja Huang wrote:
> All I can say is that first-play-urgency is not a significant 
> technical detail, and that's why we didn't specify it in the paper.

I will have to disagree here. Of course, it's always possible I'm
misunderstanding something, or I have a program bug that I'm mixing up
with this.

Or maybe you mean that you expect the program to improve regardless of
this setting. In any case, I've now seen people state here twice that
this is a detail that doesn't matter. But practical results suggest otherwise.

For a strong supervised network, FPU=0 (i.e. not exploring all successor
nodes for a longer time, relying strongly on policy priors) is much
stronger. I've seen this in Leela Zero after we tested it, and I've
known it to be true from regular Leela for a long time. IIRC, the strong
open source Go bots also use some form of progressive widening, which
produces the same effect.
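
For reference, a minimal sketch of what I mean by progressive widening (the
constants and function are my own illustration, not any particular bot's
code): only the top-k children by policy prior are eligible for selection,
with k growing as the parent accumulates visits. Children are assumed to be
pre-sorted by prior.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Number of children (sorted by policy prior) eligible for selection.
// k = base + parent_visits^alpha; base and alpha are assumed constants.
int widened_child_count(int parent_visits, int num_children) {
    const double base = 2.0;
    const double alpha = 0.4;
    int k = static_cast<int>(base + std::pow(static_cast<double>(parent_visits), alpha));
    return std::min(k, num_children);
}

int main() {
    const int vs[] = {1, 10, 100, 1000, 10000};
    for (int v : vs)
        std::printf("parent visits %5d -> consider top %2d children\n",
                    v, widened_child_count(v, 362));
    return 0;
}

With constants like these, a node with only a handful of visits considers
just its few highest-prior moves, which is why the effect resembles a low
FPU: early on, the policy prior does the work.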

For a weak RL network without much useful policy priors, FPU>1 is much
stronger than FPU=0.

Now these are relative scores of course, so one could argue they don't
affect the learning process. But they actually do that as well!

The new AZ paper uses MCTS playouts = 800, and plays proportionally
according to MCTS output. (Previous AGZ had playouts = 1600,
proportional for first 30 moves).

Consider what this means for the search probability outputs, exactly the
thing the policy network has to learn. With FPU=1, the move
probabilities are much more uniform, and the moves played are
consequently much more likely to be bad or even blunders, because
there are fewer playouts that can be spent on the best move, even if it
was found.
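
To spell out what that training target is, here is a small sketch (the visit
counts are invented) of turning root visit counts into the search
probabilities and then playing proportionally, i.e. at temperature 1:

#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::vector<int> visits = {420, 180, 90, 60, 50};   // invented root visit counts, total 800
    const double total = std::accumulate(visits.begin(), visits.end(), 0);

    std::vector<double> pi(visits.size());
    for (std::size_t a = 0; a < visits.size(); ++a) {
        pi[a] = visits[a] / total;                      // pi(a) = N(a) / sum_b N(b)
        std::printf("move %zu: pi = %.3f\n", a, pi[a]);
    }

    // Play proportionally to pi, as in the AZ setting described above.
    std::mt19937 rng(42);
    std::discrete_distribution<int> pick(pi.begin(), pi.end());
    std::printf("sampled move: %d\n", pick(rng));
    return 0;
}

With only 800 playouts, a poor FPU setting changes these counts a lot, and
therefore changes the distribution the policy net is asked to reproduce.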

> The initial value of Q is not very important because Q+U is
> dominated by the U piece when the number of visits is small.

a = Q(s, a) + coeff * P(s, a) * sqrt(parent->visits) / (1.0f +
child->visits());

Assume parent->visits = 100, sqrt = 10
Assume child->visits = 0
Assume P(s, a) = 0.0027 (near uniform prior for "weak" network)

The rightmost part of this (the U term) is ~1. This clearly does not
dominate the Q term. If Q > 1 (classic FPU) then every child node will
get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
(largest policy prior) will get something like 10 expansions before
another child gets picked. That's a massive difference in search tree
shape, *especially* with only 800 total playouts.
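
To put a rough number on that, a toy simulation (the exploration constant and
the 2% top prior are assumptions of mine, and every evaluation is simplified
to return 0):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const double coeff = 5.0;                // assumed exploration constant
    std::vector<double> prior(362, 0.0027);  // near-uniform "weak" policy
    prior[0] = 0.02;                         // assumed largest policy prior
    std::vector<int> visits(362, 0);
    int parent_visits = 100;

    for (int step = 0; step < 50; ++step) {
        int best = 0;
        double best_val = -1e9;
        for (int a = 0; a < 362; ++a) {
            const double q = 0.0;  // FPU = 0, evaluations simplified to 0
            const double u = coeff * prior[a] *
                             std::sqrt(static_cast<double>(parent_visits)) /
                             (1.0 + visits[a]);
            if (q + u > best_val) { best_val = q + u; best = a; }
        }
        if (best != 0) {
            std::printf("a sibling is first selected after %d visits to the top move\n",
                        visits[0]);
            break;
        }
        ++visits[0];
        ++parent_visits;
    }
    return 0;
}

With a stronger policy (a larger top prior) that number grows quickly, which
is exactly the difference in search tree shape I mean.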

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Andy
Thanks for letting us know the situation Aja. It must be hard for an
engineer to not be able to discuss the details of his work!

As for the first-play-urgency value, if we indulge in some reading between
the lines: It's possible to interpret the paper as saying
first-play-urgency is zero. After rereading it myself that's the way I read
it now. But if that were true, maybe Aja would simply have said "guys, the paper
already says it is zero." That would imply it's actually some other value.

That is probably reading far too much into Aja's reply, but it's something
to think about.


2017-12-06 4:47 GMT-06:00 Aja Huang :

>
>
> 2017-12-06 9:23 GMT+00:00 Gian-Carlo Pascutto :
>
>> On 03-12-17 17:57, Rémi Coulom wrote:
>> > They have a Q(s,a) term in their node-selection formula, but they
>> > don't tell what value they give to an action that has not yet been
>> > visited. Maybe Aja can tell us.
>>
>> FWIW I already asked Aja this exact question a bit after the paper came
>> out and he told me he cannot answer questions about unpublished details.
>>
>
> Yes, I did ask my manager if I could answer your question but he
> specifically said no. All I can say is that first-play-urgency is not a
> significant technical detail, and that's why we didn't specify it in the
> paper.
>
> Aja
>
>
>
>> This is not very promising regarding reproducibility considering the AZ
>> paper is even lighter on them.
>>
>> Another issue which is up in the air is whether the choice of the number
>> of playouts for the MCTS part represents an implicit balancing between
>> self-play and training speed. This is particularly relevant if the
>> evaluation step is removed. But it's possible even DeepMind doesn't know
>> the answer for sure. They had a setup, and they optimized it. It's not
>> clear which parts generalize.
>>
>> (Usually one wonders about such things in terms of algorithms, but here
>> one wonders about it in terms of hardware!)
>>
>> --
>> GCP
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Aja Huang
2017-12-06 9:23 GMT+00:00 Gian-Carlo Pascutto :

> On 03-12-17 17:57, Rémi Coulom wrote:
> > They have a Q(s,a) term in their node-selection formula, but they
> > don't tell what value they give to an action that has not yet been
> > visited. Maybe Aja can tell us.
>
> FWIW I already asked Aja this exact question a bit after the paper came
> out and he told me he cannot answer questions about unpublished details.
>

Yes, I did ask my manager if I could answer your question but he
specifically said no. All I can say is that first-play-urgency is not a
significant technical detail, and that's why we didn't specify it in the
paper.

Aja



> This is not very promising regarding reproducibility considering the AZ
> paper is even lighter on them.
>
> Another issue which is up in the air is whether the choice of the number
> of playouts for the MCTS part represents an implicit balancing between
> self-play and training speed. This is particularly relevant if the
> evaluation step is removed. But it's possible even DeepMind doesn't know
> the answer for sure. They had a setup, and they optimized it. It's not
> clear which parts generalize.
>
> (Usually one wonders about such things in terms of algorithms, but here
> one wonders about it in terms of hardware!)
>
> --
> GCP
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
I made a pull request to Leela, and put some data in there. It shows the
details of how Q is initialized are actually important:
https://github.com/gcp/leela-zero/pull/238


2017-12-03 19:56 GMT-06:00 Álvaro Begué :

> You are asking about the selection of the move that goes to a leaf. When
> the node before the move was expanded (in a previous playout), the value of
> Q(s,a) for that move was initialized to 0.
>
> The UCB-style formula they use in the tree part of the playout is such
> that the first few visits will follow the probability distribution from the
> policy output of the network, and over time it converges to using primarily
> the moves that have best results. So the details of how Q is initialized
> are not very relevant.
>
>
> On Sun, Dec 3, 2017 at 5:11 PM, Andy  wrote:
>
>> Álvaro, you are quoting from "Expand and evaluate (Figure 2b)". But my
>> question is about the section before that "Select (Figure 2a)". So the node
>> has not been expanded+initialized.
>>
>> As Brian Lee mentioned, his MuGo uses the parent's value, which assumes
>> without further information the value should be close to the same as before.
>>
>> LeelaZ uses 1.1 for a "first play urgency", which assumes you should
>> prioritize getting at least one evaluation from the NN for each node.
>> https://github.com/gcp/leela-zero/blob/master/src/UCTNode.cpp#L323
>>
>> Finally using a value of 0 would seem to place extra confidence in the
>> policy net values.
>>
>> I feel like MuGo's implementation makes sense, but I'm trying to get some
>> experimental evidence showing the impact before suggesting it to Leela's
>> author. So far my self-play tests with different settings do not show a big
>> impact, but I am changing other variables at the same time.
>>
>> - Andy
>>
>>
>>
>> 2017-12-03 14:30 GMT-06:00 Álvaro Begué :
>>
>>> The text in the appendix has the answer, in a paragraph titled "Expand
>>> and evaluate (Fig. 2b)":
>>>   "[...] The leaf node is expanded and and each edge (s_t, a) is
>>> initialized to {N(s_t, a) = 0, W(s_t, a) = 0, Q(s_t, a) = 0, P(s_t, a) =
>>> p_a}; [...]"
>>>
>>>
>>>
>>> On Sun, Dec 3, 2017 at 11:27 AM, Andy  wrote:
>>>
 Figure 2a shows two bolded Q+U max values. The second one is going to a
 leaf that doesn't exist yet, i.e. not expanded yet. Where do they get that
 Q value from?

 The associated text doesn't clarify the situation: "Figure 2:
 Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the
 tree by selecting the edge with maximum action-value Q, plus an upper
 confidence bound U that depends on a stored prior probability P and visit
 count N for that edge (which is incremented once traversed). b The leaf
 node is expanded..."






 2017-12-03 9:44 GMT-06:00 Álvaro Begué :

> I am not sure where in the paper you think they use Q(s,a) for a node
> s that hasn't been expanded yet. Q(s,a) is a property of an edge of the
> graph. At a leaf they only use the `value' output of the neural network.
>
> If this doesn't match your understanding of the paper, please point to
> the specific paragraph that you are having trouble with.
>
> Álvaro.
>
>
>
> On Sun, Dec 3, 2017 at 9:53 AM, Andy  wrote:
>
>> I don't see the AGZ paper explain what the mean action-value Q(s,a)
>> should be for a node that hasn't been expanded yet. The equation for 
>> Q(s,a)
>> has the term 1/N(s,a) in it because it's supposed to average over N(s,a)
>> visits. But in this case N(s,a)=0 so that won't work.
>>
>> Does anyone know how this is supposed to work? Or is it another
>> detail AGZ didn't spell out?
>>
>>
>>
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>


 ___
 Computer-go mailing list
 Computer-go@computer-go.org
 http://computer-go.org/mailman/listinfo/computer-go

>>>
>>>
>>> ___
>>> Computer-go mailing list
>>> Computer-go@computer-go.org
>>> http://computer-go.org/mailman/listinfo/computer-go
>>>
>>
>>
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
Álvaro, you are quoting from "Expand and evaluate (Figure 2b)". But my
question is about the section before that "Select (Figure 2a)". So the node
has not been expanded+initialized.

As Brian Lee mentioned, his MuGo uses the parent's value, which assumes
without further information the value should be close to the same as before.

LeelaZ uses 1.1 for a "first play urgency", which assumes you should
prioritize getting at least one evaluation from the NN for each node.
https://github.com/gcp/leela-zero/blob/master/src/UCTNode.cpp#L323

Finally using a value of 0 would seem to place extra confidence in the
policy net values.
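
To make the three options concrete, here is a little sketch (names and the
value convention are mine, not MuGo's or Leela's actual code) of how each one
shows up when computing Q for a zero-visit child:

#include <cstdio>

enum class Fpu { ParentQ, Fixed, Zero };

// Q used in Q + U for a child that has never been visited.
// Values are treated on a single loose scale here; real engines differ
// ([0,1] winrate in Leela vs [-1,1] in the AGZ paper).
double first_play_q(Fpu mode, double parent_q) {
    switch (mode) {
        case Fpu::ParentQ: return parent_q; // MuGo: assume child is about as good as its parent
        case Fpu::Fixed:   return 1.1;      // LeelaZ: force one NN evaluation of every child first
        case Fpu::Zero:    return 0.0;      // one reading of the AGZ paper: rely on the prior early
    }
    return 0.0;
}

// Q(s,a) = W(s,a) / N(s,a) once visited, otherwise the FPU value.
double child_q(int n, double w, Fpu mode, double parent_q) {
    return n > 0 ? w / n : first_play_q(mode, parent_q);
}

int main() {
    const double parent_q = -0.5; // e.g. a losing parent position
    std::printf("unvisited child Q: parent=%.2f fixed=%.2f zero=%.2f\n",
                child_q(0, 0.0, Fpu::ParentQ, parent_q),
                child_q(0, 0.0, Fpu::Fixed, parent_q),
                child_q(0, 0.0, Fpu::Zero, parent_q));
    return 0;
}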

I feel like MuGo's implementation makes sense, but I'm trying to get some
experimental evidence showing the impact before suggesting it to Leela's
author. So far my self-play tests with different settings do not show a big
impact, but I am changing other variables at the same time.

- Andy



2017-12-03 14:30 GMT-06:00 Álvaro Begué :

> The text in the appendix has the answer, in a paragraph titled "Expand and
> evaluate (Fig. 2b)":
>   "[...] The leaf node is expanded and and each edge (s_t, a) is
> initialized to {N(s_t, a) = 0, W(s_t, a) = 0, Q(s_t, a) = 0, P(s_t, a) =
> p_a}; [...]"
>
>
>
> On Sun, Dec 3, 2017 at 11:27 AM, Andy  wrote:
>
>> Figure 2a shows two bolded Q+U max values. The second one is going to a
>> leaf that doesn't exist yet, i.e. not expanded yet. Where do they get that
>> Q value from?
>>
>> The associated text doesn't clarify the situation: "Figure 2: Monte-Carlo
>> tree search in AlphaGo Zero. a Each simulation traverses the tree by
>> selecting the edge with maximum action-value Q, plus an upper confidence
>> bound U that depends on a stored prior probability P and visit count N for
>> that edge (which is incremented once traversed). b The leaf node is
>> expanded..."
>>
>>
>>
>>
>>
>>
>> 2017-12-03 9:44 GMT-06:00 Álvaro Begué :
>>
>>> I am not sure where in the paper you think they use Q(s,a) for a node s
>>> that hasn't been expanded yet. Q(s,a) is a property of an edge of the
>>> graph. At a leaf they only use the `value' output of the neural network.
>>>
>>> If this doesn't match your understanding of the paper, please point to
>>> the specific paragraph that you are having trouble with.
>>>
>>> Álvaro.
>>>
>>>
>>>
>>> On Sun, Dec 3, 2017 at 9:53 AM, Andy  wrote:
>>>
 I don't see the AGZ paper explain what the mean action-value Q(s,a)
 should be for a node that hasn't been expanded yet. The equation for Q(s,a)
 has the term 1/N(s,a) in it because it's supposed to average over N(s,a)
 visits. But in this case N(s,a)=0 so that won't work.

 Does anyone know how this is supposed to work? Or is it another detail
 AGZ didn't spell out?



 ___
 Computer-go mailing list
 Computer-go@computer-go.org
 http://computer-go.org/mailman/listinfo/computer-go

>>>
>>>
>>> ___
>>> Computer-go mailing list
>>> Computer-go@computer-go.org
>>> http://computer-go.org/mailman/listinfo/computer-go
>>>
>>
>>
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Brian Lee
It should default to the Q of the parent node. Otherwise, let's say that
the root node is a losing position. Upon choosing a followup move, the Q
will be updated to a very negative value, and that node won't get explored
again - at least until all 362 top-level children have been explored and
revealed to have negative values. So without initializing Q to the parent's
Q, you would end up wasting 362 MCTS iterations.
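
A toy calculation of that gap (the exploration constant, priors, and visit
counts are just illustrative assumptions):

#include <cmath>
#include <cstdio>

int main() {
    const double coeff = 5.0, prior = 0.0027;      // near-uniform priors
    const double parent_q = -0.8;                  // the root is a lost position
    const int parent_visits = 10;
    const double sqrt_n = std::sqrt(static_cast<double>(parent_visits));

    // Already-visited child: its Q has collapsed to roughly the parent's value.
    double visited_child = parent_q + coeff * prior * sqrt_n / (1.0 + 1.0);
    // Unvisited siblings under two different Q defaults.
    double sibling_q0    = 0.0      + coeff * prior * sqrt_n / 1.0;  // default Q = 0
    double sibling_qpar  = parent_q + coeff * prior * sqrt_n / 1.0;  // default Q = parent's Q

    std::printf("visited child          Q+U = %.3f\n", visited_child);
    std::printf("unvisited, Q default 0 Q+U = %.3f (every sibling outranks the visited child)\n", sibling_q0);
    std::printf("unvisited, Q = parent  Q+U = %.3f (roughly a tie; the priors decide)\n", sibling_qpar);
    return 0;
}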

Brian


Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Álvaro Begué
The text in the appendix has the answer, in a paragraph titled "Expand and
evaluate (Fig. 2b)":
  "[...] The leaf node is expanded and and each edge (s_t, a) is
initialized to {N(s_t, a) = 0, W(s_t, a) = 0, Q(s_t, a) = 0, P(s_t, a) =
p_a}; [...]"



On Sun, Dec 3, 2017 at 11:27 AM, Andy  wrote:

> Figure 2a shows two bolded Q+U max values. The second one is going to a
> leaf that doesn't exist yet, i.e. not expanded yet. Where do they get that
> Q value from?
>
> The associated text doesn't clarify the situation: "Figure 2: Monte-Carlo
> tree search in AlphaGo Zero. a Each simulation traverses the tree by
> selecting the edge with maximum action-value Q, plus an upper confidence
> bound U that depends on a stored prior probability P and visit count N for
> that edge (which is incremented once traversed). b The leaf node is
> expanded..."
>
>
>
>
>
>
> 2017-12-03 9:44 GMT-06:00 Álvaro Begué :
>
>> I am not sure where in the paper you think they use Q(s,a) for a node s
>> that hasn't been expanded yet. Q(s,a) is a property of an edge of the
>> graph. At a leaf they only use the `value' output of the neural network.
>>
>> If this doesn't match your understanding of the paper, please point to
>> the specific paragraph that you are having trouble with.
>>
>> Álvaro.
>>
>>
>>
>> On Sun, Dec 3, 2017 at 9:53 AM, Andy  wrote:
>>
>>> I don't see the AGZ paper explain what the mean action-value Q(s,a)
>>> should be for a node that hasn't been expanded yet. The equation for Q(s,a)
>>> has the term 1/N(s,a) in it because it's supposed to average over N(s,a)
>>> visits. But in this case N(s,a)=0 so that won't work.
>>>
>>> Does anyone know how this is supposed to work? Or is it another detail
>>> AGZ didn't spell out?
>>>
>>>
>>>
>>> ___
>>> Computer-go mailing list
>>> Computer-go@computer-go.org
>>> http://computer-go.org/mailman/listinfo/computer-go
>>>
>>
>>
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Rémi Coulom
They have a Q(s,a) term in their node-selection formula, but they don't tell 
what value they give to an action that has not yet been visited. Maybe Aja can 
tell us.

- Original Message -
From: "Álvaro Begué" <alvaro.be...@gmail.com>
To: "computer-go" <computer-go@computer-go.org>
Sent: Sunday, December 3, 2017, 16:44:00
Subject: Re: [Computer-go] action-value Q for unexpanded nodes




I am not sure where in the paper you think they use Q(s,a) for a node s that 
hasn't been expanded yet. Q(s,a) is a property of an edge of the graph. At a 
leaf they only use the `value' output of the neural network. 

If this doesn't match your understanding of the paper, please point to the 
specific paragraph that you are having trouble with. 

Álvaro. 





On Sun, Dec 3, 2017 at 9:53 AM, Andy < andy.olsen...@gmail.com > wrote: 



I don't see the AGZ paper explain what the mean action-value Q(s,a) should be 
for a node that hasn't been expanded yet. The equation for Q(s,a) has the term 
1/N(s,a) in it because it's supposed to average over N(s,a) visits. But in this 
case N(s,a)=0 so that won't work. 


Does anyone know how this is supposed to work? Or is it another detail AGZ 
didn't spell out? 




___ 
Computer-go mailing list 
Computer-go@computer-go.org 
http://computer-go.org/mailman/listinfo/computer-go 


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
Figure 2a shows two bolded Q+U max values. The second one is going to a
leaf that doesn't exist yet, i.e. not expanded yet. Where do they get that
Q value from?

The associated text doesn't clarify the situation: "Figure 2: Monte-Carlo
tree search in AlphaGo Zero. a Each simulation traverses the tree by
selecting the edge with maximum action-value Q, plus an upper confidence
bound U that depends on a stored prior probability P and visit count N for
that edge (which is incremented once traversed). b The leaf node is
expanded..."






2017-12-03 9:44 GMT-06:00 Álvaro Begué :

> I am not sure where in the paper you think they use Q(s,a) for a node s
> that hasn't been expanded yet. Q(s,a) is a property of an edge of the
> graph. At a leaf they only use the `value' output of the neural network.
>
> If this doesn't match your understanding of the paper, please point to the
> specific paragraph that you are having trouble with.
>
> Álvaro.
>
>
>
> On Sun, Dec 3, 2017 at 9:53 AM, Andy  wrote:
>
>> I don't see the AGZ paper explain what the mean action-value Q(s,a)
>> should be for a node that hasn't been expanded yet. The equation for Q(s,a)
>> has the term 1/N(s,a) in it because it's supposed to average over N(s,a)
>> visits. But in this case N(s,a)=0 so that won't work.
>>
>> Does anyone know how this is supposed to work? Or is it another detail
>> AGZ didn't spell out?
>>
>>
>>
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Álvaro Begué
I am not sure where in the paper you think they use Q(s,a) for a node s
that hasn't been expanded yet. Q(s,a) is a property of an edge of the
graph. At a leaf they only use the `value' output of the neural network.
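
That value then only enters through the backup of the edges that were
traversed, something like this sketch (names and the sign convention are
mine, not taken from any particular implementation):

#include <vector>

struct Edge { int n = 0; double w = 0.0; double q = 0.0; };

// Propagate the leaf's value-head output v up the traversed edges,
// flipping the sign between the two players' perspectives.
void backup(std::vector<Edge*>& path, double leaf_value) {
    double v = leaf_value;
    for (auto it = path.rbegin(); it != path.rend(); ++it) {
        Edge* e = *it;
        e->n += 1;
        e->w += v;
        e->q = e->w / e->n;   // Q(s,a) = W(s,a) / N(s,a), only defined once N > 0
        v = -v;               // alternate perspective at each ply
    }
}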

If this doesn't match your understanding of the paper, please point to the
specific paragraph that you are having trouble with.

Álvaro.



On Sun, Dec 3, 2017 at 9:53 AM, Andy  wrote:

> I don't see the AGZ paper explain what the mean action-value Q(s,a) should
> be for a node that hasn't been expanded yet. The equation for Q(s,a) has
> the term 1/N(s,a) in it because it's supposed to average over N(s,a)
> visits. But in this case N(s,a)=0 so that won't work.
>
> Does anyone know how this is supposed to work? Or is it another detail AGZ
> didn't spell out?
>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go