Álvaro, you are quoting from "Expand and evaluate (Fig. 2b)". But my
question is about the section before that, "Select (Fig. 2a)". At that
point the node has not yet been expanded and initialized.

As Brian Lee mentioned, his MuGo uses the parent's value, on the assumption
that, absent further information, a child's value should be close to its
parent's.

Leela Zero uses 1.1 as a "first play urgency", which effectively
prioritizes getting at least one NN evaluation for each child before
revisiting any of them:
https://github.com/gcp/leela-zero/blob/master/src/UCTNode.cpp#L323

Finally, initializing to a value of 0 would seem to place extra confidence
in the policy net's priors.
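
To make the comparison concrete, here is a minimal Python sketch of the
select step under the three proposals. The edge representation, the
q_unvisited helper, and the c_puct value are my own assumptions for
illustration, not code taken from MuGo, Leela Zero, or the paper:

    import math

    C_PUCT = 1.5  # exploration constant; the value here is illustrative


    def q_unvisited(parent_value, mode):
        # Q estimate for an edge with N(s,a) == 0, under each proposal.
        if mode == "parent":   # MuGo-style: inherit the parent's value
            return parent_value
        if mode == "fpu":      # Leela Zero-style: optimistic first play urgency
            return 1.1
        return 0.0             # zero initialization, as in the AGZ appendix


    def select_edge(edges, parent_value, mode="parent"):
        # One PUCT select step: argmax over a of Q(s,a) + U(s,a), with
        # U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
        # Each edge is a dict holding the paper's statistics N, W, P.
        sqrt_total = math.sqrt(sum(e["N"] for e in edges))

        def score(e):
            q = e["W"] / e["N"] if e["N"] > 0 else q_unvisited(parent_value, mode)
            return q + C_PUCT * e["P"] * sqrt_total / (1 + e["N"])

        return max(edges, key=score)

The difference only shows up while N(s,a) = 0: with "parent" an unvisited
child starts out looking as good as its parent, with the 1.1 FPU it scores
above any achievable Q so every child tends to get one evaluation before
any is revisited, and with 0 the prior P mainly decides which unvisited
child is tried first.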

I feel like MuGo's approach makes sense, but I'm trying to get some
experimental evidence of the impact before suggesting it to Leela Zero's
author. So far my self-play tests with different settings do not show a
big effect, but I am changing other variables at the same time, so I
can't isolate it yet.

- Andy



2017-12-03 14:30 GMT-06:00 Álvaro Begué <alvaro.be...@gmail.com>:

> The text in the appendix has the answer, in a paragraph titled "Expand and
> evaluate (Fig. 2b)":
>   "[...] The leaf node is expanded and and each edge (s_t, a) is
> initialized to {N(s_t, a) = 0, W(s_t, a) = 0, Q(s_t, a) = 0, P(s_t, a) =
> p_a}; [...]"
>
>
>
> On Sun, Dec 3, 2017 at 11:27 AM, Andy <andy.olsen...@gmail.com> wrote:
>
>> Figure 2a shows two bolded Q+U max values. The second one is going to a
>> leaf that doesn't exist yet, i.e. not expanded yet. Where do they get that
>> Q value from?
>>
>> The associated text doesn't clarify the situation: "Figure 2: Monte-Carlo
>> tree search in AlphaGo Zero. a Each simulation traverses the tree by
>> selecting the edge with maximum action-value Q, plus an upper confidence
>> bound U that depends on a stored prior probability P and visit count N for
>> that edge (which is incremented once traversed). b The leaf node is
>> expanded..."
>>
>> 2017-12-03 9:44 GMT-06:00 Álvaro Begué <alvaro.be...@gmail.com>:
>>
>>> I am not sure where in the paper you think they use Q(s,a) for a node s
>>> that hasn't been expanded yet. Q(s,a) is a property of an edge of the
>>> graph. At a leaf they only use the "value" output of the neural network.
>>>
>>> If this doesn't match your understanding of the paper, please point to
>>> the specific paragraph that you are having trouble with.
>>>
>>> Álvaro.
>>>
>>>
>>>
>>> On Sun, Dec 3, 2017 at 9:53 AM, Andy <andy.olsen...@gmail.com> wrote:
>>>
>>>> I don't see the AGZ paper explain what the mean action-value Q(s,a)
>>>> should be for a node that hasn't been expanded yet. The equation for Q(s,a)
>>>> has the factor 1/N(s,a) in it because it averages the backed-up values over
>>>> N(s,a) visits. But in this case N(s,a) = 0, so that won't work.
>>>>
>>>> Does anyone know how this is supposed to work? Or is it another detail
>>>> AGZ didn't spell out?
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
