I made a pull request to Leela Zero and put some data in it. It shows that
the details of how Q is initialized are actually important:
https://github.com/gcp/leela-zero/pull/238


2017-12-03 19:56 GMT-06:00 Álvaro Begué <alvaro.be...@gmail.com>:

> You are asking about the selection of the move that leads to a leaf. When
> the node that move is played from was expanded (in a previous playout),
> Q(s,a) for that move was initialized to 0.
>
> The UCB-style formula they use in the tree part of the playout is such
> that the first few visits follow the probability distribution from the
> policy output of the network, and over time it converges to using mostly
> the moves with the best results. So the details of how Q is initialized
> are not very relevant.
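>
> As a rough Python sketch of that selection rule (pseudocode in the spirit
> of the paper's PUCT formula, not the actual AGZ or Leela code; c_puct and
> the field names here are made up for illustration):
>
>     import math
>
>     def select_child(node, c_puct=1.5):  # c_puct value is illustrative
>         # Total visits over all edges of this node.
>         total_n = sum(e.n for e in node.edges)
>         best, best_score = None, -float('inf')
>         for e in node.edges:
>             # Until an edge has a visit, q is whatever it was initialized to.
>             q = e.w / e.n if e.n > 0 else e.q_init
>             u = c_puct * e.p * math.sqrt(total_n) / (1 + e.n)
>             if q + u > best_score:
>                 best, best_score = e, q + u
>         return best
>
> The value used for an unvisited edge still competes directly with the Q of
> visited siblings in that comparison, which is where the initialization
> enters, even though the prior-driven U term dominates early on.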
>
>
> On Sun, Dec 3, 2017 at 5:11 PM, Andy <andy.olsen...@gmail.com> wrote:
>
>> Álvaro, you are quoting from "Expand and evaluate (Figure 2b)", but my
>> question is about the section before that, "Select (Figure 2a)". At that
>> point the node has not yet been expanded and initialized.
>>
>> As Brian Lee mentioned, his MuGo uses the parent's value, which assumes
>> that, absent further information, the child's value should be close to the
>> parent's.
>>
>> Leela Zero uses 1.1 as a "first play urgency", which assumes you should
>> prioritize getting at least one NN evaluation for each node:
>> https://github.com/gcp/leela-zero/blob/master/src/UCTNode.cpp#L323
>>
>> Finally, using a value of 0 would seem to place extra confidence in the
>> policy net priors; a sketch of the three options follows below.
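>>
>> Roughly, as a hypothetical helper (not MuGo's or Leela Zero's actual
>> code), the three choices look like this:
>>
>>     def initial_q(parent_q, strategy):
>>         # Value assigned to an edge before its first NN evaluation.
>>         if strategy == 'zero':    # AGZ paper: Q(s,a) = 0
>>             return 0.0
>>         if strategy == 'fpu':     # Leela Zero: an urgency above any real eval
>>             return 1.1
>>         if strategy == 'parent':  # MuGo: child about as good as its parent
>>             return parent_q
>>         raise ValueError(strategy)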
>>
>> I feel like MuGo's implementation makes sense, but I'm trying to get some
>> experimental evidence showing the impact before suggesting it to Leela's
>> author. So far my self-play tests with different settings do not show a big
>> impact, but I am changing other variables at the same time.
>>
>> - Andy
>>
>>
>>
>> 2017-12-03 14:30 GMT-06:00 Álvaro Begué <alvaro.be...@gmail.com>:
>>
>>> The text in the appendix has the answer, in a paragraph titled "Expand
>>> and evaluate (Fig. 2b)":
>>>   "[...] The leaf node is expanded and each edge (s_t, a) is
>>> initialized to {N(s_t, a) = 0, W(s_t, a) = 0, Q(s_t, a) = 0, P(s_t, a) =
>>> p_a}; [...]"
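>>>
>>> In pseudocode the expansion step amounts to something like this (the
>>> field and function names are mine, not the paper's):
>>>
>>>     def expand(leaf, policy_priors, value):
>>>         # policy_priors: {action: p_a} from the policy head;
>>>         # value: the NN evaluation of the leaf position.
>>>         leaf.edges = {a: {'N': 0, 'W': 0.0, 'Q': 0.0, 'P': p_a}
>>>                       for a, p_a in policy_priors.items()}
>>>         return value  # this is what gets backed up, not any Q of the new edges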
>>>
>>>
>>>
>>> On Sun, Dec 3, 2017 at 11:27 AM, Andy <andy.olsen...@gmail.com> wrote:
>>>
>>>> Figure 2a shows two bolded Q+U max values. The second one leads to a
>>>> leaf that doesn't exist yet, i.e. one that has not been expanded. Where do
>>>> they get that Q value from?
>>>>
>>>> The associated text doesn't clarify the situation: "Figure 2:
>>>> Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the
>>>> tree by selecting the edge with maximum action-value Q, plus an upper
>>>> confidence bound U that depends on a stored prior probability P and visit
>>>> count N for that edge (which is incremented once traversed). b The leaf
>>>> node is expanded..."
>>>>
>>>> 2017-12-03 9:44 GMT-06:00 Álvaro Begué <alvaro.be...@gmail.com>:
>>>>
>>>>> I am not sure where in the paper you think they use Q(s,a) for a node
>>>>> s that hasn't been expanded yet. Q(s,a) is a property of an edge of the
>>>>> graph. At a leaf they only use the "value" output of the neural network.
>>>>>
>>>>> If this doesn't match your understanding of the paper, please point to
>>>>> the specific paragraph that you are having trouble with.
>>>>>
>>>>> Álvaro.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Dec 3, 2017 at 9:53 AM, Andy <andy.olsen...@gmail.com> wrote:
>>>>>
>>>>>> I don't see where the AGZ paper explains what the mean action-value
>>>>>> Q(s,a) should be for an edge that hasn't been visited yet. The equation
>>>>>> for Q(s,a) has the factor 1/N(s,a) in it because it is supposed to
>>>>>> average over N(s,a) visits, but in this case N(s,a) = 0, so the formula
>>>>>> is undefined.
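>>>>>>
>>>>>> Concretely, the backup described in the paper is a running mean,
>>>>>> roughly (a sketch, not the paper's notation):
>>>>>>
>>>>>>     def backup(edge, v):
>>>>>>         # Update one edge on the visited path with the leaf evaluation v.
>>>>>>         edge['N'] += 1
>>>>>>         edge['W'] += v
>>>>>>         edge['Q'] = edge['W'] / edge['N']  # mean of evaluations so far
>>>>>>
>>>>>> Before the first backup N(s,a) = 0, so Q(s,a) is whatever it was
>>>>>> initialized to; the averaging formula alone doesn't define it.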
>>>>>>
>>>>>> Does anyone know how this is supposed to work? Or is it another
>>>>>> detail AGZ didn't spell out?
>>>>>>
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
