David Silver wrote:
>A: Estimate value V* of every position in a training set, using deep  
>rollouts.
>
>B: Repeat, for each position in the training set
>       1. Run M simulations, estimate value of position (call this V)
>       2. Run another N simulations, average the value of psi(s,a) over all
>          positions and moves in games that black won (call this g)
>       3. Adjust parameters by alpha * (V* - V) * g
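
Step B for a single training position might be sketched in Python as follows. This is only an illustration under stated assumptions: `simulate` is a hypothetical callback (not something from the quoted post) that plays one simulation with the current weights and returns the outcome z (1.0 if black won, 0.0 otherwise) together with one psi(s,a) vector per move played. The function name `balancing_step` and the exact normalisation of the average in step 2 are likewise assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical simulator type: plays one game from the training position and
# returns (z, psis), where z is 1.0 if black won and 0.0 otherwise, and psis
# holds one psi(s, a) vector for every move played in that game.
Simulator = Callable[[Sequence[float]], Tuple[float, List[List[float]]]]


def balancing_step(
    theta: Sequence[float],
    v_star: float,
    simulate: Simulator,
    m: int,
    n: int,
    alpha: float,
) -> List[float]:
    """One parameter update for a single training position (step B)."""
    # B.1: estimate V as the mean outcome of m simulations.
    v = sum(simulate(theta)[0] for _ in range(m)) / m

    # B.2: accumulate psi over all moves of the simulations that black won.
    # Assumption: each game's psi vectors are averaged over its moves, and
    # the result is divided by n (all simulations, not only the won ones).
    g = [0.0] * len(theta)
    for _ in range(n):
        z, psis = simulate(theta)
        if z == 1.0 and psis:
            for psi in psis:
                for k, x in enumerate(psi):
                    g[k] += x / (n * len(psis))

    # B.3: adjust parameters by alpha * (V* - V) * g.
    step = alpha * (v_star - v)
    return [t + step * gk for t, gk in zip(theta, g)]
```

The outer loop over the training set, and the deep rollouts that produce V* in step A, are left out here.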

Thanks for the detailed explanation.
M, N and alpha are constant numbers, right?  What did you set them to?

>The feature vector is the set of patterns you use, with value 1 if a  
>pattern is matched and 0 otherwise. The simulation policy selects  
>actions in proportion to the exponentiated, weighted sum of all  
>matching patterns. For example let's say move a matches patterns 1 and  
>2, move b matches patterns 1 and 3, and move c matches patterns 2 and  
>4. Then move a would be selected with probability e^(theta1 +  
>theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +  
>theta4)). The theta values are the weights on the patterns which we  
>would like to learn. They are the log of the Elo ratings in Remi  
>Coulom's approach.
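
The three-move example above can be checked numerically with a short Python sketch. The theta values below are made up purely for illustration, and the helper name `move_probabilities` is an assumption, not something from the paper.

```python
import math
from typing import Dict, List


def move_probabilities(
    theta: Dict[int, float], matches: Dict[str, List[int]]
) -> Dict[str, float]:
    """Select each move in proportion to exp(sum of its pattern weights)."""
    scores = {
        move: math.exp(sum(theta[p] for p in patterns))
        for move, patterns in matches.items()
    }
    total = sum(scores.values())
    return {move: score / total for move, score in scores.items()}


# Made-up weights for patterns 1..4 (illustrative only).
theta = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.0}

# Move a matches patterns 1 and 2, b matches 1 and 3, c matches 2 and 4.
matches = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}

probs = move_probabilities(theta, matches)
print(probs)  # the three probabilities sum to 1
```

With these weights, `probs["a"]` equals e^(theta1 + theta2) divided by the sum of the three exponentials, as in the quoted formula.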

OK, I guess that is formula 5 in the paper.

>The only tricky part is computing the vector psi(s,a). Each component  
>of psi(s,a) corresponds to a particular pattern, and is the difference  
>between the observed feature (i.e. whether the pattern actually  
>occurred after move a in position s) and the expected feature (the  
>average value of the pattern, weighted by the probability of selecting  
>each action).
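
One possible reading of this description, reusing the three-move example: psi(s,a) is the feature vector of the move actually played minus the probability-weighted average of the feature vectors of all candidate moves. The Python sketch below follows that reading. The names `PHI` and `psi` and the all-zero weights are assumptions chosen for illustration.

```python
import math
from typing import Dict, List

# Binary feature vectors over patterns 1..4 for the three candidate moves:
# a matches patterns 1 and 2, b matches 1 and 3, c matches 2 and 4.
PHI: Dict[str, List[int]] = {
    "a": [1, 1, 0, 0],
    "b": [1, 0, 1, 0],
    "c": [0, 1, 0, 1],
}


def psi(theta: List[float], phi: Dict[str, List[int]], chosen: str) -> List[float]:
    """Observed features of the chosen move minus the expected features."""
    # Softmax selection probabilities, as in the quoted policy description.
    weights = {
        move: math.exp(sum(t * f for t, f in zip(theta, feats)))
        for move, feats in phi.items()
    }
    total = sum(weights.values())
    # Expected value of each feature under the current policy.
    expected = [
        sum(weights[move] / total * phi[move][k] for move in phi)
        for k in range(len(theta))
    ]
    return [obs - e for obs, e in zip(phi[chosen], expected)]


# With all weights zero, every move has probability 1/3, so the expected
# features are [2/3, 2/3, 1/3, 1/3] and psi for move a is approximately
# [1/3, 1/3, -1/3, -1/3].
psi_a = psi([0.0, 0.0, 0.0, 0.0], PHI, "a")
print(psi_a)
```

Under this reading, patterns that move a matches get a positive component and patterns it does not match get a negative one, and the probability-weighted sum of psi over all moves is zero.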

I still don't understand this. Is it formula 6?
Could you please give me an example like the one above?

--
Yamato
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/
