David Silver wrote:
>A: Estimate value V* of every position in a training set, using deep
>rollouts.
>
>B: Repeat, for each position in the training set
>  1. Run M simulations, estimate value of position (call this V)
>  2. Run another N simulations, average the value of psi(s,a) over all
>positions and moves in games that black won (call this g)
>  3. Adjust parameters by alpha * (V* - V) * g

Thanks for the detailed explanation.
M, N and alpha are constant numbers, right? What did you set them to?

>The feature vector is the set of patterns you use, with value 1 if a
>pattern is matched and 0 otherwise. The simulation policy selects
>actions in proportion to the exponentiated, weighted sum of all
>matching patterns. For example let's say move a matches patterns 1 and
>2, move b matches patterns 1 and 3, and move c matches patterns 2 and
>4. Then move a would be selected with probability e^(theta1 +
>theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +
>theta4)). The theta values are the weights on the patterns which we
>would like to learn. They are the log of the Elo ratings in Remi
>Coulom's approach.

OK, I guess that is formula 5 in the paper.

>The only tricky part is computing the vector psi(s,a). Each component
>of psi(s,a) corresponds to a particular pattern, and is the difference
>between the observed feature (i.e. whether the pattern actually
>occurred after move a in position s) and the expected feature (the
>average value of the pattern, weighted by the probability of selecting
>each action).

I still don't understand this. Is it formula 6?
Could you please give me an example like the one above?

--
Yamato
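
For illustration only, here is a minimal Python sketch of the quoted a/b/c example, carried one step further to psi(s,a). It follows the wording of the quoted description (observed feature minus the policy-weighted expected feature); whether that matches formula 6 of the paper exactly is an assumption. The names FEATURES, policy, psi and update, and the theta values, are hypothetical and are not taken from the paper or from any program.

```python
import math

# Which patterns each move matches, as in the quoted example.
# Index 0..3 stands for patterns 1..4.
FEATURES = {
    "a": [1, 1, 0, 0],  # move a matches patterns 1 and 2
    "b": [1, 0, 1, 0],  # move b matches patterns 1 and 3
    "c": [0, 1, 0, 1],  # move c matches patterns 2 and 4
}

def policy(theta, features=FEATURES):
    """Select each move in proportion to exp(sum of weights of its matching patterns)."""
    scores = {
        move: math.exp(sum(t * f for t, f in zip(theta, phi)))
        for move, phi in features.items()
    }
    total = sum(scores.values())
    return {move: s / total for move, s in scores.items()}

def psi(move, theta, features=FEATURES):
    """Observed feature vector of `move` minus the expected feature vector under the policy."""
    pi = policy(theta, features)
    n = len(theta)
    expected = [sum(pi[m] * features[m][i] for m in features) for i in range(n)]
    return [features[move][i] - expected[i] for i in range(n)]

def update(theta, alpha, v_star, v, g):
    """Step 3 of the quoted procedure: adjust parameters by alpha * (V* - V) * g."""
    return [t + alpha * (v_star - v) * gi for t, gi in zip(theta, g)]

# Hypothetical weights: all zero, so every move is equally likely.
theta = [0.0, 0.0, 0.0, 0.0]
print(policy(theta))
print(psi("a", theta))
```

With all weights at zero each move has probability 1/3, so the expected feature vector is (2/3, 2/3, 1/3, 1/3): patterns 1 and 2 are each matched by two of the three moves, patterns 3 and 4 by one. If move a is the one actually played, the observed vector is (1, 1, 0, 0), and psi(s,a) comes out at roughly (1/3, 1/3, -1/3, -1/3). In words: the patterns that a matched get a positive component, and the patterns that only the other moves matched get a negative one.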
