Hi Yamato,

Could you give us the source code which you used?  Your algorithm is
too complicated, so it would be very helpful if possible.

Actually, I think the source code would be much harder to understand! It is written inside RLGO, and makes use of a substantial existing framework that would take a lot of effort to understand. (On a separate note, I am considering making RLGO open source at some point, but I'd prefer to spend some effort cleaning it up before making it public.)

But I think maybe Algorithm 1 is much easier than you think:

A: Estimate value V* of every position in a training set, using deep rollouts.

B: Repeat, for each position in the training set:
        1. Run M simulations, estimate value of position (call this V)
        2. Run another N simulations, average the value of psi(s,a) over all positions and moves in games that black won (call this g)
        3. Adjust parameters by alpha * (V* - V) * g
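
To make steps B.1 to B.3 concrete, here is a rough Python sketch of one sweep over the training set. It is not the RLGO code. The simulate callback, the dict-based theta and the function name balancing_pass are all made up for illustration. It also assumes that g is normalised by the number of moves in the games black won, which is only one reading of step 2, and that V* from step A is already stored alongside each position.

```python
def balancing_pass(theta, training_set, simulate, M, N, alpha):
    """One sweep of step B over the training set (illustrative sketch only).

    theta        : dict mapping pattern id -> weight, updated in place
    training_set : list of (position, v_star) pairs, v_star coming from step A
    simulate     : hypothetical callback, simulate(position, theta) returns
                   (black_won, psis), where psis holds one dict
                   (pattern id -> psi component) per move of the simulation
    """
    for position, v_star in training_set:
        # B.1: estimate V as the fraction of M simulations that black wins.
        wins = sum(1 for _ in range(M) if simulate(position, theta)[0])
        v = wins / M

        # B.2: average psi(s,a) over all moves in the games that black won.
        g = {p: 0.0 for p in theta}
        moves_counted = 0
        for _ in range(N):
            black_won, psis = simulate(position, theta)
            if not black_won:
                continue
            for psi in psis:
                moves_counted += 1
                for p, x in psi.items():
                    g[p] += x
        if moves_counted:
            g = {p: x / moves_counted for p, x in g.items()}

        # B.3: adjust parameters by alpha * (V* - V) * g.
        for p in theta:
            theta[p] += alpha * (v_star - v) * g[p]
    return theta
```

If V is above V*, the factor (V* - V) is negative, so patterns that show up more than expected in black's wins have their weights pushed down, and vice versa.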

The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability

        e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)).

The theta values are the weights on the patterns which we would like to learn. They correspond to the logs of the gamma (strength) values that Remi Coulom's approach reports as Elo ratings.
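
The a/b/c example can be checked with a few lines of Python. This is just a sketch: the weights in theta are made-up numbers, and the function name move_probabilities is mine, not something from RLGO.

```python
import math

def move_probabilities(theta, move_patterns):
    """Select moves in proportion to exp(sum of weights of matching patterns)."""
    scores = {m: sum(theta[p] for p in pats) for m, pats in move_patterns.items()}
    top = max(scores.values())  # subtract the max so exp() cannot overflow
    exps = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}

# Made-up weights for patterns 1..4 (illustration only).
theta = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.0}
# Move a matches patterns 1 and 2, move b matches 1 and 3, move c matches 2 and 4.
move_patterns = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}

probs = move_probabilities(theta, move_patterns)
```

Here probs["a"] equals e^(theta1 + theta2) divided by the sum of the three exponentials, as in the formula for move a.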

The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action).
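
In code, the observed-minus-expected calculation might look like the sketch below. Again the names (policy, psi, the dict layout) are my own for illustration. It reuses the a/b/c example, sets all weights to zero so that each move has probability 1/3, and ignores the colour to play.

```python
import math

def policy(theta, move_patterns):
    """Softmax policy over the summed weights of each move's matching patterns."""
    scores = {m: sum(theta[p] for p in pats) for m, pats in move_patterns.items()}
    top = max(scores.values())
    exps = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}

def psi(theta, move_patterns, chosen):
    """Observed feature of the chosen move minus the expected feature."""
    probs = policy(theta, move_patterns)
    # Expected feature: each pattern's 0/1 value averaged over moves,
    # weighted by the probability of selecting each move.
    expected = {p: 0.0 for p in theta}
    for m, pats in move_patterns.items():
        for p in pats:
            expected[p] += probs[m]
    # Observed feature: 1 if the chosen move matches the pattern, else 0.
    observed = {p: 1.0 if p in move_patterns[chosen] else 0.0 for p in theta}
    return {p: observed[p] - expected[p] for p in theta}

theta = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}  # all-zero weights: uniform policy
move_patterns = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}
psi_a = psi(theta, move_patterns, "a")
```

With these weights, patterns 1 and 2 are each expected with probability 2/3, so choosing move a gives +1/3 for patterns 1 and 2 and -1/3 for patterns 3 and 4. A useful sanity check is that psi averages to zero when weighted by the policy's own move probabilities.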

It's also very important to be careful about signs and the colour to play - it's easy to make a mistake and follow the gradient in the wrong direction.

Is that any clearer?
-Dave

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
