Hi Remi,

This is strange: you do not take lost playouts into consideration. I believe there is a problem with your estimation of the gradient. Suppose for instance that you count z = +1 for a win, and z = -1 for a loss. Then you would take lost playouts into consideration. This makes me a little suspicious.

`Sorry, I should have made it clear that this assumes that we are treating black wins as z=1 and white wins as z=0. In this special case, the gradient is the average of games in which black won. But yes, more generally you need to include games won by both sides. The algorithms in the paper still cover this case - I was just trying to simplify their description to make it easy to understand the ideas.`
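To make the z=1/z=0 point concrete, here is a toy sketch (my own illustration with made-up numbers, not code from the paper): with that outcome coding, the Monte-Carlo gradient estimate (1/N) * sum_i z_i * g_i is identical to a sum over black wins only, so lost playouts drop out of the estimator automatically.

```python
# Hypothetical illustration: outcomes coded z=1 (black win) / z=0 (white win).
# g_i stands in for the per-playout sum of log-policy gradients (a scalar here
# for simplicity; in the paper it would be a vector).

outcomes = [1, 0, 1, 1, 0]             # z_i for five simulated playouts (made up)
grads    = [2.0, -3.0, 1.0, 0.5, 4.0]  # g_i: stand-ins for gradient terms

full_estimate = sum(z * g for z, g in zip(outcomes, grads)) / len(outcomes)
wins_only     = sum(g for z, g in zip(outcomes, grads) if z == 1) / len(outcomes)

# Identical: the z=0 games contribute nothing to the estimate.
assert full_estimate == wins_only
print(full_estimate)  # 0.7
```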

The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

`The gradient already compensates for the playout policy (equation 9), so in fact it would bias the gradient to sample uniformly at random! In equation 9, the gradient is taken with respect to the playout policy parameters. Using the product rule (third line), the gradient is equal to the playout policy probabilities multiplied by the sum of likelihood ratios multiplied by the simulation outcomes z. This gradient can be computed by sampling playouts instead of multiplying by the playout policy probabilities. This is also why games with outcomes of z=0 can be ignored - they don't affect this gradient computation.`
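A minimal sketch of the likelihood-ratio identity being described (my own toy construction, not the paper's code): for a one-step "playout" with a two-action softmax policy, the gradient of the expected outcome, E[z * d log pi(a)/d theta], can be estimated by sampling actions from the policy itself, with no explicit multiplication by the policy probabilities.

```python
import math
import random

random.seed(0)
theta = 0.3  # single policy parameter (assumed value for the demo)

def pi_probs(theta):
    """Softmax over two actions with logits (theta, 0)."""
    e = math.exp(theta)
    p = e / (e + 1.0)
    return [p, 1.0 - p]

def dlogpi(theta, a):
    """d log pi(a) / d theta for the softmax above."""
    p = pi_probs(theta)[0]
    return (1.0 - p) if a == 0 else -p

outcome = [1.0, 0.0]  # z: action 0 "wins", action 1 "loses" (made up)
probs = pi_probs(theta)

# Monte-Carlo estimate: sample playouts from the policy and accumulate
# z * dlog pi, exactly the likelihood-ratio form of the gradient.
N = 200_000
est = sum(
    outcome[a] * dlogpi(theta, a)
    for a in (0 if random.random() < probs[0] else 1 for _ in range(N))
) / N

# Exact value of the same expectation, summing over actions explicitly.
exact = sum(probs[a] * outcome[a] * dlogpi(theta, a) for a in range(2))
print(round(exact, 4))  # the sampled estimate agrees closely
```

Note that the z=0 action contributes nothing to either sum, which matches the point about ignoring lost games under this outcome coding.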

More precisely: you should estimate the value of N playouts as Sum p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum p_i z_i / Sum p_i. This would be better. Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?

`I think a similar point applies here. We care about the expected value of the playout policy, which can be sampled directly from playouts, instead of multiplying by the policy probabilities. You would only need importance sampling if you were using a different playout policy to the one which you are evaluating. But I guess I'm not seeing any good reason to do this?`
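For completeness, here is a toy sketch (my own, with assumed probabilities) of the off-policy case mentioned above: if the playouts were generated by a different behaviour policy mu, reweighting each outcome by pi/mu recovers an unbiased estimate of the target policy's expected outcome. On-policy, the weight is identically 1, which is why no correction is needed in the setting discussed here.

```python
import random

random.seed(1)

pi_probs = [0.7, 0.3]  # playout policy we want to evaluate (assumed numbers)
mu_probs = [0.5, 0.5]  # behaviour policy that actually generated the playouts
outcome  = [1.0, 0.0]  # z: action 0 wins, action 1 loses (made up)

# Importance-sampled estimate of E_pi[z] from samples drawn under mu.
N = 100_000
est = 0.0
for _ in range(N):
    a = 0 if random.random() < mu_probs[0] else 1     # sample from mu
    est += (pi_probs[a] / mu_probs[a]) * outcome[a]   # reweight by pi/mu
est /= N

exact = sum(p * z for p, z in zip(pi_probs, outcome))  # E_pi[z] = 0.7
print(round(est, 2))  # close to 0.7
```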

-Dave

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/