David Silver wrote:
> Sorry, I should have made it clear that this assumes that we are
> treating black wins as z=1 and white wins as z=0.
> In this special case, the gradient is the average of games in which
> black won.
> But yes, more generally you need to include games won by both sides.
> The algorithms in the paper still cover this case - I was just trying
> to simplify their description to make it easy to understand the ideas.

I understood this. What I find strange is that using -1/1 should be
equivalent to using 0/1, but your algorithm behaves differently: it
ignores lost games with 0/1, and uses them with -1/1.
Imagine you add a big constant to z. One million, say. This does not
change the problem. You get either 1000000 or 1000001 as outcome of a
playout. But then, your estimate of the gradient becomes complete noise.
So maybe using -1/1 is better than 0/1? Since your algorithm depends so
much on the definition of the reward, there must be an optimal way to
set the reward. Or there must be a better way to define an algorithm that
would not depend on an offset in the reward.
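
To make the offset argument concrete, here is a minimal sketch on a toy
problem that is not taken from the paper. It assumes a one-parameter
policy that picks move A with probability sigmoid(theta), hypothetical
win rates of 0.7 for A and 0.4 for B, and the usual score-function
estimate (z + offset) * d/dtheta log pi(move) from a single playout.
The function name and all numbers are made up for illustration. The
mean of the estimate does not depend on the offset, but its variance
does:

```python
import math


def gradient_stats(offset, theta=0.0, win_a=0.7, win_b=0.4):
    """Exact mean and variance of the one-playout gradient estimate
    (z + offset) * d/dtheta log pi(move) on a hypothetical two-move toy."""
    p = 1.0 / (1.0 + math.exp(-theta))  # probability of picking move A
    # One tuple per outcome: (probability, reward z, score d/dtheta log pi)
    outcomes = [
        (p * win_a, 1.0, 1.0 - p),                  # A played, win
        (p * (1.0 - win_a), 0.0, 1.0 - p),          # A played, loss
        ((1.0 - p) * win_b, 1.0, -p),               # B played, win
        ((1.0 - p) * (1.0 - win_b), 0.0, -p),       # B played, loss
    ]
    mean = sum(pr * (z + offset) * s for pr, z, s in outcomes)
    second_moment = sum(pr * ((z + offset) * s) ** 2 for pr, z, s in outcomes)
    return mean, second_moment - mean ** 2


# offset 0 is the 0/1 reward; offset -0.5 is the -1/1 reward up to a
# factor of 2; offset 1e6 is the "add one million" case.
for offset in (0.0, -0.5, 1e6):
    mean, var = gradient_stats(offset)
    print(f"offset={offset:>10}: mean={mean:.6f}  std={math.sqrt(var):.4g}")
```

In this toy the expected gradient is the same for all three offsets,
while the standard deviation of a single-playout estimate grows roughly
in proportion to the offset once the offset is large.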

> The gradient already compensates for the playout policy (equation 9),
> so in fact it would bias the gradient to sample uniformly at random!

Yes, you are right.
There is still something wrong that I don't understand. There may be a
way to quantify the amount of noise in the unbiased gradient estimate,
and it would depend on the average reward. Probably setting the average
reward to zero is what would minimize noise in the gradient estimate.
This is just an intuitive guess.
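
A rough way to check this guess, again on a hypothetical toy and not
on the algorithms of the paper: assume a policy that picks move A with
probability p, made-up win rates of 0.7 and 0.4, and the estimate
(z - b) * d/dtheta log pi(move), where b is a constant subtracted from
the reward. For a single parameter, setting the derivative of the
variance with respect to b to zero gives b = E[z s^2] / E[s^2], with s
the score d/dtheta log pi. The sketch compares that value with the
average reward and with a brute-force scan over b:

```python
def outcomes(p, win_a=0.7, win_b=0.4):
    """(probability, reward z, score s) for each outcome of one playout."""
    return [
        (p * win_a, 1.0, 1.0 - p),
        (p * (1.0 - win_a), 0.0, 1.0 - p),
        ((1.0 - p) * win_b, 1.0, -p),
        ((1.0 - p) * (1.0 - win_b), 0.0, -p),
    ]


def variance(b, p):
    """Exact variance of (z - b) * s for the toy above."""
    mean = sum(pr * (z - b) * s for pr, z, s in outcomes(p))
    second = sum(pr * ((z - b) * s) ** 2 for pr, z, s in outcomes(p))
    return second - mean ** 2


def mean_reward(p):
    return sum(pr * z for pr, z, s in outcomes(p))


def weighted_baseline(p):
    """Minimizer of the variance over b: E[z s^2] / E[s^2]."""
    num = sum(pr * z * s * s for pr, z, s in outcomes(p))
    den = sum(pr * s * s for pr, z, s in outcomes(p))
    return num / den


grid = [i / 100 for i in range(101)]  # candidate values of b in [0, 1]
for p in (0.5, 0.8):
    best = min(grid, key=lambda b: variance(b, p))
    print(f"p={p}: mean reward={mean_reward(p):.3f}  "
          f"weighted={weighted_baseline(p):.3f}  scan minimum={best:.2f}")
```

In this toy, subtracting the average reward is exactly optimal when
p = 0.5 (the squared score is then the same for both moves) and close
to, but not exactly, optimal when the policy is lopsided.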
Rémi
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/