Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver Thu, 30 Apr 2009 12:57:31 -0700

Hi Remi,

> I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1.
> Imagine you add a big constant to z. One million, say. This does not change the problem. You get either 1000000 or 1000001 as the outcome of a playout. But then, your estimate of the gradient becomes complete noise.
> So maybe using -1/1 is better than 0/1? Since your algorithm depends so much on the definition of the reward, there must be an optimal way to set the reward. Or there must be a better way to define an algorithm that would not depend on an offset in the reward.
> There is still something wrong that I don't understand. There may be a way to quantify the amount of noise in the unbiased gradient estimate, and it would depend on the average reward. Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.

Okay, now I understand your point :-) It's a good question - and I think you're right. In REINFORCE any baseline can be subtracted from the reward, without affecting the expected gradient, but possibly reducing its variance. The baseline leading to the best estimate is indeed the average reward. So it should be the case that {-1,+1} would estimate the gradient g more efficiently than {0,1}, assuming that we see similar numbers of black wins as white wins across the training set.
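As a rough illustration of the baseline argument, here is a minimal Python sketch on a toy problem, not the simulation balancing setup itself. It assumes a hypothetical two-action logistic policy with made-up win probabilities (0.4 and 0.7), and computes the exact mean and variance of the one-sample estimate (z - b) * d/dtheta log pi(a) by enumerating all outcomes. The names `gradient_stats` and `average_reward` are invented for this example.

```python
import math

def average_reward(theta, win_prob=(0.4, 0.7)):
    """Expected 0/1 reward under the toy policy (win_prob values are made up)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p) * win_prob[0] + p * win_prob[1]

def gradient_stats(theta, baseline, offset=0.0, win_prob=(0.4, 0.7)):
    """Exact mean and variance of (z - baseline) * d/dtheta log pi(action).

    The policy picks action 1 with probability sigmoid(theta); the reward is
    z = offset + 1 for a win and z = offset for a loss.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    mean = 0.0
    second_moment = 0.0
    for action, p_action in ((0, 1.0 - p), (1, p)):
        score = action - p  # d/dtheta log pi(action) for this logistic policy
        for win, p_win in ((0, 1.0 - win_prob[action]), (1, win_prob[action])):
            z = win + offset
            g = (z - baseline) * score
            prob = p_action * p_win
            mean += prob * g
            second_moment += prob * g * g
    return mean, second_moment - mean * mean

theta = 0.0
cases = [
    ("z in {0,1}, no baseline", 0.0, 0.0),
    ("z in {0,1}, b = average reward", average_reward(theta), 0.0),
    ("z shifted by one million, no baseline", 0.0, 1e6),
]
for label, b, offset in cases:
    mean, var = gradient_stats(theta, b, offset)
    print(f"{label}: mean={mean:.6f} variance={var:.6g}")
```

In this toy case all three variants have the same expected gradient, while the variance is smallest with the average-reward baseline and enormous with the shifted reward, which matches the "complete noise" intuition above.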

So to answer your question, we can safely modify the algorithm to use (z-b) instead of z, where b is the average reward. This would then make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of step-size). I don't think this would have affected the results we presented (because all of the learning algorithms converged anyway, at least approximately, during training) but it could be an important modification for larger boards.
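The equivalence can be checked with a small Python sketch. It assumes a hypothetical batch of playout outcomes and estimates b as the batch mean. The helper `centred_rewards` is invented for this example, and the score-function factor of the update is left out because it is identical under both encodings.

```python
def centred_rewards(outcomes, win_value, loss_value):
    """Return z - b for each playout, with b estimated as the batch mean reward."""
    rewards = [win_value if won else loss_value for won in outcomes]
    baseline = sum(rewards) / len(rewards)
    return [z - baseline for z in rewards]

# Hypothetical playout results: True means a win.
outcomes = [True, False, True, True, False, False, True, True]

zero_one = centred_rewards(outcomes, 1.0, 0.0)     # rewards in {0, 1}
plus_minus = centred_rewards(outcomes, 1.0, -1.0)  # rewards in {-1, +1}

step_size = 0.1  # arbitrary value for illustration
updates_zero_one = [step_size * c for c in zero_one]
updates_plus_minus = [(step_size / 2.0) * c for c in plus_minus]  # halved step-size

for u01, upm in zip(updates_zero_one, updates_plus_minus):
    print(u01, upm)
```

Since z' = 2z - 1 implies b' = 2b - 1, the centred reward z' - b' is exactly 2(z - b), so halving the step-size for {-1,+1} gives the same updates as {0,1}.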


-Dave

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
