David Silver wrote:
>There are two differences between your suggestion and the original
>formula, so I'll try to address both:
>
>1. Your formula gives the variance of a single simulation that wins
>with probability value_u. But the more simulations you see, the more
>you reduce the uncertainty, so you must divide by n.
>
>In general, the variance of a single coin-flip (with probability p of  
>heads) is p(1-p).
>The variance of the sum of n coin-flips is np(1-p).
>The variance of the average of n coin-flips is p(1-p)/n. This is what  
>we want!
>
>2. The variance of the estimate is actually given by: variance_u =  
>true_value_u * (1 - true_value_u) / n, where true_value_u is the real  
>probability of winning a simulation (for the current policy), if we  
>had access to an oracle. Unfortunately, we don't - so we use the best  
>available estimate. If we have seen a large number of simulations,  
>then you are right that value_u is the best estimate. But if we have  
>only seen a few simulations, then value_r gives a better estimate  
>(this is the point of RAVE!) The whole point of this approach is to
>form the best possible estimate of true_value_u, by combining these  
>two estimates together. In a way this is somewhat circular: we use the  
>best estimate so far to compute the best new estimate. But I don't  
>think that is unreasonable in this case.

Thanks for the detailed explanation. The formula is clear to me now.
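
To check my understanding, here is a minimal Python sketch of the two points above. The function names are my own and purely illustrative, not taken from any program. It compares p(1-p)/n against the variance measured over many simulated batches of coin-flips, and then plugs an estimate into the same formula in place of the unknown true value. How value_u and value_r are combined into that estimate is not shown here; the 0.6 below is just an assumed number.

```python
import random


def variance_of_mean(p, n):
    """Variance of the average of n coin-flips with heads probability p."""
    return p * (1.0 - p) / n


def empirical_variance_of_mean(p, n, trials=20000, seed=0):
    """Measure the variance of the average of n flips over many batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        heads = sum(1 for _ in range(n) if rng.random() < p)
        means.append(heads / n)
    mu = sum(means) / trials
    return sum((m - mu) ** 2 for m in means) / trials


if __name__ == "__main__":
    p, n = 0.5, 10
    print("formula  :", variance_of_mean(p, n))
    print("simulated:", empirical_variance_of_mean(p, n))

    # Without an oracle, the true value is replaced by the best available
    # estimate. The value 0.6 is an assumption for illustration only.
    best_estimate = 0.6
    print("plug-in  :", variance_of_mean(best_estimate, n))
```

The simulated value should come out close to the formula, and the plug-in value differs only slightly from the true one when the estimate is reasonable, which is why using the best estimate so far seems acceptable.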

--
Yamato
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
