David Silver wrote:
>There are two differences between your suggestion and the original
>formula, so I'll try and address both:
>
>1. Your formula gives the variance of a single simulation, with
>probability value_u. But the more simulations you see, the more you
>reduce the uncertainty, so you must divide by n.
>
>In general, the variance of a single coin-flip (with probability p of
>heads) is p(1-p).
>The variance of the sum of n coin-flips is np(1-p).
>The variance of the average of n coin-flips is p(1-p)/n. This is what
>we want!
>
>2. The variance of the estimate is actually given by: variance_u =
>true_value_u * (1 - true_value_u) / n, where true_value_u is the real
>probability of winning a simulation (for the current policy), if we
>had access to an oracle. Unfortunately, we don't - so we use the best
>available estimate. If we have seen a large number of simulations,
>then you are right that value_u is the best estimate. But if we have
>only seen a few simulations, then value_r gives a better estimate
>(this is the point of RAVE!) The whole point of this approach is to
>form the best possible estimate of true_value_u, by combining these
>two estimates together. In a way this is somewhat circular: we use the
>best estimate so far to compute the best new estimate. But I don't
>think that is unreasonable in this case.
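
For anyone who wants to check the p(1-p)/n claim numerically, here is a minimal sketch. It is only an illustration of the coin-flip argument quoted above, not code from any actual program; the function names, the seed and the trial count are my own assumptions.

```python
import random


def variance_of_mean(p, n):
    """Variance of the average of n independent coin flips, each with
    probability p of heads: p(1-p)/n, as in the quoted explanation.

    When the true probability is unknown, p would be replaced by the
    best available estimate (a plug-in value)."""
    return p * (1.0 - p) / n


def empirical_variance_of_mean(p, n, trials=20000, seed=0):
    """Estimate the same quantity by simulation (hypothetical helper).

    Runs `trials` batches of n flips, records the win rate of each
    batch, and returns the variance of those win rates."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        wins = sum(1 for _ in range(n) if rng.random() < p)
        means.append(wins / n)
    avg = sum(means) / trials
    return sum((m - avg) ** 2 for m in means) / trials


if __name__ == "__main__":
    p, n = 0.6, 50
    print("formula:   ", variance_of_mean(p, n))
    print("simulation:", empirical_variance_of_mean(p, n))
```

With p = 0.6 and n = 50 the formula gives 0.6 * 0.4 / 50 = 0.0048, and the simulated value should land close to that. Doubling n halves the variance, which is the "divide by n" in point 1.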
Thanks for the detailed explanation. The formula became clear to me.

--
Yamato
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/