I am very confused about the new UCT-RAVE formula.
Equation 9 seems to mean:

variance_u = value_ur * (1 - value_ur) / n.

Is that wrong? If it is correct, why is it the variance?
I think that the variance of the UCT value should be:

variance_u = value_u * (1 - value_u).

Hi Yamato,

There are two differences between your suggestion and the original formula, so I'll try to address both:

1. Your formula gives the variance of a single simulation that is won with probability value_u. But value_u is an average over n simulations, and the more simulations you see, the more you reduce the uncertainty in that average, so you must divide by n.

In general, the variance of a single coin-flip (with probability p of heads) is p(1-p).
The variance of the sum of n coin-flips is np(1-p).
The variance of the average of n coin-flips is p(1-p)/n. This is what we want!
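The three coin-flip variances above can be checked exactly rather than taken on trust. The following is a minimal sketch that sums over the binomial distribution; the function names and the example values n = 10, p = 0.3 are illustrative assumptions, not anything from the paper.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def variance_of_sum(n, p):
    """Exact variance of the number of heads in n flips."""
    mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    return sum((k - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))


def variance_of_average(n, p):
    """Exact variance of the fraction of heads in n flips."""
    mean = sum((k / n) * binomial_pmf(k, n, p) for k in range(n + 1))
    return sum((k / n - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))


if __name__ == "__main__":
    n, p = 10, 0.3  # illustrative values only
    print("single flip:", variance_of_sum(1, p), "vs p(1-p) =", p * (1 - p))
    print("sum of n:", variance_of_sum(n, p), "vs np(1-p) =", n * p * (1 - p))
    print("average of n:", variance_of_average(n, p), "vs p(1-p)/n =", p * (1 - p) / n)
```

Each printed pair should agree up to floating-point rounding, and the last one is the p(1-p)/n quantity that the formula needs.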

2. The variance of the estimate is actually given by: variance_u = true_value_u * (1 - true_value_u) / n, where true_value_u is the real probability of winning a simulation (for the current policy), which we could only know if we had access to an oracle. Unfortunately, we don't - so we use the best available estimate in its place. If we have seen a large number of simulations, then you are right that value_u is the best estimate. But if we have only seen a few simulations, then value_r gives a better estimate (this is the point of RAVE!) The whole point of this approach is to form the best possible estimate of true_value_u by combining these two estimates together, which is why the combined value_ur appears in the formula instead of value_u. In a way this is somewhat circular: we use the best estimate so far to compute the best new estimate. But I don't think that is unreasonable in this case.
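To make the second point concrete, here is a minimal sketch of the plug-in idea: since true_value_u is unknown, some estimate is substituted for it. The function name plugin_variance and the numbers used (2 wins out of 2 simulations, a combined estimate of 0.6) are hypothetical and chosen only for illustration; how value_ur is actually computed from value_u and value_r is not shown here.

```python
def plugin_variance(value_estimate, n):
    """Variance of an average of n win/loss simulations, with an estimate
    substituted for the unknown true win probability."""
    if n <= 0:
        raise ValueError("n must be a positive number of simulations")
    if not 0.0 <= value_estimate <= 1.0:
        raise ValueError("value_estimate must be a probability in [0, 1]")
    return value_estimate * (1.0 - value_estimate) / n


if __name__ == "__main__":
    n = 2            # hypothetical: only two simulations seen so far
    value_u = 2 / 2  # both were wins, so the raw UCT estimate is 1.0
    value_ur = 0.6   # hypothetical combined UCT-RAVE estimate

    # Plugging in value_u claims zero uncertainty after just two simulations.
    print("using value_u: ", plugin_variance(value_u, n))
    # Plugging in the combined estimate keeps the variance away from zero.
    print("using value_ur:", plugin_variance(value_ur, n))
```

With so few simulations, value_u alone gives a variance of exactly 0, which is clearly overconfident. That is one way to see why a better estimate of true_value_u matters most when n is small.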

-Dave
