I am very confused about the new UCT-RAVE formula.
Equation 9 seems to mean:
variance_u = value_ur * (1 - value_ur) / n.
Is it wrong? If correct, why is it the variance?
I think that the variance of the UCT should be:
variance_u = value_u * (1 - value_u).
Hi Yamato,
There are two differences between your suggestion and the original
formula, so I'll try to address both:
1. Your formula gives the variance of a single simulation that is won
with probability value_u. But the more simulations you see, the more you
reduce the uncertainty in their average, so you must divide by n.
In general, the variance of a single coin-flip (with probability p of
heads) is p(1-p).
The variance of the sum of n coin-flips is np(1-p).
The variance of the average of n coin-flips is p(1-p)/n. This is what
we want!
2. The variance of the estimate is actually given by: variance_u =
true_value_u * (1 - true_value_u) / n, where true_value_u is the real
probability of winning a simulation (for the current policy), which we
could only know if we had access to an oracle. Unfortunately, we
don't, so we use the best available estimate. If we have seen a large
number of simulations, then you are right that value_u is the best
estimate. But if we have only seen a few simulations, then value_r
gives a better estimate (this is the point of RAVE!). The idea of this
approach is to form the best possible estimate of true_value_u by
combining these two estimates into value_ur. In a way this is somewhat
circular: we use the best estimate so far to compute the best new
estimate. But I don't think that is unreasonable in this case.
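
Both points can be checked with a small Python sketch. The first part
computes the three coin-flip variances exactly from the binomial
distribution. The second part plugs an estimate into p(1-p)/n. Note the
assumptions: the thread does not say how value_u and value_r are
combined, so the linear blend and its weight beta are hypothetical
stand-ins, and the numbers (p = 3/10, n = 20, value_r = 0.55,
beta = 0.8) are arbitrary illustrations, not values from the paper.

```python
from fractions import Fraction
from math import comb


def variance(values, probs):
    """Variance of a discrete random variable with the given values and probabilities."""
    mean = sum(v * q for v, q in zip(values, probs))
    return sum(q * (v - mean) ** 2 for v, q in zip(values, probs))


def coin_flip_variances(p, n):
    """Exact variances of one flip, the sum of n flips, and the average of n flips."""
    # Binomial probabilities of seeing k heads in n independent flips.
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    single = variance([0, 1], [1 - p, p])
    total = variance(range(n + 1), probs)
    average = variance([Fraction(k, n) for k in range(n + 1)], probs)
    return single, total, average


def combined_value(value_u, value_r, beta):
    """Hypothetical linear blend; beta in [0, 1] is the weight given to the RAVE estimate."""
    return (1.0 - beta) * value_u + beta * value_r


def variance_estimate(value, n):
    """Plug-in variance of the average of n simulations, using value in place of true_value_u."""
    if n <= 0:
        raise ValueError("n must be positive")
    return value * (1.0 - value) / n


# Point 1: exact check of p(1-p), np(1-p) and p(1-p)/n.
p, n = Fraction(3, 10), 20
single, total, average = coin_flip_variances(p, n)
print(single == p * (1 - p), total == n * p * (1 - p), average == p * (1 - p) / n)

# Point 2: two simulations, both won, so value_u = 1.0.
value_u, value_r, beta, n_sims = 1.0, 0.55, 0.8, 2
value_ur = combined_value(value_u, value_r, beta)
print(variance_estimate(value_u, n_sims))   # plug-in with value_u alone
print(variance_estimate(value_ur, n_sims))  # plug-in with the blended value_ur
```

With only two simulations, both won, plugging in value_u alone reports a
variance of exactly zero, which is clearly overconfident. Plugging in the
blended value_ur avoids that, which is one way to see why the formula
uses value_ur rather than value_u.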
-Dave
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/