On Sat, Oct 30, 2010 at 6:58 PM, Willemien <[email protected]> wrote: > On Sat, Oct 30, 2010 at 11:24 PM, Álvaro Begué <[email protected]> wrote: >> On Sat, Oct 30, 2010 at 12:26 PM, Willemien <[email protected]> wrote: >>> On Sat, Oct 30, 2010 at 5:00 PM, Álvaro Begué <[email protected]> >>> wrote: >>>> On Sat, Oct 30, 2010 at 11:35 AM, Willemien <[email protected]> wrote: >>>>> On life in 19x 19 >>>>> http://www.lifein19x19.com/forum/viewtopic.php?p=37588#p37588 >>>>> >>>>> i found the statement >>>>> . >>>>> " I think there exists a proof that MCTS gives the same results as >>>>> normal minimax search when the number of playouts goes to infinity." >>>>> >>>>> Does this proof really exist and where can i find it? >>>> >>>> I haven't seen a formal proof, but it doesn't seem hard. Imagine you >>>> are in a node such that every move leads to a terminal node. Since >>>> UCB1 (or whatever other approach to the multi-armed bandit problem you >>>> are using) will pick the best move a fraction of the time that >>>> converges to 1, the average result it will return when exploring it >>>> will be the correct minimax value of the [depth one] tree. Now proceed >>>> by induction on the depth of the tree. >>>> >>> >>> That is what i don't believe, for having a chanche of 1 for every >>> move you must be able to go back on a move (the move was not as bad as >>> tought before) and that mechanism is i missing in MCTS >> >> This seems to indicate that you didn't understand how move selection >> works in UCB1. If you play infinitely many times, every move will be >> tried infinitely many times, but the best move will be played almost >> always. A move that didn't look as good in earlier playouts will have >> a large variance, which will eventually make it be picked. >> >> Think about it this way: At every stage you pick the move that >> maximizes average+S*standard_deviation, where S stands for "Something" >> :). If S is zero, you'll simply select that move that has the best >> average. If S is infinity, you'll pick the move that has been tried >> the least often (largest standard deviation). If S is somewhere in >> between, there is a balance. But since S is typically something like >> sqrt(log(number_of_visits)), it tends to infinity, so every move will >> be given another chance eventually. Since S grows very slowly, the >> best move will still be picked most of the time. If you work out the >> Math, you'll find that the best move is picked with probability 1 >> asymptotically. > > Let me try it with numbers. > at a certain point there are only 2 options > A a small winning sequence and B a sequence that looks winning but in > fact is losing. > > first we test both 1000 times > > A gives 1000 losses ( average 0 standard deviation 0) > B gives 500 wins 500 losses ( average .5 standard deviation .5) > > So B will be concidered first > > But when will A be reconcidered?
Ah, the problem is that the standard deviation is not what you think it is (and I think I might have made a similar mistake earlier in this thread, so it's very understandable. Here's one way to think about it. There is a hidden probability of winning the simulations that start with move A. Initially, since we don't know anything about this probability, we'll use a prior that is a uniform distribution in the interval [0,1]. After 1000 losses, using Bayes' theorem we can prove that the posterior probability is a beta distribution with parameters alpha = 1, beta = 1001. This distribution has expected value 1/1002 and standard deviation sqrt(alpha*beta/((alpha+beta)^2*(alpha+beta+1))) = sqrt(1001/1007016012) = 0.00099700847656419935698... (Note: UCB1 actually uses something like sqrt(1/(alpha+beta-2)) as the estimate for the standard deviation, which is correct up to multiplication by a constant if alpha and beta are about the same. Otherwise it overestimates the standard deviation, especially if the frequency of wins is close to 0 or 1.) So eventually S will become large enough that A will be chosen. _______________________________________________ Computer-go mailing list [email protected] http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
