On Sat, Oct 30, 2010 at 11:24 PM, Álvaro Begué <[email protected]> wrote: > On Sat, Oct 30, 2010 at 12:26 PM, Willemien <[email protected]> wrote: >> On Sat, Oct 30, 2010 at 5:00 PM, Álvaro Begué <[email protected]> wrote: >>> On Sat, Oct 30, 2010 at 11:35 AM, Willemien <[email protected]> wrote: >>>> On life in 19x 19 >>>> http://www.lifein19x19.com/forum/viewtopic.php?p=37588#p37588 >>>> >>>> i found the statement >>>> . >>>> " I think there exists a proof that MCTS gives the same results as >>>> normal minimax search when the number of playouts goes to infinity." >>>> >>>> Does this proof really exist and where can i find it? >>> >>> I haven't seen a formal proof, but it doesn't seem hard. Imagine you >>> are in a node such that every move leads to a terminal node. Since >>> UCB1 (or whatever other approach to the multi-armed bandit problem you >>> are using) will pick the best move a fraction of the time that >>> converges to 1, the average result it will return when exploring it >>> will be the correct minimax value of the [depth one] tree. Now proceed >>> by induction on the depth of the tree. >>> >> >> That is what i don't believe, for having a chanche of 1 for every >> move you must be able to go back on a move (the move was not as bad as >> tought before) and that mechanism is i missing in MCTS > > This seems to indicate that you didn't understand how move selection > works in UCB1. If you play infinitely many times, every move will be > tried infinitely many times, but the best move will be played almost > always. A move that didn't look as good in earlier playouts will have > a large variance, which will eventually make it be picked. > > Think about it this way: At every stage you pick the move that > maximizes average+S*standard_deviation, where S stands for "Something" > :). If S is zero, you'll simply select that move that has the best > average. If S is infinity, you'll pick the move that has been tried > the least often (largest standard deviation). If S is somewhere in > between, there is a balance. But since S is typically something like > sqrt(log(number_of_visits)), it tends to infinity, so every move will > be given another chance eventually. Since S grows very slowly, the > best move will still be picked most of the time. If you work out the > Math, you'll find that the best move is picked with probability 1 > asymptotically.
Let me try it with numbers. at a certain point there are only 2 options A a small winning sequence and B a sequence that looks winning but in fact is losing. first we test both 1000 times A gives 1000 losses ( average 0 standard deviation 0) B gives 500 wins 500 losses ( average .5 standard deviation .5) So B will be concidered first But when will A be reconcidered? _______________________________________________ Computer-go mailing list [email protected] http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
