On Sat, Oct 30, 2010 at 11:24 PM, Álvaro Begué <[email protected]> wrote:
> On Sat, Oct 30, 2010 at 12:26 PM, Willemien <[email protected]> wrote:
>> On Sat, Oct 30, 2010 at 5:00 PM, Álvaro Begué <[email protected]> wrote:
>>> On Sat, Oct 30, 2010 at 11:35 AM, Willemien <[email protected]> wrote:
>>>> On life in 19x 19 
>>>> http://www.lifein19x19.com/forum/viewtopic.php?p=37588#p37588
>>>>
>>>> i found the statement
>>>> .
>>>> " I think there exists a proof that MCTS gives the same results as
>>>> normal minimax search when the number of playouts goes to infinity."
>>>>
>>>> Does this proof really exist and where can i find it?
>>>
>>> I haven't seen a formal proof, but it doesn't seem hard. Imagine you
>>> are in a node such that every move leads to a terminal node. Since
>>> UCB1 (or whatever other approach to the multi-armed bandit problem you
>>> are using) will pick the best move a fraction of the time that
>>> converges to 1, the average result it will return when exploring it
>>> will be the correct minimax value of the [depth one] tree. Now proceed
>>> by induction on the depth of the tree.
>>>
>>
>> That is what i don't believe, for having a  chanche of 1 for every
>> move you must be able to go back on a move (the move was not as bad as
>> tought before) and that mechanism is  i missing in MCTS
>
> This seems to indicate that you didn't understand how move selection
> works in UCB1. If you play infinitely many times, every move will be
> tried infinitely many times, but the best move will be played almost
> always. A move that didn't look as good in earlier playouts will have
> a large variance, which will eventually make it be picked.
>
> Think about it this way: At every stage you pick the move that
> maximizes average+S*standard_deviation, where S stands for "Something"
> :). If S is zero, you'll simply select that move that has the best
> average. If S is infinity, you'll pick the move that has been tried
> the least often (largest standard deviation). If S is somewhere in
> between, there is a balance. But since S is typically something like
> sqrt(log(number_of_visits)), it tends to infinity, so every move will
> be given another chance eventually. Since S grows very slowly, the
> best move will still be picked most of the time. If you work out the
> Math, you'll find that the best move is picked with probability 1
> asymptotically.

Let me try it with numbers.
at a certain point there are only 2 options
A a small winning sequence and B a sequence that looks winning but in
fact is losing.

first we test both 1000 times

A gives 1000 losses               ( average 0 standard deviation 0)
B gives  500 wins 500 losses  ( average .5   standard deviation .5)

So B will be concidered first

But when will A be reconcidered?
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

Reply via email to