On Sat, Oct 30, 2010 at 6:58 PM, Willemien <[email protected]> wrote:
> On Sat, Oct 30, 2010 at 11:24 PM, Álvaro Begué <[email protected]> wrote:
>> On Sat, Oct 30, 2010 at 12:26 PM, Willemien <[email protected]> wrote:
>>> On Sat, Oct 30, 2010 at 5:00 PM, Álvaro Begué <[email protected]> 
>>> wrote:
>>>> On Sat, Oct 30, 2010 at 11:35 AM, Willemien <[email protected]> wrote:
>>>>> On life in 19x 19 
>>>>> http://www.lifein19x19.com/forum/viewtopic.php?p=37588#p37588
>>>>>
>>>>> i found the statement
>>>>> .
>>>>> " I think there exists a proof that MCTS gives the same results as
>>>>> normal minimax search when the number of playouts goes to infinity."
>>>>>
>>>>> Does this proof really exist and where can i find it?
>>>>
>>>> I haven't seen a formal proof, but it doesn't seem hard. Imagine you
>>>> are in a node such that every move leads to a terminal node. Since
>>>> UCB1 (or whatever other approach to the multi-armed bandit problem you
>>>> are using) will pick the best move a fraction of the time that
>>>> converges to 1, the average result it will return when exploring it
>>>> will be the correct minimax value of the [depth one] tree. Now proceed
>>>> by induction on the depth of the tree.
>>>>
>>>
>>> That is what i don't believe, for having a  chanche of 1 for every
>>> move you must be able to go back on a move (the move was not as bad as
>>> tought before) and that mechanism is  i missing in MCTS
>>
>> This seems to indicate that you didn't understand how move selection
>> works in UCB1. If you play infinitely many times, every move will be
>> tried infinitely many times, but the best move will be played almost
>> always. A move that didn't look as good in earlier playouts will have
>> a large variance, which will eventually make it be picked.
>>
>> Think about it this way: At every stage you pick the move that
>> maximizes average+S*standard_deviation, where S stands for "Something"
>> :). If S is zero, you'll simply select that move that has the best
>> average. If S is infinity, you'll pick the move that has been tried
>> the least often (largest standard deviation). If S is somewhere in
>> between, there is a balance. But since S is typically something like
>> sqrt(log(number_of_visits)), it tends to infinity, so every move will
>> be given another chance eventually. Since S grows very slowly, the
>> best move will still be picked most of the time. If you work out the
>> Math, you'll find that the best move is picked with probability 1
>> asymptotically.
>
> Let me try it with numbers.
> at a certain point there are only 2 options
> A a small winning sequence and B a sequence that looks winning but in
> fact is losing.
>
> first we test both 1000 times
>
> A gives 1000 losses               ( average 0 standard deviation 0)
> B gives  500 wins 500 losses  ( average .5   standard deviation .5)
>
> So B will be concidered first
>
> But when will A be reconcidered?

Ah, the problem is that the standard deviation is not what you think
it is (and I think I might have made a similar mistake earlier in this
thread, so it's very understandable.

Here's one way to think about it. There is a hidden probability of
winning the simulations that start with move A. Initially, since we
don't know anything about this probability, we'll use a prior that is
a uniform distribution in the interval [0,1]. After 1000 losses, using
Bayes' theorem we can prove that the posterior probability is a beta
distribution with parameters alpha = 1, beta = 1001. This distribution
has expected value 1/1002 and standard deviation
sqrt(alpha*beta/((alpha+beta)^2*(alpha+beta+1))) =
sqrt(1001/1007016012) = 0.00099700847656419935698...

(Note: UCB1 actually uses something like sqrt(1/(alpha+beta-2)) as the
estimate for the standard deviation, which is correct up to
multiplication by a constant if alpha and beta are about the same.
Otherwise it overestimates the standard deviation, especially if the
frequency of wins is close to 0 or 1.)

So eventually S will become large enough that A will be chosen.
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

Reply via email to