>I'm curious to find out what is meant by "lazy".  If, as I am led to
>believe by your report, Monte Carlo strategies applied to Double Step
>Races are "lazy", yet they converge to perfect play, then I'm not sure
>why we are meant to worry.

We worry because the 7 facts below combine to propagate bad decisions
by playouts into bad moves.

1), UCT's convergence guarantee is probabilistic. That is, given
an epsilon, the probability of having an optimal move will eventually
be less than epsilon.

2) the convergence guarantee of UCT is very weak. UCT guarantees
that if a node has T trials then every move will receive O(log T) trials.
The implication is that the worst case number of trials for achieving a
given epsilon is O(exp(-epsilon)). And that is just for a *single* node.
If you have a tree of depth N, then the worst case guarantee is
O(exp(exp(exp(exp...N times...(exp(epsilon)))).

3) the quality of MC playouts is not guaranteed. It follows that
UCT is optimizing the *sum* of the game theoretic score and the *bias*
of the playouts.

4) UCT search cannot be deep enough to solve all of the problems
on the Go board consecutively. It will make some progress down some
of those variations, but then push the rest into the playouts.

5) Forcing moves, such as atari-connect, exist in great numbers in
well-played Go positions. "Well-played" here means that the players
are not wasting moves. "No wasted moves" means that the balance
between life and death is decided by a single turn. "Decided by a
single turn" means that many forcing moves exist, and they come to
nothing with correct play, but will win if an error is made.


Imagine a situation that the playouts do not handle well, such
as Ingo's races. The outcome is bad for Black in alternating play. But
the playout mishandles it, perhaps because of laziness, and sees
that Black wins 40%. What will UCT do?

Suppose that UCT requires 1000 trials to solve that problem (e.g.,
Black realizes that when White plays E5 then Black will lose a string).
For Black, UCT will prefer a sequence of forcing moves (elsewhere)
that push the loss over the horizon. If White's opportunity to play
E5 is delayed long enough, then fewer than 1000 trials will be available
to search White's key move, so Black will be happy because the
probability of loss is reduced from 100% to 60%.

If UCT searches deeply enough to work through all forcing moves
whose "temperature" is higher than the difference between winning
and losing that local situation, then Black will start to see that
there is no reason to delay the inevitable, and begin to think about
alternative plays. But then you have to overcome two final fact:

6) When UCT starts thinking about a winning move after trying
losing moves for thousands of iterations, then it will require *many*
thousands of iterations to change its mind about the eval of the position.
E.g., if you have 1000 trials of moves that win 40%, and now you
find a move that wins 55%, then you need 2000 more trials for the
score of the node to recover to 50%.

7) Solving a problem down one branch is not enough. You must solve
it on all sequences.

Combine exponential worst-case time with the slowness in recovering
from an error, and with the transmission delay in propagating changes
through the tree, and with the need to refute forcing moves in all
sequences. Do this on a board of size 19x19 that has 5 groups of each
color and 10 active local battlefields between them, and you have the
makings of a combinatorial explosion of nuclear magnitude.


_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

Reply via email to