Re: [computer-go] analysis of UCT and other bandit algorithms for tree search

Jason House Tue, 24 Apr 2007 10:39:12 -0700

I haven't gotten to read through your paper as carefully as I'd like yet,
but I do have a few observations that may be of benefit to other readers on
the list... Mostly observation of assumptions used in the paper.


1. A max tree is used instead of minimax
2. Rewards can be more than just 1 or 0 (win or loss).
3. The goal is to find the optimal leaf rather than the optimal subtree.

I'm not trying to say anything about the usefulness of the result, but
rather helping people decide how interested they'd be in the paper.  There
are likely counters to the points above.  For example, the addresses #1 by
saying "the minimax problem is a direct generalization of the results
presented"

On 4/24/07, Remi Munos <[EMAIL PROTECTED]> wrote:


Hi,

I have been encouraged to post a message to this list because we wrote a
paper
on the analysis of UCT and other bandit-based methods for tree search, and
this might be useful for computer-go.

The paper "Bandit Algorithms for Tree Search" is available as an INRIA
technical report at: http://hal.inria.fr/inria-00136198

In short, we start by explaining why UCT may perform very poorly in some
cases
because of excessive optimism. The regret of UCT is asymptotically O(log
n)
but with a initial period that may be prohibitively long.

In order to avoid this problem, we need to modify the confidence sequence
so
that to increase exploration when it is necessary. We propose an algorithm
(bandit algorithm in smooth trees) that takes into account possible
smoothness in the tree, namely that the "bias" of the rewards along an
optimal branch is decreasing with the depth of the considered nodes in the
branch, where the "bias" of a node is defined as the difference between
the
value of the node and the expectation of the rewards obtained from that
node
(ie. the difference between the max, or min-max, value and a Monte-Carlo
estimate of its value). Like UCT, the choice of an action depends on upper
confidence bounds, but here, the bound of each node is computed in terms
of
the bounds of its children, as well as a direct Chernoff-Hoeffding bound
based on the average rewards and the smoothness of the tree. We show that
the
algorithm spend a O(log(n)) time on suboptimal branches, thus essentially
exploring the optimal branch. The prior smoothness coefficient of the tree
is
a parameter of the algorithm. If set to 0, this algorithm is nothing else
than UCT. If set to infinity, this algo is a flat-UCB performed on the
leaves. If a correct parameter is set, it performs better than both
extremes.
Of course the setting of a correct value of the parameter for the specific
game of go is left to the experts, which I am not, unfortunately!

Since several go programs are using UCT ideas, we thought this modified
version may be of interest. The paper consider the max-max tree case but
extension to min-max search is straightforward.

Best, Remi Munos

_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] analysis of UCT and other bandit algorithms for tree search

Reply via email to