Hi,

> I haven't gotten to read through your paper as carefully as I'd like yet,
> but I do have a few observations that may be of benefit to other readers
> on the list... Mostly observation of assumptions used in the paper.
>
> 1. A max tree is used instead of minimax
Yes. For min-max trees, several things may change in the proofs, but the
results are similar and the BAST algorithm is basically the same. For an
even (max) node i, the bound Bi is the minimum of:

* the max of the children's bounds;
* the average reward obtained from node i + a confidence-interval term
  (similar to UCT) + a smoothness-constant term (which has to be carefully
  set, I guess mostly empirically).

And for an odd (min) node i, the bound Bi is the maximum of:

* the min of the children's bounds;
* the average reward obtained from node i - the confidence-interval term
  - the smoothness-constant term.
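To make this recursion concrete, here is a small Python sketch. It is only
an illustration under simplified assumptions of mine: the Node fields, the
exact form of the confidence-interval term (a UCT-like sqrt(2 log n / n_i)),
the initialization of unvisited nodes, and the way the smoothness constant
delta enters are placeholders, not the precise quantities from the paper.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    is_max: bool                  # True for even (max) nodes, False for odd (min) nodes
    visits: int = 0               # number of rewards observed from this node
    mean: float = 0.0             # average of those rewards, assumed in [0, 1]
    children: list = field(default_factory=list)

def bound(node, total_visits, delta):
    # B-value of a node: a direct confidence bound on its value, tightened
    # by the bounds of its children (if any).
    if node.visits == 0:
        return 1.0 if node.is_max else 0.0    # init for unvisited nodes (my choice)
    conf = math.sqrt(2.0 * math.log(total_visits) / node.visits)   # UCT-like term
    if node.is_max:
        b = node.mean + conf + delta          # average + confidence + smoothness
        if node.children:
            b = min(b, max(bound(c, total_visits, delta) for c in node.children))
    else:
        b = node.mean - conf - delta          # average - confidence - smoothness
        if node.children:
            b = max(b, min(bound(c, total_visits, delta) for c in node.children))
    return b

With delta = 0 the direct term is just the usual UCT bound, and with a very
large delta the children's bounds become the only active constraint, which
matches the two extremes (UCT and a flat UCB on the leaves) mentioned in the
quoted message below.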
> 2. Rewards can be more than just 1 or 0 (win or loss).

Yes, they are assumed to be in [0,1]. In the growing-tree approach (last
section), from each node a reward is obtained whose expected value is
assumed to be close to the true (min-max) value of the node, up to the bias
term (the smoothness-constant term). I think this is compatible with the
use of Monte-Carlo simulations, where, from a node, a random (or
quasi-random) simulation is run until the end (when the board is full).
Thus, even though the final outcome of a full run is 0 or 1 (win or loss),
the node from which the run is started has an expected value which is a
real number between 0 and 1 (because of the inherent randomness of the
Monte-Carlo method). Thus, given a simulation device (Monte-Carlo with or
without patterns), the expected value of a node is a real number. Of course
the true min-max value of each node is 0 or 1. (A tiny rollout sketch
illustrating this point is given at the end of this message, after the
quoted text.)

> 3. The goal is to find the optimal leaf rather than the optimal subtree.

I'm not sure I understand this comment. Actually, the goal is to find a
close-to-optimal play at the root level. Thus finding a close-to-optimal
min-max leaf achieves that purpose.

Cheers,
Remi

> I'm not trying to say anything about the usefulness of the result, but
> rather helping people decide how interested they'd be in the paper. There
> are likely counters to the points above. For example, the paper addresses
> #1 by saying "the minimax problem is a direct generalization of the
> results presented".
>
> On 4/24/07, Remi Munos <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I have been encouraged to post a message to this list because we wrote
> > a paper on the analysis of UCT and other bandit-based methods for tree
> > search, and this might be useful for computer-go.
> >
> > The paper "Bandit Algorithms for Tree Search" is available as an INRIA
> > technical report at: http://hal.inria.fr/inria-00136198
> >
> > In short, we start by explaining why UCT may perform very poorly in
> > some cases because of excessive optimism. The regret of UCT is
> > asymptotically O(log n), but with an initial period that may be
> > prohibitively long.
> >
> > In order to avoid this problem, we need to modify the confidence
> > sequence so as to increase exploration when it is necessary. We propose
> > an algorithm (bandit algorithm in smooth trees) that takes into account
> > possible smoothness in the tree, namely that the "bias" of the rewards
> > along an optimal branch decreases with the depth of the considered
> > nodes in the branch, where the "bias" of a node is defined as the
> > difference between the value of the node and the expectation of the
> > rewards obtained from that node (i.e. the difference between the max,
> > or min-max, value and a Monte-Carlo estimate of its value).
> >
> > Like UCT, the choice of an action depends on upper confidence bounds,
> > but here, the bound of each node is computed in terms of the bounds of
> > its children, as well as a direct Chernoff-Hoeffding bound based on the
> > average rewards and the smoothness of the tree. We show that the
> > algorithm spends O(log n) time on suboptimal branches, thus essentially
> > exploring the optimal branch. The prior smoothness coefficient of the
> > tree is a parameter of the algorithm. If set to 0, this algorithm is
> > nothing else than UCT. If set to infinity, this algorithm is a flat UCB
> > performed on the leaves. If a correct parameter is set, it performs
> > better than both extremes. Of course the setting of a correct value of
> > the parameter for the specific game of go is left to the experts, which
> > I am not, unfortunately!
> >
> > Since several go programs are using UCT ideas, we thought this modified
> > version may be of interest. The paper considers the max-tree case, but
> > the extension to min-max search is straightforward.
> >
> > Best,
> > Remi Munos
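P.S. Here is the tiny rollout sketch promised in my answer to #2 above. It
is only meant to illustrate that, even though each single Monte-Carlo run
returns 0 or 1, the quantity estimated at a node is a real number in [0,1].
The Position class and the playout() function are toy stand-ins of mine for
whatever simulation device (Monte-Carlo, with or without patterns) a
program actually uses.

import random
from dataclasses import dataclass

@dataclass
class Position:
    # Toy stand-in for a go position; a real program would hold the board here.
    win_probability: float   # hidden "true" chance that a random playout is a win

def playout(position, rng):
    # Stand-in for one random (or quasi-random) simulation played until the
    # board is full: it returns a single 0/1 outcome (loss/win).
    return 1 if rng.random() < position.win_probability else 0

def estimated_value(position, n_playouts=1000, seed=0):
    # Averaging many 0/1 outcomes gives an estimate of the node's expected
    # reward: a real number in [0, 1], even though each outcome is 0 or 1.
    rng = random.Random(seed)
    wins = sum(playout(position, rng) for _ in range(n_playouts))
    return wins / n_playouts

print(estimated_value(Position(win_probability=0.62)))   # something close to 0.62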
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/