On Wednesday 25 April 2007 at 15:49, you wrote:
> On 4/25/07, Remi Munos <[EMAIL PROTECTED]> wrote:
> > > 3. The goal is to find the optimal leaf rather than the optimal
> > > subtree.
> >
> > I'm not sure I understand this comment. Actually, the goal is to find a
> > close-to-optimal play at the root level. Thus finding a close-to-optimal
> > min-max leaf achieves that purpose.
>
> I think that was based on two things...
> a) The analysis of the max time to "find the 1", and
> b) The pure bandit problem applied directly to the leaves of the tree.
>
> When I did sit down and read the paper for real, I saw that both of those
> things were just building up support for the BAST algorithm.  (a) is to
> justify accepting a higher regret (in exchange for better worst-case
> performance).

Well, this is not really accepting a higher regret. I would say it is to
guarantee lower regret in all cases. UCT has low non-asymptotic regret (i.e.,
for a fixed stage n) in the worst case only when the tree is very smooth,
i.e., when the branches that appear good (from the rewards obtained so far)
are actually good and the branches that appear bad are truly bad. Of course, I
don't know how true this is for go. But in bad cases, UCT is terribly bad!
Basically, one cannot expect UCT to have (for a fixed n) a regret independent
of D (the depth of the tree) unless the tree is smooth with a smoothness
constant of 0 (i.e., all values are the same!). I see BAST as a natural
extension of UCT to trees that are not that smooth. If no smoothness exists at
all, I think flat-UCB is the best (for a given tree of depth D).
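
For concreteness, here is a small Python sketch of the standard UCB1 selection
rule that UCT applies at each internal node. This uses the usual
sqrt(2 log(total) / t) bonus for illustration only; the constants and the
confidence sequence we analyse in the report are different (see below):

import math

def ucb1_choice(children):
    # children: list of (mean_reward, visit_count) pairs for one node
    total = sum(n for _, n in children)
    # play each child once before trusting any statistics
    for i, (_, n) in enumerate(children):
        if n == 0:
            return i
    # empirical mean plus the standard UCB1 exploration bonus
    def index(mean, n):
        return mean + math.sqrt(2.0 * math.log(total) / n)
    return max(range(len(children)), key=lambda i: index(*children[i]))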

> (b) is showing the non-UCT extreme of the BAST algorithm 
> (L=infinity).  (If I understand correctly, L=0 is UCT)

Yes.
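
To give a rough picture of how L interpolates between the two extremes (with
the caveat that the exact B-values in the tech report are defined
differently), one can think of L as adding a depth-dependent slack to each
internal node's own upper bound, as in the following toy recursion. The Node
fields and the L * 2^{-depth} slack term here are illustrative choices only:

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    mean: float      # empirical mean of rewards seen through this node
    visits: int      # number of times this node has been visited
    depth: int
    children: list = field(default_factory=list)

def b_value(node, L, total):
    # confidence width; illustrative only (our analysis uses the
    # sequence sqrt(log(t^2 \beta^{-1}) / t), see below)
    width = math.sqrt(math.log(total) / node.visits)
    if not node.children:
        return node.mean + width   # leaf: plain upper confidence bound
    # smoothness slack shrinking with depth (illustrative choice)
    own = node.mean + L * 2.0 ** (-node.depth) + width
    best_child = max(b_value(c, L, total) for c in node.children)
    return min(own, best_child)

With L = 0 the min is dominated by the node's own bound, which gives UCT-like
behaviour; as L goes to infinity the node's own bound becomes vacuous and only
the leaves' confidence bounds survive, which is flat UCB over the leaves.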

>
> PS: I was surprised that the bar chart near the end had L=infinity, L=20,
> and L=7, but not L=0 plotted on it.

Good point. Actually, we did run these experiments but did not show them in
this tech report because of the high variance in the performance. The point is
that we used a different confidence interval sequence for our analysis than
what is usually used in UCT (i.e., sqrt(log(t^2 \beta^{-1}) / t) instead of
sqrt(log(n) / t)); thus we obtain high-probability bounds (with probability
1-\beta) on the regret instead of bounds on the expected regret. Using this
confidence sequence in our experiments, for L=0 we obtained, across different
simulations (but using the same function), sometimes a very low regret (better
than for L=7) and other times a very high regret (much worse than for L=7). It
seems this is very dependent on the actual rewards that have been received.
This high variance across simulations does not appear for higher values of L.
Rather than trying to describe this phenomenon, we simply did not mention it.
But given your comment, we will definitely do so in the final version of the
paper.
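
To see the difference between the two sequences numerically, one can compare
the two widths directly; here beta is the confidence parameter of our
high-probability bound, and the sample values are just an example:

import math

def width_ours(t, beta):
    # our sequence: sqrt(log(t^2 / beta) / t), giving bounds that hold
    # with probability at least 1 - beta, simultaneously over t
    return math.sqrt(math.log(t * t / beta) / t)

def width_uct(t, n):
    # the sequence usually used in UCT: sqrt(log(n) / t), where n is the
    # total number of plays so far and t the number of plays of this arm
    return math.sqrt(math.log(n) / t)

for t in (10, 100, 1000):
    print(t, width_ours(t, beta=0.05), width_uct(t, n=10 * t))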
Thanks,
Remi
