Re: [Computer-go] Exploration formulas for UCT

2011-01-02 Thread Brian Sheppard
Let me preface my answer by saying that in no way does Pebbles formula
represent a thoroughly researched effort. Actually, I have cobbled together
a lot of ideas from theory and practice, and tweaked the rules without
extensive testing.

Looking at my code, it seems like I have 10x as many comments about possible
alternatives and variations than about actual tests.

Pebbles uses

(1-beta) * qUCT + beta * qRAVE;

Beta is set according to Silver's thesis.

Both qUCT and qRAVE incorporate exploration terms. My current implementation
uses the exploration term from the Beta-Distribution formula, from
https://webdisk.lclark.edu/drake/publications/BetaDistribution.pdf. This
differs from the standard exploration UCT exploration policy in that it is
theoretically correct for binary data.

For the record, I did not see a significant change in overall strength when
using the beta-distribution formula for exploration. There were cases where
it was better and it was a better theoretical match, and that was enough for
me to switch.

I have not tested dropping the exploration terms from the qRAVE estimate.

Pebbles initializes RAVE data with wins and losses based upon patterns. The
weights for the patterns are not ELO-based. Instead, they are slowly tweaked
to optimize competitive results. I have not tried ELO (or any other a-priori
tuning method) because I don't trust the model. There is no reason why the
optimal weights for the purpose of searching should match the frequency at
which moves are played in games.

Pebbles patterns are nothing special. 3x3, atari, CFG distance,...

Looking at my code, I notice many special cases directed at limiting
behavior. For example, the following snippet is important:

// Preserves the consistency of MCTS in the sense that it will
eventually
// search everything even if qRAVE seriously underestimates qUCT.
// Verified by Olivier Teytaud.
//
if (qRAVE < 0.1) {
qRAVE = 0.1;
}

I don't know about the optimality of the constant 0.1, nor whether this
catches all failure modes, but this line must be part of the implementation
or else I see serious failures that cannot be overcome by any reasonable
search.

Another type of limit is the following:

// Limit the bounds, since it is impossible to do more than win.
Also,
// note the assumption that every move gets one loss, and that moves
// with more trials are better. Allocating trials to moves that
// have been as successful as possible reduces the dependency on the
// exploration constant in the early life of a node. (This is a good
thing.)
//
if (metric > t / (t + 0.5)) {
metric = t / (t + 0.5);
}

This type of limit is in several places. It isn't asymptotically necessary,
or anything, but it has the effect of bounding the optimism of the
estimates. I put such code in because when you have an exploration term then
qUCT and qRAVE can exceed 1.0.

Now, there is nothing wrong with exploration terms that exceed 1.0, in the
sense that if you continue to invest trials then eventually those estimates
will come down. However, trials are at a premium in Pebbles because I don't
have much hardware. So I am very particular about what happens in the early
lifecycle of a node. My finding was that Pebbles had better results. Your
mileage may vary.

It is possible that this is unnecessary, because exploration terms based on
the beta distribution should not produce confident bounds larger than 1.0.

Hope this helps.

Brian

___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go


Re: [Computer-go] Exploration formulas for UCT

2011-01-02 Thread Petr Baudis
  Hi!

On Sun, Jan 02, 2011 at 03:53:32PM +0800, Aja wrote:
> >I guess it should be not "* 3000" but "/ 3000".
> >
> >Zen also uses this type of formula, but the constant value is rather
> >small. I use 400 for the latest version of Zen.
> 
>   If you are right, then it makes sense. For "/3000", bias is around 0.009.
>   I use 600 for Erica, similar to Zen.

  Yes, I am sorry, it's /3000.

Petr "Pasky" Baudis
___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go


Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Aja

  Hi, Yamato,


I guess it should be not "* 3000" but "/ 3000".

Zen also uses this type of formula, but the constant value is rather
small. I use 400 for the latest version of Zen.


  If you are right, then it makes sense. For "/3000", bias is around 0.009.
  I use 600 for Erica, similar to Zen.

  Aja 


___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go


Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Yamato
Aja wrote:
>>  We use the Silver formula:
>>
>> rave_visits / (rave_visits + real_visits + rave_visits * real_visits * 
>> 3000)
>>
>> The figure of 3000 is surprisingly resilient. Even with radically
>> different heuristics and playouts, it stays the empirical optimum.
>
>   Interesting. According to Sylvain's original post here, that means you 
>set bias to sqrt(3000/4)=27.386... But is not bias should be in the range 
>[0,1]?

I guess it should be not "* 3000" but "/ 3000".

Zen also uses this type of formula, but the constant value is rather
small. I use 400 for the latest version of Zen.

--
Yamato
___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go


Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Aja

  Hi petr,


 We use the Silver formula:

rave_visits / (rave_visits + real_visits + rave_visits * real_visits * 
3000)


The figure of 3000 is surprisingly resilient. Even with radically
different heuristics and playouts, it stays the empirical optimum.


  Interesting. According to Sylvain's original post here, that means you 
set bias to sqrt(3000/4)=27.386... But is not bias should be in the range 
[0,1]?


 Aja


___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go


Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Aja

Hi Hiroshi,

(1 - beta) * (win_rate + 0.31 * sqrt( ln(parent_visits) / child_visits)) + 
beta (rave_win_rate *  0.31 * sqrt( ln(rave_parent_visits) / 
rave_child_visits))


  I suggest to take off the exploration_term of RAVE, just like Silver 
suggested in his PhD thesis. Considering exploration for RAVE is a bit 
meaningless, since in a node normally all moves are updated at the same 
time.


UCT searches B(E5),W(D3),B(C5),W(F7), and in this position, playout 
searches

B(E7),W(E8),B(D8),W(F8),B(D7)...Black win.

In W(D3) positions, Aya updates RAVE and UCT,
Updates  C5(UCT)
Updates  C5(RAVE)
Updates  E7(RAVE)
Updates  D8(RAVE)
Updates  D7(RAVE)

I think "Updates C5(RAVE)" is strange, but I could not get good result 
without this.


  I can't see why it is strange and wonder why do you think so. In Erica, I 
update C5(RAVE) as well.


 Aja


___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go


Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Hiroshi Yamashita

For Aya,

(1 - beta) * (win_rate + 0.31 * sqrt( ln(parent_visits) / child_visits)) + beta (rave_win_rate *  0.31 * sqrt( 
ln(rave_parent_visits) / rave_child_visits))


beta = sqrt(100 / (3 * child_visits + 100));

Aya uses Progressive Windening. High order N moves are only considerd.

PW_sort_N = ln(parent_visits/ 40.0) / ln(1.4) +2;

Moves are sorted sometimes by rave value, Criticality, and MC owners.


I also would like to know how to count rave.

UCT searches B(E5),W(D3),B(C5),W(F7), and in this position, playout searches
B(E7),W(E8),B(D8),W(F8),B(D7)...Black win.

In W(D3) positions, Aya updates RAVE and UCT,
Updates  C5(UCT)
Updates  C5(RAVE)
Updates  E7(RAVE)
Updates  D8(RAVE)
Updates  D7(RAVE)

I think "Updates C5(RAVE)" is strange, but I could not get good result without 
this.

Hiroshi Yamashita


- Original Message - 
From: "David Fotland" 

To: 
Sent: Sunday, January 02, 2011 5:18 AM
Subject: [Computer-go] Exploration formulas for UCT


It would be interesting to see the actual formulas used for choosing the more 
to try in the tree part of the search.

For Many Faces, it is:

(1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) +

beta * rave_win_rate + mfgo_bias

beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits))

A child with no visits has a win_rate of 1.1.  Otherwise there is no win_rate 
bias.

rave wins and visits are strongly biased when moves are generated using various rules and information from the mfgo 
move generator (in a range of 10% to 90% win rate, with hundreds to thousands of visits).


mfgo_bias is unchanging, per move, within a range of about +-2%, based on mfgo’s move generator’s estimate of the 
quality of the move.


Does anyone else want to share?

David


___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Petr Baudis
  Hi!

On Sat, Jan 01, 2011 at 12:18:46PM -0800, David Fotland wrote:
> For Many Faces, it is:
> 
>  
> 
> (1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) + 
> beta * rave_win_rate + mfgo_bias 

  Pachi:

(1 - beta) * (win_rate) + beta * (rave_win_rate)

  "Even game prior" is essential - prioring win_rate with 0.5 at n
playouts, where n can be between 7 and 40.

  On 9x9, 0.02 * sqrt(...) can be beneficial (in the order of few ELO,
I think) in some setups, but it seems its effect can be equated by
twiddling other values.

> beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits))

  We use the Silver formula:

rave_visits / (rave_visits + real_visits + rave_visits * real_visits * 
3000)

The figure of 3000 is surprisingly resilient. Even with radically
different heuristics and playouts, it stays the empirical optimum.

> A child with no visits has a win_rate of 1.1.  Otherwise there is no win_rate 
> bias.
> 
> rave wins and visits are strongly biased when moves are generated using 
> various rules and information from the mfgo move generator (in a range of 10% 
> to 90% win rate, with hundreds to thousands of visits).
> 
> mfgo_bias is unchanging, per move, within a range of about +-2%, based on 
> mfgo’s move generator’s estimate of the quality of the move.

  We call our bias a "prior" and simply seed either the win_rate
or rave_win_rate with it upon node expansion - it makes almost
no difference if we use normal or rave win_rate, surprisingly.

  Of course we have many sources of priors. Each prior contributes
certain winrate value (but usually either 0.0 or 1.0) weighted with
certain number of visits - again between 7 and 40.

  Aside of the even game prior (which is essential for the search simply
to work at all in our setup - and it is also another way to compensate
for the missing exploration term), the most important prior is the
playout policy hinter, using the same heuristics (and code) as the
playout policy to pick good tree moves. On the 19x19, line height prior
(1st line 0.0, 3rd line 1.0) and especially the CFG-to-last-move prior
are essential.

-- 
Petr "Pasky" Baudis
Computer science education cannot make an expert programmer any more
than studying brushes and pigment can make an expert painter. --esr
___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

Re: [Computer-go] Exploration formulas for UCT

2011-01-01 Thread Aja
For Erica, it's almost the same with Many Faces, except that I use 
progressive_bias not mfgo_bias (of course, if David send the details of 
mfgo_bias to me, I will use :)

I compute beta by David Silver's formula (70 elo stronger than the original 
one) and UCT_C is set to 0.6. I can't get any good result from smaller UCT_C.

Aja

  - Original Message - 
  From: David Fotland 
  To: computer-go@dvandva.org 
  Sent: Sunday, January 02, 2011 4:18 AM
  Subject: [Computer-go] Exploration formulas for UCT


  It would be interesting to see the actual formulas used for choosing the more 
to try in the tree part of the search.

   

  For Many Faces, it is:

   

  (1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) +

  beta * rave_win_rate + mfgo_bias 

   

  beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits))

   

  A child with no visits has a win_rate of 1.1.  Otherwise there is no win_rate 
bias.

   

  rave wins and visits are strongly biased when moves are generated using 
various rules and information from the mfgo move generator (in a range of 10% 
to 90% win rate, with hundreds to thousands of visits).

   

  mfgo_bias is unchanging, per move, within a range of about +-2%, based on 
mfgo’s move generator’s estimate of the quality of the move.

   

  Does anyone else want to share?

   

  David

   

  From: computer-go-boun...@dvandva.org 
[mailto:computer-go-boun...@dvandva.org] On Behalf Of Fuming Wang
  Sent: Saturday, January 01, 2011 9:00 AM
  To: Aja; computer-go@dvandva.org
  Subject: Re: [Computer-go] Fwd: News on Tromp-Cook ?

   

  Hi Aja,



  On Sun, Jan 2, 2011 at 12:16 AM, Aja  wrote:

  Hi Fuming,

   

  Most of the current strong programs are using UCT combined with RAVE (a kind 
of AMAF). The formula is like this (there are many variants),

   

  C*RAVE+(1-C)*UCT


  This has been my understanding. However, I am surprized to find out that 
people have been setting C close to one, according to Petr and Oliver's 
postings, which is essentially AMAF. MF apparently is doing something different.

  Fuming



--


  ___
  Computer-go mailing list
  Computer-go@dvandva.org
  http://dvandva.org/cgi-bin/mailman/listinfo/computer-go___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

[Computer-go] Exploration formulas for UCT

2011-01-01 Thread David Fotland
It would be interesting to see the actual formulas used for choosing the more 
to try in the tree part of the search.

 

For Many Faces, it is:

 

(1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) +

beta * rave_win_rate + mfgo_bias 

 

beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits))

 

A child with no visits has a win_rate of 1.1.  Otherwise there is no win_rate 
bias.

 

rave wins and visits are strongly biased when moves are generated using various 
rules and information from the mfgo move generator (in a range of 10% to 90% 
win rate, with hundreds to thousands of visits).

 

mfgo_bias is unchanging, per move, within a range of about +-2%, based on 
mfgo’s move generator’s estimate of the quality of the move.

 

Does anyone else want to share?

 

David

 

From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] 
On Behalf Of Fuming Wang
Sent: Saturday, January 01, 2011 9:00 AM
To: Aja; computer-go@dvandva.org
Subject: Re: [Computer-go] Fwd: News on Tromp-Cook ?

 

Hi Aja,



On Sun, Jan 2, 2011 at 12:16 AM, Aja  wrote:

Hi Fuming,

 

Most of the current strong programs are using UCT combined with RAVE (a kind of 
AMAF). The formula is like this (there are many variants),

 

C*RAVE+(1-C)*UCT


This has been my understanding. However, I am surprized to find out that people 
have been setting C close to one, according to Petr and Oliver's postings, 
which is essentially AMAF. MF apparently is doing something different.

Fuming

___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go