Re: [Computer-go] Exploration formulas for UCT
Let me preface my answer by saying that in no way does Pebbles formula represent a thoroughly researched effort. Actually, I have cobbled together a lot of ideas from theory and practice, and tweaked the rules without extensive testing. Looking at my code, it seems like I have 10x as many comments about possible alternatives and variations than about actual tests. Pebbles uses (1-beta) * qUCT + beta * qRAVE; Beta is set according to Silver's thesis. Both qUCT and qRAVE incorporate exploration terms. My current implementation uses the exploration term from the Beta-Distribution formula, from https://webdisk.lclark.edu/drake/publications/BetaDistribution.pdf. This differs from the standard exploration UCT exploration policy in that it is theoretically correct for binary data. For the record, I did not see a significant change in overall strength when using the beta-distribution formula for exploration. There were cases where it was better and it was a better theoretical match, and that was enough for me to switch. I have not tested dropping the exploration terms from the qRAVE estimate. Pebbles initializes RAVE data with wins and losses based upon patterns. The weights for the patterns are not ELO-based. Instead, they are slowly tweaked to optimize competitive results. I have not tried ELO (or any other a-priori tuning method) because I don't trust the model. There is no reason why the optimal weights for the purpose of searching should match the frequency at which moves are played in games. Pebbles patterns are nothing special. 3x3, atari, CFG distance,... Looking at my code, I notice many special cases directed at limiting behavior. For example, the following snippet is important: // Preserves the consistency of MCTS in the sense that it will eventually // search everything even if qRAVE seriously underestimates qUCT. // Verified by Olivier Teytaud. // if (qRAVE < 0.1) { qRAVE = 0.1; } I don't know about the optimality of the constant 0.1, nor whether this catches all failure modes, but this line must be part of the implementation or else I see serious failures that cannot be overcome by any reasonable search. Another type of limit is the following: // Limit the bounds, since it is impossible to do more than win. Also, // note the assumption that every move gets one loss, and that moves // with more trials are better. Allocating trials to moves that // have been as successful as possible reduces the dependency on the // exploration constant in the early life of a node. (This is a good thing.) // if (metric > t / (t + 0.5)) { metric = t / (t + 0.5); } This type of limit is in several places. It isn't asymptotically necessary, or anything, but it has the effect of bounding the optimism of the estimates. I put such code in because when you have an exploration term then qUCT and qRAVE can exceed 1.0. Now, there is nothing wrong with exploration terms that exceed 1.0, in the sense that if you continue to invest trials then eventually those estimates will come down. However, trials are at a premium in Pebbles because I don't have much hardware. So I am very particular about what happens in the early lifecycle of a node. My finding was that Pebbles had better results. Your mileage may vary. It is possible that this is unnecessary, because exploration terms based on the beta distribution should not produce confident bounds larger than 1.0. Hope this helps. Brian ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
Hi! On Sun, Jan 02, 2011 at 03:53:32PM +0800, Aja wrote: > >I guess it should be not "* 3000" but "/ 3000". > > > >Zen also uses this type of formula, but the constant value is rather > >small. I use 400 for the latest version of Zen. > > If you are right, then it makes sense. For "/3000", bias is around 0.009. > I use 600 for Erica, similar to Zen. Yes, I am sorry, it's /3000. Petr "Pasky" Baudis ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
Hi, Yamato, I guess it should be not "* 3000" but "/ 3000". Zen also uses this type of formula, but the constant value is rather small. I use 400 for the latest version of Zen. If you are right, then it makes sense. For "/3000", bias is around 0.009. I use 600 for Erica, similar to Zen. Aja ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
Aja wrote: >> We use the Silver formula: >> >> rave_visits / (rave_visits + real_visits + rave_visits * real_visits * >> 3000) >> >> The figure of 3000 is surprisingly resilient. Even with radically >> different heuristics and playouts, it stays the empirical optimum. > > Interesting. According to Sylvain's original post here, that means you >set bias to sqrt(3000/4)=27.386... But is not bias should be in the range >[0,1]? I guess it should be not "* 3000" but "/ 3000". Zen also uses this type of formula, but the constant value is rather small. I use 400 for the latest version of Zen. -- Yamato ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
Hi petr, We use the Silver formula: rave_visits / (rave_visits + real_visits + rave_visits * real_visits * 3000) The figure of 3000 is surprisingly resilient. Even with radically different heuristics and playouts, it stays the empirical optimum. Interesting. According to Sylvain's original post here, that means you set bias to sqrt(3000/4)=27.386... But is not bias should be in the range [0,1]? Aja ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
Hi Hiroshi, (1 - beta) * (win_rate + 0.31 * sqrt( ln(parent_visits) / child_visits)) + beta (rave_win_rate * 0.31 * sqrt( ln(rave_parent_visits) / rave_child_visits)) I suggest to take off the exploration_term of RAVE, just like Silver suggested in his PhD thesis. Considering exploration for RAVE is a bit meaningless, since in a node normally all moves are updated at the same time. UCT searches B(E5),W(D3),B(C5),W(F7), and in this position, playout searches B(E7),W(E8),B(D8),W(F8),B(D7)...Black win. In W(D3) positions, Aya updates RAVE and UCT, Updates C5(UCT) Updates C5(RAVE) Updates E7(RAVE) Updates D8(RAVE) Updates D7(RAVE) I think "Updates C5(RAVE)" is strange, but I could not get good result without this. I can't see why it is strange and wonder why do you think so. In Erica, I update C5(RAVE) as well. Aja ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
For Aya, (1 - beta) * (win_rate + 0.31 * sqrt( ln(parent_visits) / child_visits)) + beta (rave_win_rate * 0.31 * sqrt( ln(rave_parent_visits) / rave_child_visits)) beta = sqrt(100 / (3 * child_visits + 100)); Aya uses Progressive Windening. High order N moves are only considerd. PW_sort_N = ln(parent_visits/ 40.0) / ln(1.4) +2; Moves are sorted sometimes by rave value, Criticality, and MC owners. I also would like to know how to count rave. UCT searches B(E5),W(D3),B(C5),W(F7), and in this position, playout searches B(E7),W(E8),B(D8),W(F8),B(D7)...Black win. In W(D3) positions, Aya updates RAVE and UCT, Updates C5(UCT) Updates C5(RAVE) Updates E7(RAVE) Updates D8(RAVE) Updates D7(RAVE) I think "Updates C5(RAVE)" is strange, but I could not get good result without this. Hiroshi Yamashita - Original Message - From: "David Fotland" To: Sent: Sunday, January 02, 2011 5:18 AM Subject: [Computer-go] Exploration formulas for UCT It would be interesting to see the actual formulas used for choosing the more to try in the tree part of the search. For Many Faces, it is: (1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) + beta * rave_win_rate + mfgo_bias beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits)) A child with no visits has a win_rate of 1.1. Otherwise there is no win_rate bias. rave wins and visits are strongly biased when moves are generated using various rules and information from the mfgo move generator (in a range of 10% to 90% win rate, with hundreds to thousands of visits). mfgo_bias is unchanging, per move, within a range of about +-2%, based on mfgo’s move generator’s estimate of the quality of the move. Does anyone else want to share? David ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
Hi! On Sat, Jan 01, 2011 at 12:18:46PM -0800, David Fotland wrote: > For Many Faces, it is: > > > > (1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) + > beta * rave_win_rate + mfgo_bias Pachi: (1 - beta) * (win_rate) + beta * (rave_win_rate) "Even game prior" is essential - prioring win_rate with 0.5 at n playouts, where n can be between 7 and 40. On 9x9, 0.02 * sqrt(...) can be beneficial (in the order of few ELO, I think) in some setups, but it seems its effect can be equated by twiddling other values. > beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits)) We use the Silver formula: rave_visits / (rave_visits + real_visits + rave_visits * real_visits * 3000) The figure of 3000 is surprisingly resilient. Even with radically different heuristics and playouts, it stays the empirical optimum. > A child with no visits has a win_rate of 1.1. Otherwise there is no win_rate > bias. > > rave wins and visits are strongly biased when moves are generated using > various rules and information from the mfgo move generator (in a range of 10% > to 90% win rate, with hundreds to thousands of visits). > > mfgo_bias is unchanging, per move, within a range of about +-2%, based on > mfgo’s move generator’s estimate of the quality of the move. We call our bias a "prior" and simply seed either the win_rate or rave_win_rate with it upon node expansion - it makes almost no difference if we use normal or rave win_rate, surprisingly. Of course we have many sources of priors. Each prior contributes certain winrate value (but usually either 0.0 or 1.0) weighted with certain number of visits - again between 7 and 40. Aside of the even game prior (which is essential for the search simply to work at all in our setup - and it is also another way to compensate for the missing exploration term), the most important prior is the playout policy hinter, using the same heuristics (and code) as the playout policy to pick good tree moves. On the 19x19, line height prior (1st line 0.0, 3rd line 1.0) and especially the CFG-to-last-move prior are essential. -- Petr "Pasky" Baudis Computer science education cannot make an expert programmer any more than studying brushes and pigment can make an expert painter. --esr ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
Re: [Computer-go] Exploration formulas for UCT
For Erica, it's almost the same with Many Faces, except that I use progressive_bias not mfgo_bias (of course, if David send the details of mfgo_bias to me, I will use :) I compute beta by David Silver's formula (70 elo stronger than the original one) and UCT_C is set to 0.6. I can't get any good result from smaller UCT_C. Aja - Original Message - From: David Fotland To: computer-go@dvandva.org Sent: Sunday, January 02, 2011 4:18 AM Subject: [Computer-go] Exploration formulas for UCT It would be interesting to see the actual formulas used for choosing the more to try in the tree part of the search. For Many Faces, it is: (1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) + beta * rave_win_rate + mfgo_bias beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits)) A child with no visits has a win_rate of 1.1. Otherwise there is no win_rate bias. rave wins and visits are strongly biased when moves are generated using various rules and information from the mfgo move generator (in a range of 10% to 90% win rate, with hundreds to thousands of visits). mfgo_bias is unchanging, per move, within a range of about +-2%, based on mfgo’s move generator’s estimate of the quality of the move. Does anyone else want to share? David From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Fuming Wang Sent: Saturday, January 01, 2011 9:00 AM To: Aja; computer-go@dvandva.org Subject: Re: [Computer-go] Fwd: News on Tromp-Cook ? Hi Aja, On Sun, Jan 2, 2011 at 12:16 AM, Aja wrote: Hi Fuming, Most of the current strong programs are using UCT combined with RAVE (a kind of AMAF). The formula is like this (there are many variants), C*RAVE+(1-C)*UCT This has been my understanding. However, I am surprized to find out that people have been setting C close to one, according to Petr and Oliver's postings, which is essentially AMAF. MF apparently is doing something different. Fuming -- ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
[Computer-go] Exploration formulas for UCT
It would be interesting to see the actual formulas used for choosing the more to try in the tree part of the search. For Many Faces, it is: (1 – beta) * (win_rate + 0.45 * sqrt( ln(parent_visits) / child visits)) + beta * rave_win_rate + mfgo_bias beta is the old Mogo formula of sqrt(500/(500 + 3 * parent_visits)) A child with no visits has a win_rate of 1.1. Otherwise there is no win_rate bias. rave wins and visits are strongly biased when moves are generated using various rules and information from the mfgo move generator (in a range of 10% to 90% win rate, with hundreds to thousands of visits). mfgo_bias is unchanging, per move, within a range of about +-2%, based on mfgo’s move generator’s estimate of the quality of the move. Does anyone else want to share? David From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Fuming Wang Sent: Saturday, January 01, 2011 9:00 AM To: Aja; computer-go@dvandva.org Subject: Re: [Computer-go] Fwd: News on Tromp-Cook ? Hi Aja, On Sun, Jan 2, 2011 at 12:16 AM, Aja wrote: Hi Fuming, Most of the current strong programs are using UCT combined with RAVE (a kind of AMAF). The formula is like this (there are many variants), C*RAVE+(1-C)*UCT This has been my understanding. However, I am surprized to find out that people have been setting C close to one, according to Petr and Oliver's postings, which is essentially AMAF. MF apparently is doing something different. Fuming ___ Computer-go mailing list Computer-go@dvandva.org http://dvandva.org/cgi-bin/mailman/listinfo/computer-go