> There has been some talk here of using a zero exploration coefficient. Does
> this literally mean using the win ratio (with one "dummy" win per node) to
> decide paths through the MC tree? It seems that the best move could easily
> be eliminated by a couple of bad runs.
>
> Does this only work when using RAVE/AMAF?
I can at least explain how exploration is handled in MoGo. In the Rave/Amaf case, we have a coefficient of 0 in front of the UCB-like term sqrt(log(...)/...). For a long time, the value used for exploration was a linear compromise between the Amaf winRate and the standard winRate, with no other term and, in particular, no optimistic term. However:

- the winRates are "regularized", i.e. each one is for example (nbWins+K)/(nbLosses+2K), or something like that, so that a few unlucky simulations cannot drive the estimate to zero. This simple trick is, I think, central in avoiding bad luck.
- since we have patterns, we added a third term; in early versions, this term was a coefficient between 0 and 1, and the linear combination of the three terms was weighted so that the weights summed to 1. The result was still an estimate of the success rate, without "optimism in the face of uncertainty".
- then, we had a real improvement by adding an "optimistic" exploration term, using the pattern value: +mangoPatternValue/log(nbSimulationsForThisMove+2). This decreases very slowly (logarithmically), with a small initial value; it is nearly a small systematic bias.

By the way, the conditions for consistency in Astar, which is quite related to Monte-Carlo Tree Search in my humble opinion, imply optimism in the sense that the value must be overestimated. UCT/MCTS is really similar to Astar without the so-called "closed set".

Best regards,
Olivier
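As an illustration of the three ingredients described above (regularized win rates, a linear blend of the Amaf and standard estimates, and the slowly decaying pattern bonus), here is a minimal Python sketch. It is not MoGo's code. The function names, the fixed amaf_weight, and the default K are hypothetical, and the regularization is assumed to take the common form (wins + K) / (simulations + 2K), since the post gives its formula only approximately. The early three-way weighting with the pattern term is left out.

```python
import math


def regularized_win_rate(wins, sims, k=1.0):
    """Win rate shrunk toward 1/2. Assumed form: (wins + K) / (sims + 2K)."""
    return (wins + k) / (sims + 2.0 * k)


def move_score(wins, sims, amaf_wins, amaf_sims, pattern_value,
               amaf_weight=0.5, k=1.0):
    """Score used to pick a child; no sqrt(log(...)/...) UCB term is added.

    amaf_weight is a hypothetical fixed blend weight, chosen for illustration.
    """
    standard = regularized_win_rate(wins, sims, k)
    amaf = regularized_win_rate(amaf_wins, amaf_sims, k)
    # Linear compromise between the Amaf estimate and the standard estimate.
    blended = amaf_weight * amaf + (1.0 - amaf_weight) * standard
    # Optimistic bonus from the pattern value; decays logarithmically.
    bonus = pattern_value / math.log(sims + 2.0)
    return blended + bonus


if __name__ == "__main__":
    # Two straight losses: the raw ratio would be 0, the regularized one is not.
    print(regularized_win_rate(wins=0, sims=2, k=1.0))
    # Same statistics and pattern value, few versus many simulations.
    print(move_score(5, 10, 5, 10, pattern_value=0.1))
    print(move_score(500, 1000, 500, 1000, pattern_value=0.1))
```

With K = 1, a move that loses its first two simulations keeps an estimate of 0.25 instead of 0, so it can still be revisited. The bonus shrinks only logarithmically as simulations accumulate, which matches the description of it as close to a small systematic bias.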
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/