> There has been some talk here of using a zero exploration coefficient. Does
> this literally mean using the win ratio (with one "dummy" win per node) to
> decide paths through the MC tree? It seems that the best move could easily
> be eliminated by a couple of bad runs.
>
> Does this only work when using RAVE/AMAF?
>

I can at least explain how exploration works in MoGo.

With RAVE/AMAF, the coefficient in front of the UCB-like term
sqrt(log(...)/...) is 0.

For a long time, the exploration was a linear compromise between the
AMAF winRate and the standard winRate, with no other term, and in particular
no optimistic term. However:

- the winRates are "regularized", i.e. we use for example
  (nbWins+K)/(nbLosses+2K), or something like that, which damps the effect
  of a few unlucky simulations. This simple trick is, I think, central in
  avoiding bad luck.

- since we have patterns, we added a third term; in early versions, this
  term was a coefficient between 0 and 1, and the linear combination of the
  three terms was weighted so that the weights summed to 1 - the result was
  still an estimate of success rate, without "optimism in the face of
  uncertainty".

- then, we had a real improvement by adding an "optimistic" exploration
  term, using the pattern value:
  +mangoPatternValue/log(nbSimulationsForThisMove+2). This decreases very
  slowly (logarithmically), with a small initial value - it's nearly a small
  systematic bias.
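
As a rough illustration of the three points above, here is a minimal Python
sketch. It is not MoGo's code: the function names, the weights amaf_weight
and pattern_weight, and the default K=1 are hypothetical choices for the
example. I also assume that the regularized rate divides by the total number
of simulations (wins plus losses), so that it stays between 0 and 1; the
formula above is only given approximately.

```python
import math


def regularized_rate(wins, sims, k=1.0):
    """Win rate pulled toward 0.5: (wins + K) / (sims + 2K).

    Assumption: sims = wins + losses. With few simulations the value stays
    near 0.5, so a couple of unlucky playouts cannot push it to 0.
    """
    return (wins + k) / (sims + 2.0 * k)


def optimistic_bonus(pattern_value, sims):
    """Pattern-based bonus that decays logarithmically with the visit count."""
    return pattern_value / math.log(sims + 2)


def move_score(wins, sims, amaf_wins, amaf_sims, pattern_value,
               amaf_weight, pattern_weight, k=1.0):
    """Score used to pick a move, with no sqrt(log(...)/...) term.

    The three weights sum to 1, so the first part remains an estimate of the
    success rate; only the last term is optimistic.
    """
    mc_weight = 1.0 - amaf_weight - pattern_weight  # hypothetical weighting
    estimate = (mc_weight * regularized_rate(wins, sims, k)
                + amaf_weight * regularized_rate(amaf_wins, amaf_sims, k)
                + pattern_weight * pattern_value)
    return estimate + optimistic_bonus(pattern_value, sims)


# A move that lost its first two playouts: the raw rate 0/2 is 0.0,
# while the regularized rate is (0 + 1) / (2 + 2) = 0.25.
print(regularized_rate(0, 2))
print(move_score(0, 2, 3, 6, 0.5, amaf_weight=0.5, pattern_weight=0.1))
```

In this sketch the weights are fixed inputs; how they should change as the
number of simulations grows is left open here.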

By the way, the conditions for consistency in A*, which is quite related
to Monte-Carlo Tree Search in my humble opinion, imply optimism in the sense
that the value must be overestimated. UCT/MCTS is really similar to A*
without the so-called "closed set".

Best regards,
Olivier
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
