After much effort, I think I understand most of the Gelly & Silver
paper [1].  I'm hoping this post will help others, and that people will
correct any errors I've made.

First, some basic definitions of notation:
* In general, "Q" is an estimated winning rate, used in three ways:
 1. As an estimated winning rate based on random games
    Q_UCT  - Always used in this way
    Q_RAVE - Section 6, part of UCT_RAVE with 0 or 1 parameter
    Q_UR   - Section 6, part of UCT_RAVE with 2 parameters
 2. As a heuristic winning rate
    Q_PRIOR - Section 7, initialized with 2nd parameter of UCT_RAVE
    Q_RAVE  - Kind of, when UCT_RAVE has 2 parameters
    Q_RLGO  - Sections 7+
    Q_MoGo  - Always used in this way (don't confuse with pi_MoGo)
 3. Used for selecting moves in random games
    Q_RLGO - unique to section 5 (authors conclude not to do this)

* In general, "Q+" variants add an extra factor to Q to create an upper
confidence bound (standard UCT form of sqrt(log(n_parent)/n_child))

* "pi" Represents how to select moves, used in two ways:
 1. Selecting a move in random games
    pi_random - Select any random move
    pi_MoGo   - Random move selection based on earlier papers?
    pi_epsilon- Unique to section 5 (authors conclude not to use this)
    pi_sigma  - Unique to section 5 (authors conclude not to use this)
    pi_tau    - Unique to section 5 (authors conclude not to use this)
 2. Selecting a child inside the UCT search tree
    pi_UCT - Used in UCT algorithm
    pi_UR  - Used in UCT_RAVE algorithm
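
To make the Q+ and pi_UCT notation concrete, here is a minimal Python
sketch of UCT child selection as I understand it.  The function names
and the exploration constant c are mine, not the paper's:

    import math

    def ucb_value(q_child, n_child, n_parent, c=1.0):
        # Q+: estimated winning rate plus the standard UCT
        # exploration bonus sqrt(log(n_parent) / n_child).
        if n_child == 0:
            return float("inf")   # unvisited children get tried first
        return q_child + c * math.sqrt(math.log(n_parent) / n_child)

    def pi_uct(children):
        # pi_UCT: pick the child maximizing Q+.  `children` is a
        # list of (q, n) pairs; returns the index of the best child.
        n_parent = sum(n for _, n in children)
        return max(range(len(children)),
                   key=lambda i: ucb_value(children[i][0],
                                           children[i][1], n_parent))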

Sections 1&2 can be skipped.  Even more notably, most notation in
section 2 is unique.  For a more thorough overview of reinforced
learning, see "Reinforcement Learning: A Strategy" [2] (Reading through
section 4.2 is more than enough)

Section 3 is UCT restated in notation common to reinforcement learning.
Since I've already added other links, here's a link to the original
paper on UCT [3].

Section 4.
  This was the toughest section for me to interpret.  I found the
Sutton reference [4] to be both helpful background and an interesting
read (reading through section 2.3 is enough).
  While I don't think I could recreate exactly what was done, it
appears that various patterns around a point on the board are combined
(via a weighted sum with learned weights) to predict the probability
that a candidate move wins.  Given the temporal-difference nature of
the learning, this prediction may be bootstrapped from the estimated
winning rate after this color's previous move.  A sketch of my reading
follows below.
  In section 7 this evaluation is used as an initial guess at the
winning rate.  Its use in section 5 to select random moves appears to
be far inferior, and the authors conclude not to use it.
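
Here is a rough Python sketch of my reading of section 4.  Everything
in it is an assumption on my part; the binary features, the logistic
squash, and the TD(0)-style update are guesses at the flavor of the
method, not the paper's exact details:

    import math

    def q_rlgo(weights, features):
        # My guess at Q_RLGO: a weighted sum of pattern features phi
        # around the candidate move, squashed into a winning
        # probability.  Whether phi holds counts or 0/1 flags is one
        # of my questions below.
        total = sum(w * f for w, f in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-total))   # logistic squash

    def td0_update(weights, features, features_next, alpha=0.1):
        # TD(0)-flavored update: nudge the current prediction toward
        # the prediction one move later (at the end of the game,
        # v_next would instead be the actual 0/1 result).
        v = q_rlgo(weights, features)
        v_next = q_rlgo(weights, features_next)
        delta = v_next - v                 # TD error, no discounting
        grad = v * (1.0 - v)               # logistic derivative
        return [w + alpha * delta * grad * f
                for w, f in zip(weights, features)]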

Section 6.
  In addition to the standard UCT method of estimating move winning
rates (discussed in section 3), it introduces "all moves as first"
(AMAF) as an alternative: every move made by the player to move during
a simulation is counted as if it had been the first move of the game.
This gives rapid action-value estimates (RAVE), but they are biased.
The two estimates are blended, with the weight on each depending on the
number of real simulations through the node: beta = sqrt(k/(3n + k)),
where k is an "equivalence parameter" (MoGo used k = 1000).  Initially
100% of the RAVE value is used (beta = 1 at n = 0); beta falls to 50%
at n = 1000 sims and roughly 32% at n = 3000.  Other threads discuss
AMAF in detail and should be enough to implement this in other MC bots.
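
In code, the blending might look like the following sketch (assuming
the beta schedule above; the variable names are mine):

    import math

    def beta(n, k=1000):
        # Blending weight as I read the paper: sqrt(k / (3n + k)),
        # where k is the "equivalence parameter" (MoGo used
        # k = 1000, which gives beta = 1/2 when n = k).
        return math.sqrt(k / (3.0 * n + k))

    def q_ur(q_uct, q_rave, n):
        # Q_UR: weight the fast-but-biased RAVE estimate against
        # the slow-but-accurate UCT estimate; RAVE dominates early.
        b = beta(n)
        return b * q_rave + (1.0 - b) * q_uct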

The conclusion indicates that combining UCT_RAVE with heuristic winning
rates (using Q_RLGO as Q_PRIOR) is an effective combination: it
increased MoGo's winning rate against GnuGo 3.7.10 (level 8) from 24%
to 69%.
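
For the Q_PRIOR idea, here is a sketch of how a heuristic value can be
folded into the normal visit-count bookkeeping.  n_prior = 50 is a
placeholder I picked, not necessarily the value MoGo used:

    def make_node(q_prior, n_prior=50):
        # Treat the heuristic value (Q_RLGO here) as if it had come
        # from n_prior simulated games.
        return {"q": q_prior, "n": n_prior}

    def update_node(node, result):
        # Standard incremental mean; the prior is overwhelmed
        # naturally as real simulation results (0 or 1) accumulate.
        node["n"] += 1
        node["q"] += (result - node["q"]) / node["n"]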

Questions that I have:
* In section 4, how is phi initialized?  Does it contain the number of
matches on the board for every possible pattern?  Or is it a boolean
1 or 0 indicating whether the pattern is present?
* In section 4, what is considered local?  3x3 neighborhood?  For
translations of shapes, it seems like a larger local area could be used.

[1] http://www.machinelearning.org/proceedings/icml2007/papers/387.pdf
[2] http://arxiv.org/pdf/cs/9605103
[3] http://zaphod.aml.sztaki.hu/papers/ecml06.pdf
[4] http://www.cs.ualberta.ca/~sutton/papers/sutton-88.pdf

