After much effort, I think I understand most of the Gelly & Silver paper [1]. I'm hoping this post will help others, and that people will correct any errors I've made.
First, some basic definitions of notation:

* In general, "Q" is an estimated winning rate, used in three ways:
  1. As an estimated winning rate based on random games
     Q_UCT  - Always used in this way
     Q_RAVE - Section 6, part of UCT_RAVE with 0 or 1 parameters
     Q_UR   - Section 6, part of UCT_RAVE with 2 parameters
  2. As a heuristic winning rate
     Q_PRIOR - Section 7, initialized with the 2nd parameter of UCT_RAVE
     Q_RAVE  - Kind of, when UCT_RAVE has 2 parameters
     Q_RLGO  - Sections 7+
     Q_MoGo  - Always used in this way (don't confuse with pi_MoGo)
  3. For selecting moves in random games
     Q_RLGO - Unique to section 5 (the authors conclude not to do this)

* In general, "Q+" variants add an extra exploration factor to Q to create an upper confidence bound (the standard UCT term, sqrt(log(n_parent)/n_child)).

* "pi" represents a move-selection policy, used in two ways:
  1. Selecting a move in random games
     pi_random  - Select any random move
     pi_MoGo    - Random move selection based on earlier papers?
     pi_epsilon - Unique to section 5 (the authors conclude not to use this)
     pi_sigma   - Unique to section 5 (the authors conclude not to use this)
     pi_tau     - Unique to section 5 (the authors conclude not to use this)
  2. Selecting a child inside the UCT search tree
     pi_UCT - Used in the UCT algorithm
     pi_UR  - Used in the UCT_RAVE algorithm

Sections 1 & 2 can be skipped; notably, most of the notation in section 2 is unique to that section. For a more thorough overview of reinforcement learning, see "Reinforcement Learning: A Survey" [2] (reading through section 4.2 is more than enough).

Section 3 is UCT restated in notation common to reinforcement learning. Since I've already added other links, here's a link to the original paper on UCT [3].

Section 4. This was the toughest for me to interpret. I found the Sutton reference [4] to be both helpful background and an interesting read (reading through section 2.3 is enough).
While I don't think I could recreate exactly what was done, it appears that various patterns around a point on the board are combined (via a weighted sum with learned weights) to predict the probability that a candidate move wins. Given the temporal-difference nature, this may be based on the winning rate after the previous move of this color. In section 7 this evaluation is used as an initial guess at the winning rate. Its use in section 5 to select moves in random games seems to be far inferior and was abandoned by the authors.

Section 6. In addition to the standard UCT method of estimating move winning rates (discussed in section 3), it introduces "all moves as first" as an alternative: every position where the player to move played a given move is counted as if that move had been played first in the game. This gives rapid action-value estimates (RAVE), but they are inaccurate. The two estimates are averaged together, and how much weight each value gets depends on the number of plain simulations done. Initially 100% of the RAVE value is used, and its weight drops as the number of simulations through a node increases (at 1000 sims, beta = 50%; at 3000 sims, beta = 10%). Other threads discuss AMAF in detail and should be enough to implement this in other MC bots.

The conclusion indicates that combining UCT_RAVE with heuristic winning rates (using Q_RLGO as Q_PRIOR) is effective: it increased MoGo's winning rate against GnuGo 3.7.10 (level 8) from 24% to 69%.

Questions that I have:
* In section 4, how is phi initialized? Does it contain the number of matches on the board for every possible pattern? Or is it a boolean 1 or 0 indicating whether the pattern is present?
* In section 4, what is considered local? A 3x3 neighborhood? For translations of shapes, it seems like a larger local area could be used.
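To make my reading of section 4 concrete, here is a minimal sketch of what I think the Q_RLGO evaluation boils down to: a weighted sum of binary pattern features squashed into a winning probability. The feature extraction, the weights, and even whether phi is binary are exactly the things I'm unsure about above, so treat everything here as a placeholder, not the paper's actual method:

```python
import math

def q_rlgo(phi, weights):
    """Hypothetical sketch of a linear pattern evaluation.

    phi     : feature vector (assumed binary: 1 if a pattern is present, 0 if not)
    weights : learned weights (in the paper, trained by temporal-difference learning)

    Returns an estimated winning probability in (0, 1) by squashing the
    weighted sum through a logistic function.
    """
    s = sum(w * f for w, f in zip(weights, phi))
    return 1.0 / (1.0 + math.exp(-s))
```

For example, with all-zero features the weighted sum is 0 and the estimate is a neutral 0.5.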
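And here is a sketch of how I understand the section-6 blending inside the tree. I believe the paper's schedule is beta(n) = sqrt(k/(3n+k)), where k is an "equivalence parameter" (beta = 1/2 when n == k), but treat the exact form as an assumption; the node field names here are also made up for illustration:

```python
import math

def beta(n, k=1000):
    """RAVE weight schedule (assumed form): starts at 1.0 when n == 0 and
    decays as simulations accumulate; beta == 0.5 when n == k."""
    return math.sqrt(k / (3 * n + k))

def q_ur(node):
    """Blend the fast-but-biased RAVE estimate with the slow-but-unbiased
    UCT estimate. Fields n_uct, q_rave, q_uct are hypothetical names."""
    b = beta(node.n_uct)
    return b * node.q_rave + (1 - b) * node.q_uct

def select_child(parent):
    """pi_UR-style selection: blended value plus the standard UCT
    exploration bonus sqrt(log(n_parent) / n_child)."""
    return max(
        parent.children,
        key=lambda c: q_ur(c) + math.sqrt(math.log(parent.n_uct) / c.n_uct),
    )
```

With this schedule a fresh node relies entirely on RAVE (beta = 1.0 at n = 0), and the plain Q_UCT estimate gradually takes over as simulations accumulate.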
[1] http://www.machinelearning.org/proceedings/icml2007/papers/387.pdf
[2] http://arxiv.org/pdf/cs/9605103
[3] http://zaphod.aml.sztaki.hu/papers/ecml06.pdf
[4] http://www.cs.ualberta.ca/~sutton/papers/sutton-88.pdf

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/