In RLGO I also use an online learning algorithm to adjust feature weights during the game. I use around a million features (all possible patterns from 1x1 up to 3x3, at every location on the board) and update the weights online from simulated games using temporal-difference learning. I use the sum of feature weights to estimate the value of a move, rather than a multiplicative estimate. The learning signal is only the win/lose outcome at the end of each simulation, rather than a supervised signal as in Rémi's approach. The results are encouraging (currently ~1820 Elo on CGOS, based on 5000 simulations per move) for a program that does not use UCT or Monte-Carlo Tree Search in any way.
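
To make this concrete, here is a rough sketch of the kind of update I mean: TD(0) with binary pattern features and an additive value estimate. It is purely illustrative (Python, not RLGO's actual code), and the step size and the way features are extracted from a position are placeholder assumptions.

    # Illustrative sketch only, not RLGO's actual implementation.
    from collections import defaultdict

    ALPHA = 0.01                    # step size (placeholder value)
    weights = defaultdict(float)    # one weight per local pattern feature

    def value(features):
        """Additive value estimate: the sum of the active feature weights."""
        return sum(weights[f] for f in features)

    def td_update(prev_features, next_features, outcome=None):
        """One TD(0) step.  'outcome' is the win/lose signal (1 or 0) that
        replaces the bootstrap target at the end of a simulated game."""
        target = outcome if outcome is not None else value(next_features)
        delta = target - value(prev_features)
        for f in prev_features:     # binary features, so the gradient is 1
            weights[f] += ALPHA * delta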

Like Álvaro and Rémi, I am convinced that online learning during simulation has great potential. In fact, I would go even further, and suggest that Monte-Carlo Tree Search and UCT are at one end of a large spectrum of promising sample-based search algorithms. We don't need to restrict ourselves to a tabular representation: state abstraction (for example, using pattern features) provides a way to generalise between positions and learn more efficiently. Nor do we need to restrict our learning algorithm to Monte-Carlo: any online reinforcement learning algorithm can be applied to learn from simulated experience.

In classical game programming terms, we can think of this approach as dynamically learning an evaluation function. Instead of learning a single evaluation function that applies to all positions, we tailor the evaluation function to the current position by learning from simulated experience. Rather than thinking of learning as an offline procedure that takes months to train weights, let's think of it as an online procedure that can run in a few seconds. Once the evaluation function has been learned, it can be used in any way we please: for example, in an alpha-beta search (RLGO 2.1 uses a full-width 5-ply search) or as a heuristic function for UCT (current research in MoGo).
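
For instance, the learned weights can be plugged straight into the leaf evaluation of a fixed-depth alpha-beta search. The sketch below is again only illustrative and is not RLGO 2.1's search code: value() is the additive evaluation from the sketch above, legal_moves(), play() and features_of() are assumed helpers, and value() is assumed to be expressed from the side to move's point of view.

    # Illustrative negamax alpha-beta using the online-learned evaluation
    # at the leaves; not RLGO 2.1's actual search.
    def alpha_beta(pos, depth, alpha, beta):
        if depth == 0:
            return value(features_of(pos))       # learned evaluation
        best = -float('inf')
        for move in legal_moves(pos):
            score = -alpha_beta(play(pos, move), depth - 1, -beta, -alpha)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:                    # beta cutoff
                break
        return best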

I would be very interested to hear more about Dimwit's approach, and about Rémi's experiments with online learning in CrazyStone.

-Dave

PS: Apologies if you receive this twice; I had some difficulties posting.

Rémi:

John and I are planning a similar use of patterns and features in
the MC simulations, although we were thinking of a very different
method for updating the "strengths". We derived our formulas from an
extension of logistic regression rather than from Bradley-Terry
models, yet we arrived at very similar expressions. The main
difference is that what we have been calling the "score" seems to be
the log of your "strength": instead of multiplying strengths when
considering "teams", we just add the scores of the different
features. We use an online learning algorithm to adjust the scores
during the game, which severely limits the kinds of algorithms we can
use, but also allows us to learn things that apply specifically to
the situations that arise on the board in this game.
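
To be concrete about that correspondence: if score(f) = log(strength(f)),
then multiplying the strengths of a move's features and adding its scores
give the same move-selection probabilities once you normalise over the
legal moves. A toy illustration (not dimwit's actual code; the dictionaries
of moves, features and parameters are made up):

    # Toy illustration of additive scores vs. multiplicative strengths.
    import math

    def move_probs_multiplicative(moves, strength):
        """Bradley-Terry style: a move's 'team' strength is the product of
        the strengths of its features; normalise over the legal moves."""
        raw = {m: math.prod(strength[f] for f in feats)
               for m, feats in moves.items()}
        total = sum(raw.values())
        return {m: r / total for m, r in raw.items()}

    def move_probs_additive(moves, score):
        """Logistic-regression style: add the feature scores, exponentiate,
        then normalise (a softmax over the legal moves)."""
        raw = {m: math.exp(sum(score[f] for f in feats))
               for m, feats in moves.items()}
        total = sum(raw.values())
        return {m: r / total for m, r in raw.items()}

With score[f] = math.log(strength[f]) for every feature, the two
functions return identical probabilities.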

Now we only have to make dimwit 400 points stronger to show the world
that our approach works. :)

There are many things in the paper that we had never thought of,
such as considering the distance to the penultimate move. Most of
those things won't be directly applicable to dimwit, but the paper
gives us many suggestions of things to try.

Thanks for this important contribution to computer go.


Álvaro.
