In RLGO I also use an online learning algorithm to adjust feature weights during the game. I use around a million features (all possible patterns from 1x1 up to 3x3, at every location on the board) and update the weights online from simulated games using temporal-difference learning. I use the sum of feature weights to estimate the value of a move, rather than a multiplicative estimate. The learning signal is only the win/lose outcome at the end of each simulation, rather than a supervised signal as in Rémi's approach. The results are encouraging (currently ~1820 Elo on CGOS, based on 5000 simulations per move) for a program that does not use UCT or Monte-Carlo Tree Search in any way.
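
To make this concrete, here is a rough sketch of the kind of update I mean: TD(0) with binary pattern features and an additive value estimate. It is purely illustrative (Python, not RLGO's actual code), and the step size and the way features are extracted from a position are placeholder assumptions.

    # Illustrative sketch only, not RLGO's actual implementation.
    from collections import defaultdict

    ALPHA = 0.01                    # step size (placeholder value)
    weights = defaultdict(float)    # one weight per local pattern feature

    def value(features):
        """Additive value estimate: the sum of the active feature weights."""
        return sum(weights[f] for f in features)

    def td_update(prev_features, next_features, outcome=None):
        """One TD(0) step.  'outcome' is the win/lose signal (1 or 0) that
        replaces the bootstrap target at the end of a simulated game."""
        target = outcome if outcome is not None else value(next_features)
        delta = target - value(prev_features)
        for f in prev_features:     # binary features, so the gradient is 1
            weights[f] += ALPHA * delta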

Like Álvaro and Rémi, I am convinced that online learning during simulation has great potential. In fact, I would go even further, and suggest that Monte-Carlo Tree Search and UCT are at one end of a large spectrum of promising sample-based search algorithms. We don't need to restrict ourselves to a tabular representation: state abstraction (for example, using pattern features) provides a way to generalise between positions and learn more efficiently. Nor do we need to restrict our learning algorithm to Monte-Carlo: any online reinforcement learning algorithm can be applied to learn from simulated experience.

In classical game programming terms, we can think of this approach as dynamically learning an evaluation function. Instead of learning a single evaluation function that applies to all positions, we tailor the evaluation function to the current position by learning from simulated experience. Rather than thinking of learning as an offline procedure that takes months to train weights, let's think of it as an online procedure that can run in a few seconds. Once the evaluation function has been learned, it can be used in any way we please: for example, in an alpha-beta search (RLGO 2.1 uses a full-width 5-ply search) or as a heuristic function for UCT (current research in MoGo).
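
For instance, the learned weights can be plugged straight into the leaf evaluation of a fixed-depth alpha-beta search. The sketch below is again only illustrative and is not RLGO 2.1's search code: value() is the additive evaluation from the sketch above, legal_moves(), play() and features_of() are assumed helpers, and value() is assumed to be expressed from the side to move's point of view.

    # Illustrative negamax alpha-beta using the online-learned evaluation
    # at the leaves; not RLGO 2.1's actual search.
    def alpha_beta(pos, depth, alpha, beta):
        if depth == 0:
            return value(features_of(pos))       # learned evaluation
        best = -float('inf')
        for move in legal_moves(pos):
            score = -alpha_beta(play(pos, move), depth - 1, -beta, -alpha)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:                    # beta cutoff
                break
        return best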

I would be very interested to hear more about Dimwit's approach, and about Rémi's experiments with online learning in CrazyStone.

-Dave

PS: Apologies if you receive this twice; I had some difficulties posting.

Rémi:

John and I are planning a similar use of patterns and features in
the MC simulations, although we were thinking of a very different
method for updating the "strengths". We derived our formulas from an
extension of logistic regression rather than from Bradley-Terry
models, yet we arrived at very similar expressions. The main
difference is that what we have been calling the "score" seems to be
the log of your "strength": instead of multiplying strengths when
considering "teams", we just add the scores of the different
features. We use an online learning algorithm to adjust the scores
during the game, which severely limits the kinds of algorithms we can
use, but also allows us to learn things that apply specifically to
the situations that arise on the board in this game.
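
To be concrete about that correspondence: if score(f) = log(strength(f)),
then multiplying the strengths of a move's features and adding its scores
give the same move-selection probabilities once you normalise over the
legal moves. A toy illustration (not dimwit's actual code; the dictionaries
of moves, features and parameters are made up):

    # Toy illustration of additive scores vs. multiplicative strengths.
    import math

    def move_probs_multiplicative(moves, strength):
        """Bradley-Terry style: a move's 'team' strength is the product of
        the strengths of its features; normalise over the legal moves."""
        raw = {m: math.prod(strength[f] for f in feats)
               for m, feats in moves.items()}
        total = sum(raw.values())
        return {m: r / total for m, r in raw.items()}

    def move_probs_additive(moves, score):
        """Logistic-regression style: add the feature scores, exponentiate,
        then normalise (a softmax over the legal moves)."""
        raw = {m: math.exp(sum(score[f] for f in feats))
               for m, feats in moves.items()}
        total = sum(raw.values())
        return {m: r / total for m, r in raw.items()}

With score[f] = math.log(strength[f]) for every feature, the two
functions return identical probabilities.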

Now we only have to make dimwit 400 points stronger to show the world
that our approach works. :)

There are many things in the paper that we had never thought of,
such as considering the distance to the penultimate move. Most of
those things won't be directly applicable to dimwit, but the paper
gives us many suggestions of things to try.

Thanks for this important contribution to computer go.


Álvaro.
