I also use an online learning algorithm in RLGO to adjust feature
weights during the game. I use around a million features (all
possible patterns from 1x1 up to 3x3 at all locations on the board)
and update the weights online from simulated games using temporal
difference learning. I also use the sum of feature weights to
estimate the value of a move, rather than a multiplicative estimate.
The learning signal is only the win/loss outcome at the end of each
simulation, rather than the supervised signal that Remi uses. The results are
encouraging (currently ~1820 Elo on CGOS, based on 5000 simulations
per move) for a program that does not use UCT or Monte-Carlo Tree
Search in any way.
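To make this concrete, here is a rough sketch of the kind of update I
mean: TD(0) with a linear value function over binary pattern features,
where the value of a position is just the sum of the weights of its
active features and the only reward is the win/loss outcome at the end
of a simulated game. It is only an illustration, not RLGO's actual
code; the class name, step size and feature ids are placeholders (in
practice the feature ids would correspond to the local 1x1 up to 3x3
patterns at each board location).

    from collections import defaultdict

    class LinearTDLearner:
        """Linear value function over binary pattern features, trained by TD(0).

        The value of a position is the sum of the weights of its active
        features (an additive estimate), and the only reward is the win/loss
        outcome (1 or 0) observed at the end of each simulated game.
        """

        def __init__(self, alpha=0.002):
            self.weights = defaultdict(float)   # one weight per pattern feature
            self.alpha = alpha                  # step size

        def value(self, features):
            # Additive estimate: sum of the weights of the active features.
            return sum(self.weights[f] for f in features)

        def learn_from_simulation(self, feature_trajectory, won):
            """feature_trajectory: one set of active feature ids per position
            visited in a single simulated game; won: True iff the game was won."""
            for phi, phi_next in zip(feature_trajectory, feature_trajectory[1:]):
                # Intermediate reward is zero; bootstrap from the next position.
                self._update(phi, target=self.value(phi_next))
            # Terminal update: the final position is valued by the outcome alone.
            self._update(feature_trajectory[-1], target=1.0 if won else 0.0)

        def _update(self, features, target):
            delta = target - self.value(features)
            for f in features:
                self.weights[f] += self.alpha * delta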
Like Alvaro and Remi, I am convinced that online learning during
simulation has great potential. In fact, I would go even further, and
suggest that Monte-Carlo Tree Search and UCT are at one end of a
large spectrum of promising sample-based search algorithms. We don't
need to restrict ourselves to a tabular representation: state
abstraction (for example, using pattern features) provides a way to
generalise between positions and learn more efficiently. Nor do we
need to restrict our learning algorithm to Monte-Carlo: any online
reinforcement learning algorithm can be applied to learn from
simulated experience.
In classical game programming terms, we can think of this approach as
dynamically learning an evaluation function. Instead of learning a
single evaluation function that applies to all positions, we tailor
the evaluation function to the current position, by learning from
simulated experience. Rather than thinking of learning as an offline
procedure that takes months to train weights, let's think of it as an
online procedure that can run in a few seconds. Once we learn the
evaluation function, it can be used in any way we please - for
example in an alpha-beta search (RLGO 2.1 uses a full-width 5-ply search)
or as a heuristic function for UCT (current research in MoGo).
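As a sketch of the first option, a plain full-width, fixed-depth
alpha-beta (negamax) search can simply call the learned additive
evaluation at its leaves. The board interface below (legal_moves,
play, undo) is an assumed placeholder rather than RLGO 2.1's actual
code, and it assumes the learned value is expressed from the point of
view of the side to move.

    NEG_INF = float("-inf")

    def alpha_beta(board, depth, alpha, beta, learner, features_of):
        """Full-width, fixed-depth negamax search (e.g. depth=5 for a 5-ply
        search) that evaluates leaf positions with the evaluation function
        learned online: learner.value() is the sum of active feature weights."""
        if depth == 0:
            return learner.value(features_of(board))
        best = NEG_INF
        for move in board.legal_moves():     # in Go, pass is always available
            board.play(move)
            score = -alpha_beta(board, depth - 1, -beta, -alpha,
                                learner, features_of)
            board.undo(move)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:
                break                        # beta cut-off
        return best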
I would be very interested to hear more about Dimwit's approach, and
also Remi's experiments with online learning in CrazyStone.
-Dave
PS Apologies if you receive this twice, I had some difficulties posting.
Rémi:
John and I are planning a similar usage of patterns and features in
the MC simulations, although we were thinking of a very different
method to update "strengths". Even though we derived our formulas from an
extension of logistic regression rather than Bradley-Terry models,
we arrived at very similar expressions. The main difference is that
what we have been calling "score" seems to be the log of your
"strength", and instead of multiplying strengths when considering
"teams", we just add the scores of different features. We use an
online learning algorithm to adjust the scores during the game, which
severely limits the kinds of algorithms we can use, but also allows us
to learn things that apply specifically to the situations that arise
on the board in this game.
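To spell out the correspondence: if each feature's score is the log of
its strength (gamma), then multiplying the strengths of a "team" of
features is the same as exponentiating the sum of their scores, so
both formulations give the same move probabilities. Here is a small
sketch of that identity together with a simple online gradient step on
the scores; this is only an illustration of the general idea, not
dimwit's actual update rule, and the function names and learning rate
are made up.

    import math

    def move_probabilities(candidate_feature_sets, scores):
        """Softmax over additive scores == Bradley-Terry over multiplicative
        strengths, since strength(move) = exp(sum of its feature scores)."""
        strengths = [math.exp(sum(scores.get(f, 0.0) for f in feats))
                     for feats in candidate_feature_sets]
        total = sum(strengths)
        return [s / total for s in strengths]

    def online_score_update(candidate_feature_sets, chosen, scores, eta=0.01):
        """One stochastic-gradient step of softmax (logistic) regression:
        raise the scores of the chosen move's features and lower those of
        every candidate's features in proportion to its predicted probability."""
        probs = move_probabilities(candidate_feature_sets, scores)
        for i, feats in enumerate(candidate_feature_sets):
            grad = (1.0 if i == chosen else 0.0) - probs[i]
            for f in feats:
                scores[f] = scores.get(f, 0.0) + eta * grad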
Now we only have to make dimwit 400 points stronger to show the world
that our approach works. :)
There are many things in the paper that we had never thought of, like
considering the distance to the penultimate move. Most of those things
won't be directly applicable to dimwit, but they give us many
suggestions of things to try.
Thanks for this important contribution to computer go.
Álvaro.