I also use an online learning algorithm in RLGO to adjust feature
weights during the game. I use around a million features (all possible
patterns from 1x1 up to 3x3, at all locations on the board) and update
the weights online from simulated games using temporal difference
learning. I use the sum of the active feature weights to estimate the
value of a move, rather than a multiplicative combination. The learning
signal is just win/lose at the end of each simulation, rather than
supervised learning from expert games as in Remi's approach. The results
are encouraging (currently ~1820 Elo on CGOS, based on 5000 simulations
per move) for a program that does not use UCT or Monte-Carlo Tree Search
in any way.
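The description above amounts to a linear value function over binary pattern features, trained by TD(0) from a terminal win/lose signal. Here is a minimal sketch of that update, assuming a set-of-active-features representation; the function and parameter names are my own illustration, not RLGO's actual code:

```python
def td_update(weights, active_features, next_active_features, outcome,
              alpha=0.1, terminal=False):
    """One TD(0) step for a linear evaluation V(s) = sum of the weights
    of the active (matched) pattern features.

    `outcome` is 1.0 for a win and 0.0 for a loss, and is only used at
    the terminal step of a simulated game; intermediate steps bootstrap
    from the value of the next position.
    """
    v = sum(weights.get(f, 0.0) for f in active_features)
    if terminal:
        target = outcome
    else:
        target = sum(weights.get(f, 0.0) for f in next_active_features)
    delta = target - v
    # The gradient of a linear value function w.r.t. each active binary
    # feature is 1, so every matched feature gets the same correction.
    for f in active_features:
        weights[f] = weights.get(f, 0.0) + alpha * delta
    return delta
```

With a million local patterns only a handful are active in any position, so each update touches just the matched features, which is what makes learning during the game cheap.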

This is impressive! Does your program use an alpha/beta tree search?
I'm not clear on how it selects a move.


During simulation, I use an epsilon-greedy policy (i.e. a 1-ply greedy search, with the occasional random move thrown in for exploration). This updates the evaluation function online to reflect the current situation. For actual move selection, I use a 5-ply alpha-beta search (with no exploration), using the updated evaluation function. Interestingly, the alpha-beta search doesn't help as much as I expected, and 2-3 ply searches actually hurt performance.
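The two policies above can be sketched as follows. This is only an illustration under stated assumptions: the state/move interfaces (`legal_moves`, `apply_move`, `evaluate`) are hypothetical stand-ins for the real engine, and `evaluate` is assumed to score a position from the perspective of the side to move:

```python
import random

def epsilon_greedy_move(moves, value_fn, epsilon=0.1, rng=random):
    """Simulation policy: pick the 1-ply greedy move under the current
    evaluation, but with probability epsilon play a uniformly random
    move for exploration."""
    moves = list(moves)
    if rng.random() < epsilon:
        return rng.choice(moves)
    return max(moves, key=value_fn)

def alphabeta(state, depth, alpha, beta, evaluate, legal_moves, apply_move):
    """Plain negamax alpha-beta to a fixed depth (e.g. 5 ply for actual
    move selection), evaluating leaves with the online-updated
    evaluation function."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    best = float('-inf')
    for m in moves:
        # Negamax: the child's value is from the opponent's perspective.
        score = -alphabeta(apply_move(state, m), depth - 1,
                           -beta, -alpha, evaluate, legal_moves, apply_move)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # prune: the opponent will avoid this line
    return best
```

Note the exploration only appears in the simulation policy; the root search is purely greedy with respect to the learned evaluation.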


_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/