I'm all for a learning policy, if you can figure out how to do it. :-)

Peter Drake
http://www.lclark.edu/~drake/



On Jan 25, 2011, at 11:31 AM, Aja wrote:

Hi Professor Drake,

I will try with more playouts. Thanks for the reminder.

Here is an example to illustrate my view that the default policy should also be included in the learning. Suppose there are several decisive life-and-death or semeai situations in a position; the tree search cannot reach and clarify every one of them.

In this example, Black's L2 and L4 cause White to play L3 to capture by default policy (which is completely bad). Black may then learn quickly, via "last good reply", to atari immediately and kill White's whole group to win. The problem is that White cannot learn the correct answer, H1 or H2, because its reply is fixed by the default policy.

In the playouts, the configuration of such a big semeai might be very similar each time. Such evaluation bias is exactly the kind of issue we can fix by learning. By considering probability, I can fix this problem by increasing the probability of the "last good reply" H1 or H2, without the tree's aid.
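To make this concrete, here is a rough sketch of what I mean by considering probability (the names and numbers are only illustrative, not Erica's actual playout code). The reply table keeps a weight for each (previous move, reply) pair instead of a single stored move, and the playout samples a reply from those weights, so a winning reply such as H1 or H2 can gradually overtake a bad default-policy move:

    import random
    from collections import defaultdict

    class ProbabilisticReplyTable:
        # Probability-weighted "last good reply" table (illustrative sketch).

        def __init__(self, boost=1.0, penalty=0.5, base=1.0):
            # weights[prev_move][reply] -> accumulated weight
            self.weights = defaultdict(lambda: defaultdict(float))
            self.boost = boost      # added to a reply after a won playout
            self.penalty = penalty  # decay factor applied after a lost playout
            self.base = base        # prior weight kept by the default-policy move

        def choose_reply(self, prev_move, default_move, legal_moves):
            # The default-policy move always keeps a base weight, so it is
            # weakened rather than excluded; learned replies compete with it.
            candidates = {default_move: self.base}
            for reply, weight in self.weights[prev_move].items():
                if reply in legal_moves:
                    candidates[reply] = candidates.get(reply, 0.0) + weight
            r = random.uniform(0.0, sum(candidates.values()))
            for move, weight in candidates.items():
                r -= weight
                if r <= 0.0:
                    return move
            return default_move

        def update(self, prev_move, reply, won):
            # Reinforce replies from won playouts, decay them otherwise.
            if won:
                self.weights[prev_move][reply] += self.boost
            else:
                self.weights[prev_move][reply] *= self.penalty

With something like this, White's weight on H1 or H2 grows every time a playout through them is won, even though the default policy keeps suggesting the capture.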

Every program's playout implementation is more or less different, but I think excluding the default policy from the learning might limit the full power of "last good reply".

Aja

----- Original Message -----
From: Peter Drake
To: computer-go@dvandva.org
Sent: Wednesday, January 26, 2011 2:27 AM
Subject: Re: [Computer-go] The heuristic "last good reply"

On Jan 25, 2011, at 10:19 AM, Aja wrote:

Dear all,

Today I tried Professor Drake's "last good reply" in Erica. So far, I have gotten at most 20-30 Elo from it.

I tested by self-play, with 3000 playouts/move on 19x19. That number of playouts might be too small, but I would like to test with more playouts only IF the playing strength is not weaker at 3000 playouts.

Yes -- the smallest experiments in the paper were with 8k playouts per move. There may not be time to fill up the reply tables with only 3k.

From these preliminary experiments with 3000 playouts, I have some observations:

1. In Erica, it's better to consider probability for this heuristic.

2. In Prof. Drake's implementation, there is a weakness in the learning. I think the main problem is that for a reply which is played deterministically by the default policy, there is no room to learn a new reply. For example, if "save by capture" produces a lost game, then in the next simulation it will still play "save by capture" by default policy. If I am wrong on this point, I am glad to be corrected by anyone.

This is true, but only if the previous move (or previous two moves) comes up again in exactly the same board configuration. When the configuration is exactly the same, we are probably still in the search tree, which overrides the policy. If we are beyond the tree, the configuration is almost always different.
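For reference, the move-selection order we are both describing looks roughly like this (an illustrative sketch with made-up names, not any particular program's code): the playout consults the reply table first and falls back to the default policy only when nothing usable is stored, which is where a deterministic answer such as "save by capture" keeps being replayed:

    def select_playout_move(prev_move, reply_table, board, default_policy):
        # 1. Play the stored reply to the previous move, if any, and if it
        #    is still legal on this board.
        reply = reply_table.get(prev_move)
        if reply is not None and board.is_legal(reply):
            return reply
        # 2. Otherwise fall back to the default policy, which may answer
        #    prev_move deterministically (e.g. always "save by capture").
        return default_policy(board, prev_move)

    def update_reply_table(reply_table, move_pairs, won):
        # After the playout, remember each (previous move, reply) pair from
        # a won game; one variant also forgets pairs just seen in a loss.
        for prev_move, reply in move_pairs:
            if won:
                reply_table[prev_move] = reply
            elif reply_table.get(prev_move) == reply:
                del reply_table[prev_move]

A different reply can enter the table only if it actually gets played and the playout is won, which in practice means the tree (or some randomness in the default policy) has to produce it first.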

3. This heuristic has the potential to perform better in Erica. I hope this brief result will encourage other authors to try it.

It's reassuring to see that you got some strength improvement out of it!

Thanks,

Peter Drake
http://www.lclark.edu/~drake/





_______________________________________________
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
<default_policy.sgf>
