On 12/02/2017 5:44, Álvaro Begué wrote:

> I thought about this for about an hour this morning, and this is what I
> came up with. You could make a database of positions with a label
> indicating the result (perhaps from real games, perhaps similarly to how
> AlphaGo's value network was trained). Loop over the positions, run a few
> playouts, and tweak the move probabilities by some sort of reinforcement
> learning: promote the move choices from playouts whose outcome matches
> the label, and discourage the move choices from playouts whose outcome
> does not.
> 
> The point is that we would be pushing our playout policy to produce good
> estimates of the result of the game, which in the end is what playout
> policies are for.
> 
> Any thoughts? Did anyone actually try something like this?

This is how Facebook trained the playout policy of Darkforest. I
couldn't tell from the paper, but inspecting the code shows exactly this
algorithm at work.
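
In case a concrete version helps, here is a rough Python sketch of that
update (not the Darkforest code; the toy "game", move set, and learning
rate below are all invented, and only the reinforce-when-the-playout-
agrees-with-the-label rule follows the description above):

import math
import random

# Toy stand-in for the real setting: a fixed move set and a softmax
# policy with one weight per move. In a real playout policy the weights
# would sit on move features (patterns, tactics), not on raw moves.
MOVES = [0, 1, 2]
LEARNING_RATE = 0.1

def softmax_probs(weights):
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def playout(weights, length=5):
    # Sample a sequence of moves from the current softmax policy.
    probs = softmax_probs(weights)
    return [random.choices(MOVES, probs)[0] for _ in range(length)]

def outcome(moves, winning_move):
    # Toy "game result": +1 if the playout mostly picked the winning
    # move for this position, -1 otherwise.
    return 1 if sum(m == winning_move for m in moves) > len(moves) / 2 else -1

def reinforce_update(weights, moves, sign):
    # REINFORCE-style step: push the log-probability of every chosen
    # move up (sign=+1, playout agreed with the label) or down (sign=-1).
    for m in moves:
        probs = softmax_probs(weights)
        for a in MOVES:
            grad = (1.0 if a == m else 0.0) - probs[a]  # d log pi(m) / d w_a
            weights[a] += LEARNING_RATE * sign * grad

# Hypothetical labelled database: each entry is (winning_move, label),
# i.e. a "position" whose known result is label=+1 when play follows
# winning_move.
database = [(0, +1)] * 200

weights = [0.0 for _ in MOVES]
for winning_move, label in database:
    moves = playout(weights)
    sign = +1 if outcome(moves, winning_move) == label else -1
    reinforce_update(weights, moves, sign)

print("learned move probabilities:", softmax_probs(weights))

After a couple of hundred labelled positions the policy concentrates on
the move that makes playout results match the labels, which is exactly
the effect Alvaro is after.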

-- 
GCP