On 12/02/2017 5:44, Álvaro Begué wrote:
> I thought about this for about an hour this morning, and this is what I
> came up with. You could make a database of positions with a label
> indicating the result (perhaps from real games, perhaps similarly to how
> AlphaGo trained their value network). Loop over the positions, run a few
> playouts, and tweak the move probabilities by some sort of reinforcement
> learning, where you promote the move choices from playouts whose outcome
> matches the label and discourage the move choices from playouts whose
> outcome does not match the label.
>
> The point is that we would be pushing our playout policy to produce good
> estimates of the result of the game, which in the end is what playout
> policies are for.
>
> Any thoughts? Did anyone actually try something like this?
This is how Facebook trained the playout policy of Darkforest. I couldn't
tell from the paper, but inspecting the code shows exactly this algorithm
at work.

-- 
GCP
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
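
For what it's worth, the update Álvaro describes can be sketched as a REINFORCE-style rule: sample playouts from the current policy, then nudge the log-probabilities of the chosen moves up when the playout outcome matches the position's label and down when it doesn't. The sketch below is a toy illustration only, not the Darkforest implementation; the "position" with three moves, the `playout_outcome` rule, and all other names are hypothetical stand-ins.

```python
import math
import random

random.seed(0)

# Toy "position": three legal moves, policy is a softmax over one weight
# per move (a stand-in for a real feature-based playout policy).
MOVES = ["A", "B", "C"]
weights = {m: 0.0 for m in MOVES}

def policy_probs():
    z = {m: math.exp(weights[m]) for m in MOVES}
    s = sum(z.values())
    return {m: z[m] / s for m in MOVES}

def sample_move():
    r = random.random()
    acc = 0.0
    probs = policy_probs()
    for m in MOVES:
        acc += probs[m]
        if r < acc:
            return m
    return MOVES[-1]

def playout_outcome(moves_chosen):
    # Hypothetical game rule: playouts dominated by move "A" win.
    return 1 if moves_chosen.count("A") >= len(moves_chosen) / 2 else 0

LABEL = 1     # database label for this position: a win
ALPHA = 0.1   # learning rate

for _ in range(2000):
    chosen = [sample_move() for _ in range(5)]  # one short playout
    outcome = playout_outcome(chosen)
    # Promote the playout's move choices when its outcome matches the
    # label, discourage them when it does not (REINFORCE-style gradient
    # of the softmax log-probability).
    sign = 1.0 if outcome == LABEL else -1.0
    probs = policy_probs()
    for mv in chosen:
        for m in MOVES:
            grad = (1.0 if m == mv else 0.0) - probs[m]
            weights[m] += ALPHA * sign * grad

print(policy_probs())
```

After training, the policy concentrates its probability on the move that makes playout outcomes agree with the label, which is exactly the "push the playout policy toward good result estimates" goal from the quoted message.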