Reinforcement learning makes the policy stronger because it makes it more deterministic. The greedy policy, which always plays the most probable move, is way stronger than sampling from the probability distribution.
Rémi

----- Original message -----
From: "Detlef Schmicker" <[email protected]>
To: [email protected]
Sent: Sunday, 11 December 2016 11:38:08
Subject: [Computer-go] Some experiences with CNN trained on moves by the winning player

I want to share some experience training my policy CNN:

I wondered why reinforcement learning was so helpful, so I trained on the GoGoD database using only the moves played by the winner of each game.

Interestingly, the prediction rate for these moves was slightly higher (without training, just taking the previously trained network) than when taking into account the moves by both players (53% against 52%).

Training on the winning player's moves did not help a lot: I got a statistically significant improvement of about 20-30 Elo.

So I still don't understand why reinforcement learning should give around 100-200 Elo :)

Detlef
_______________________________________________
Computer-go mailing list
[email protected]
http://computer-go.org/mailman/listinfo/computer-go
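
To make the greedy-versus-distribution point from the reply concrete, here is a minimal sketch in Python. It is only an illustration: the logits are made-up numbers, the function names are hypothetical, and the temperature parameter is an assumption used to show what "more deterministic" means. It is not taken from either author's program.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw move scores into a probability distribution.

    A lower temperature concentrates the mass on the top-scoring move.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_move(probs):
    """Deterministic policy: always play the most probable move."""
    return max(range(len(probs)), key=probs.__getitem__)


def sampled_move(probs, rng):
    """Stochastic policy: draw a move according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def entropy(probs):
    """Shannon entropy in nats; lower means more deterministic."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical policy-network outputs for four candidate moves.
logits = [2.0, 1.0, 0.5, 0.0]
probs = softmax(logits)                    # the plain distribution
sharp = softmax(logits, temperature=0.25)  # a sharpened version

rng = random.Random(0)
samples = [sampled_move(probs, rng) for _ in range(1000)]
best = greedy_move(probs)

print("greedy move:", best)
print("share of sampled moves that match it:", samples.count(best) / len(samples))
print("entropy at T=1:", entropy(probs))
print("entropy at T=0.25:", entropy(sharp))
```

Sampling from the plain distribution plays lower-ranked moves a substantial fraction of the time, while the greedy policy never does. The sharpened distribution sits in between, and its lower entropy is one way to quantify how deterministic a policy has become.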
