On 03-12-17 17:57, Rémi Coulom wrote:
> They have a Q(s,a) term in their node-selection formula, but they
> don't tell what value they give to an action that has not yet been
> visited. Maybe Aja can tell us.
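For concreteness, the unspecified detail is the value Q(s,a) takes for a child with zero visits (a "first-play urgency", as other MCTS programs call it) inside the PUCT selection rule. A minimal sketch of that rule, with the FPU value exposed as an explicit parameter, might look like this; the fpu_q default of 0.0 is my own placeholder, not DeepMind's (unpublished) choice:

```python
import math

def select_action(node, c_puct=1.5, fpu_q=0.0):
    """Pick the child maximizing Q(s,a) + U(s,a), PUCT-style.

    node["children"] maps action -> {"n": visits, "w": total value,
    "p": prior from the policy net}. The value used for an unvisited
    action (fpu_q) is exactly the unpublished detail under discussion.
    """
    total_visits = sum(child["n"] for child in node["children"].values())
    best_action, best_score = None, -float("inf")
    for action, child in node["children"].items():
        # Q(s,a) is undefined at n = 0; substitute the assumed FPU value.
        q = child["w"] / child["n"] if child["n"] > 0 else fpu_q
        u = c_puct * child["p"] * math.sqrt(total_visits) / (1 + child["n"])
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```

Note how sensitive the search is to fpu_q: set it high and every leaf gets expanded before any child is revisited; set it low and the prior term dominates early selection. That is why the missing detail matters for reproduction.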
FWIW, I already asked Aja this exact question shortly after the paper came out, and he told me he cannot answer questions about unpublished details. This is not very promising for reproducibility, considering the AZ paper is even lighter on such details.

Another question that is up in the air is whether the choice of the number of playouts for the MCTS part represents an implicit balance between self-play speed and training speed. This is particularly relevant if the evaluation step is removed. But it's possible even DeepMind doesn't know the answer for sure: they had a setup, and they optimized it. It's not clear which parts generalize. (Usually one wonders about such things in terms of algorithms, but here one wonders about them in terms of hardware!)

-- 
GCP
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go