Congratulations, people at DeepMind! Your paper is very interesting to read.
I have a question about the paper. On policy network training it says:

> On the first pass through the training pipeline, the baseline was set to
> zero; on the second pass we used the value network vθ(s) as a baseline;

but I cannot find any other description of this "second pass". What is it?
Since it uses vθ(s), it must at least come after vθ(s) has been trained. Is
it that, after completing the whole training pipeline depicted in Fig. 1,
only the RL policy network training part is repeated? Or is the training of
vθ(s) repeated as well? Is the second pass the last pass, or are there more
passes? Sorry if I just missed the relevant part of the paper.

2016-02-13 12:21 GMT+09:00 John Tromp <john.tr...@gmail.com>:
> On Wed, Jan 27, 2016 at 1:46 PM, Aja Huang <ajahu...@google.com> wrote:
> > We are very excited to announce that our Go program, AlphaGo, has beaten a
> > professional player for the first time. AlphaGo beat the European champion
> > Fan Hui by 5 games to 0.
>
> It's interesting to go back nearly a decade and read this 2007 article:
>
> http://spectrum.ieee.org/computing/software/cracking-go
>
> where Feng-Hsiung Hsu, Deep Blue's lead developer, made this prediction:
>
> "Nevertheless, I believe that a world-champion-level Go machine can be
> built within 10 years"
>
> Which now appears to be spot on. March 9 cannot come soon enough...
> The remainder of his prediction rings less true though:
>
> ", based on the same method of intensive analysis—brute force,
> basically—that Deep Blue employed for chess".
>
> regards,
> -John
> _______________________________________________
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
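For anyone less familiar with the baseline being discussed: the quoted passage refers to the standard REINFORCE gradient, (z - b) * grad log pi(a|s), where the baseline b reduces variance without biasing the update (b = 0 on the first pass, b = vθ(s) on the second). Below is a minimal toy sketch of that update for a softmax policy over three hypothetical moves; the policy, state value, and outcome here are made-up illustrative numbers, not anything from the paper:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_grad(logits, action, outcome, baseline):
    """Gradient of (outcome - baseline) * log pi(action) w.r.t. the logits.

    outcome is the game result z (+1 win / -1 loss); baseline is b.
    """
    probs = softmax(logits)
    grad_logp = -probs
    grad_logp[action] += 1.0          # d log pi(action) / d logits
    return (outcome - baseline) * grad_logp

logits = np.zeros(3)                  # toy uniform policy over 3 moves

# "First pass": baseline set to zero.
g_first = reinforce_grad(logits, action=1, outcome=+1.0, baseline=0.0)

# "Second pass": baseline from a (hypothetical) value estimate v(s) = 0.6.
g_second = reinforce_grad(logits, action=1, outcome=+1.0, baseline=0.6)

# The value baseline scales the update by (z - v(s)) instead of z,
# shrinking its magnitude (and, in expectation, its variance) while
# leaving the expected gradient direction unchanged.
print(np.linalg.norm(g_second) < np.linalg.norm(g_first))  # True: |1-0.6| < |1|
```

This is only meant to make the role of the baseline concrete; the question of *when* in the pipeline the second pass happens is exactly what remains unclear from the paper.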