Congratulations to the team at DeepMind! Your paper is very interesting to read.

I have a question about the paper. In the section on policy network training, it says:

> On the first pass through the training pipeline, the baseline was set to
> zero; on the second pass we used the value network vθ(s) as a baseline;

but I cannot find any other description of the "second pass". What is it?
It uses vθ(s), so it must at least come after vθ(s) has been trained. Is
it that, after completing the whole training pipeline depicted in Fig. 1,
only the RL policy network training stage is repeated? Or is the training
of vθ(s) repeated as well? Is the second pass the last pass, or are there
more passes? Sorry if I just missed the relevant part of the paper.
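
To make my reading concrete, here is a toy sketch of the REINFORCE update
with a baseline, Δθ ∝ ∇θ log p(a|s) · (z − b), as I understand it; the
softmax policy and the constant standing in for vθ(s) below are my own
illustration, not anything from the paper:

import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
theta = np.zeros(n_actions)  # logits of a toy softmax "policy network"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def play_episode(theta):
    # Toy stand-in for a self-play game: one action, outcome z in {-1, +1}.
    a = rng.choice(n_actions, p=softmax(theta))
    z = 1.0 if a == 0 else -1.0  # pretend action 0 always wins
    return a, z

def reinforce_step(theta, baseline, lr=0.1):
    a, z = play_episode(theta)
    p = softmax(theta)
    grad_log = -p                # d log p(a|theta) / d theta = onehot(a) - p
    grad_log[a] += 1.0
    return theta + lr * grad_log * (z - baseline)

# "First pass": zero baseline.
for _ in range(200):
    theta = reinforce_step(theta, baseline=0.0)

# "Second pass": a value estimate as the baseline (here just a constant
# standing in for v_theta(s)).
for _ in range(200):
    theta = reinforce_step(theta, baseline=0.5)

print(softmax(theta))  # probability mass should concentrate on action 0

If I read it right, the baseline only changes the variance of the gradient
estimate, not its expectation, which is why swapping in vθ(s) on a later
pass is safe.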


2016-02-13 12:21 GMT+09:00 John Tromp <john.tr...@gmail.com>:

> On Wed, Jan 27, 2016 at 1:46 PM, Aja Huang <ajahu...@google.com> wrote:
> > We are very excited to announce that our Go program, AlphaGo, has
> > beaten a professional player for the first time. AlphaGo beat the
> > European champion Fan Hui by 5 games to 0.
>
> It's interesting to go back nearly a decade and read this 2007 article:
>
> http://spectrum.ieee.org/computing/software/cracking-go
>
> where Feng-Hsiung Hsu, Deep Blue's lead developer, made this prediction:
>
> "Nevertheless, I believe that a world-champion-level Go machine can be
> built within 10 years"
>
> Which now appears to be spot on. March 9 cannot come soon enough...
> The remainder of his prediction rings less true though:
>
> ", based on the same method of intensive analysis—brute force,
> basically—that Deep Blue employed for chess".
>
> regards,
> -John
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
