I also expected bootstrapping by self-play. (I even wrote a post to that 
effect.) But of course, DeepMind actually DID IT.

But I didn't envision any of the other stuff. This is why I love their papers. 
Papers from most sources are predictable, skimpy, and sketchy, but theirs 
contain all sorts of deep insights that I never saw coming. And the theory, 
architecture, implementation, and explanation are all first-rate. It's like the 
Poker papers from U Alberta, or the source code for Stockfish. Lessons on every 
page.

Regarding Elo deltas: the length of Go games magnifies what might be very small 
per-move differences. E.g., if one player's moves are 3% more likely to be 
game-losing errors, then won't that player lose nearly every game? Over a game 
of roughly 100 moves per side, that is about 3 more "blunders" per game.
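A quick back-of-the-envelope check of that compounding effect. This is a 
sketch: the 3% figure comes from the text above, but the ~100 moves-per-player 
game length and the independence assumption are mine.

```python
# Sketch: how a small per-move error rate compounds over a Go game.
# Assumption: ~100 moves per player per game; moves treated as independent.

moves_per_game = 100
extra_blunder_rate = 0.03  # one side's moves are 3% more likely to be game-losing

# Expected extra game-losing errors per game: about 3 "blunders".
expected_extra_blunders = moves_per_game * extra_blunder_rate

# Probability of at least one extra game-losing error somewhere in the game.
p_at_least_one = 1 - (1 - extra_blunder_rate) ** moves_per_game

print(round(expected_extra_blunders, 2))  # ~3 extra blunders
print(round(p_at_least_one, 3))           # ~0.95, i.e. nearly every game
```

So a 3% per-move difference really does translate into losing almost every 
game, which is consistent with the huge Elo deltas reported.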

Regarding these details: at some level, all of these *must* be artifacts of 
training. That is, the NN architectures that did "badly" are still 
asymptotically optimal, so they should also eventually play equally well, 
provided that training continues indefinitely, the networks are large 
enough, parameters do not freeze prematurely, and training eventually uses 
enough, and parameters do not freeze prematurely, and training eventually uses 
only self-play data, etc. I believe that is mathematically accurate, so I would 
ask a different question: why do those choices make better use of resources in 
the short run?

I have no idea; I'm just asking.

-----Original Message-----
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Gian-Carlo Pascutto
Sent: Wednesday, October 18, 2017 5:40 PM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] AlphaGo Zero

On 18/10/2017 22:00, Brian Sheppard via Computer-go wrote:
> This paper is required reading. When I read this team’s papers, I 
> think to myself “Wow, this is brilliant! And I think I see the next step.”
> When I read their next paper, they show me the next *three* steps.

Hmm, interesting way of seeing it. Once they had the Lee Sedol AlphaGo, it was 
somewhat obvious that simply having it self-play should lead to an improved 
policy and value net.

And before someone accuses me of Captain Hindsighting here, this was pointed 
out on this list:

It looks to me like the real devil is in the details. Don't use a residual 
stack? -600 Elo. Don't combine the networks? -600 Elo.
Bootstrap the learning? -300 Elo.

We made 3 perfectly reasonable choices and somehow lost 1500 Elo along the way. 
I can't get over that number, actually.
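For scale, here is what those deltas mean as expected scores under the standard 
logistic Elo model (the usual expected-score formula; an assumption about how 
to read the numbers, not anything from the paper itself):

```python
# Sketch: convert Elo deltas to expected scores with the standard
# logistic Elo formula: E = 1 / (1 + 10**(-delta/400)).

def expected_score(elo_delta):
    """Expected score for the side holding an Elo advantage of elo_delta."""
    return 1.0 / (1.0 + 10.0 ** (-elo_delta / 400.0))

# The deltas quoted above: -300, -600, and the -1500 total.
for delta in (300, 600, 1500):
    print(delta, round(expected_score(delta), 4))
```

A 600 Elo edge already means winning roughly 97% of games; at 1500 Elo the 
weaker side essentially never wins.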

Getting the details right makes a difference. And they're getting them right, 
either because they're smart, because of experience from other domains, or 
because they're trying a ton of them. I'm betting on all 3.

Computer-go mailing list
