Re: [Computer-go] Zero performance

2017-10-21 Thread Gian-Carlo Pascutto
On 20/10/2017 22:48, fotl...@smart-games.com wrote:
> The paper describes 20- and 40-block networks, but the comparison
> section says AlphaGo Zero uses 20 blocks. I think your protobuf
> describes a 40-block network. That's a factor of two.

They compared with both; the final 5180 Elo number is for the 40-block
one. For the 20-block one, the numbers stop around 4300 Elo.
See for example:

https://www.reddit.com/r/baduk/comments/77hr3b/elo_table_of_alphago_zero_selfplay_games/

A factor of 2 isn't much, but sure, it seems sensible to start with the
smaller one, given how intractable the problem looks right now.

> Your time looks reasonable when calculating the time to generate the
> 29M games at about 10 seconds per move. This is only the time to
> generate the input data. Do you have an estimate of the additional
> time it takes to do the training? It's probably small in comparison,
> but it might not be.

So far I've assumed that it's zero, because training can happen in parallel
and the time to generate the self-play games dominates. From the revised
hardware estimates, we can also see that the training machines used 64
GPUs, far fewer than the 1500+ TPUs estimated for the self-play machines.

Training on the GTX 1080 Ti does 4 mini-batches of 32 positions per second.
They use 2048-position batches and train for 1000 batches before
checkpointing, so the GTX can produce a checkpoint every ~4.5 hours [1].
Testing that checkpoint over 400 games takes about 8.5 days (400 games x
200 moves x 9.3s per move).
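
For anyone checking the arithmetic, here it is written out in Python (a
sketch using only the numbers from this thread; the 9.3s/move figure is
from my measurements further down):

  # Checkpoint and testing time estimates from the figures above.
  positions_per_sec = 4 * 32           # GTX 1080 Ti: 4 mini-batches of 32/s
  batch_size = 2048                    # DeepMind's training batch size
  batches_per_checkpoint = 1000

  hours = batch_size * batches_per_checkpoint / positions_per_sec / 3600
  print(f"checkpoint every {hours:.1f} hours")        # ~4.4 hours

  games, moves_per_game, sec_per_move = 400, 200, 9.3
  days = games * moves_per_game * sec_per_move / 86400
  print(f"testing over 400 games: {days:.1f} days")   # ~8.6 days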

So again, it totally bottlenecks on playing games, not on training. At
least, if the improvement is big, one needn't play all 400 games out;
SPRT termination can be used to stop early.
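
For reference, a minimal sketch of such an SPRT stopping rule (my own
illustration, not anything from the paper; Go with a non-integer komi has
no draws, so a plain binomial Wald test suffices):

  import math

  def sprt(wins, losses, p0=0.50, p1=0.55, alpha=0.05, beta=0.05):
      # Wald's sequential probability ratio test: H0 = win rate p0,
      # H1 = win rate p1. Returns a decision or None (keep playing).
      llr = (wins * math.log(p1 / p0)
             + losses * math.log((1 - p1) / (1 - p0)))
      if llr >= math.log((1 - beta) / alpha):
          return "accept H1: new net is stronger"
      if llr <= math.log(beta / (1 - alpha)):
          return "accept H0: no improvement shown"
      return None

  print(sprt(wins=150, losses=100))   # decides after 250 games, not 400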

[1] To be honest, this seems very fast: even starting from zero, such a big
network barely advances in 1000 iterations (or I misinterpreted a training
parameter). But I guess it's important to have a very fast cycle of
learning knowledge, applying the new knowledge, and feeding back the
results.

-- 
GCP

Re: [Computer-go] Zero performance

2017-10-21 Thread Gian-Carlo Pascutto
On 20/10/2017 22:41, Sorin Gherman wrote:
> Training of AlphaGo Zero has been done on thousands of TPUs,
> according to this source: 
> https://www.reddit.com/r/baduk/comments/777ym4/alphago_zero_learning_from_scratch_deepmind/dokj1uz/?context=3
>
>  Maybe that should explain the difference in orders of magnitude that
> you noticed?

That would make a lot more sense, for sure. It would also explain the
25M USD number from Hassabis. That would be a lot of money to spend on
"only" 64 GPUs, or 4 TPUs (each of which is supposed to be roughly 1 GPU).

There's no explanation of where the number came from, but it seems he
did math similar to that in the original post here.

-- 
GCP

Re: [Computer-go] Zero performance

2017-10-20 Thread Gian-Carlo Pascutto
I agree. Even on 19x19 you can use smaller searches. A 400-iteration MCTS is
probably already a lot stronger than the raw network, especially if you are
expanding every node (very different from a normal program at 400
playouts!). Some tuning of these mini-searches is important. Surely you
don't want to explore every child node right away, which is where first-play
urgency comes in... I remember this little algorithmic detail was missing
from the first paper as well.
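
To make the first-play urgency point concrete, here's a rough sketch of
PUCT child selection where unvisited children get a fixed FPU value
instead of an infinite urgency (the constants and names are my own guesses
for illustration; the paper doesn't spell this out):

  import math

  FPU = -1.0      # assumed value for unvisited children; needs tuning
  C_PUCT = 1.5    # exploration constant, also a guess

  def select_child(children, parent_visits):
      # children: list of (prior, visits, value_sum) per move.
      best, best_idx = -float("inf"), 0
      for i, (prior, visits, value_sum) in enumerate(children):
          # An unvisited child gets FPU instead of being forced to the
          # front of the queue, so not every child is expanded once.
          q = value_sum / visits if visits > 0 else FPU
          u = C_PUCT * prior * math.sqrt(parent_visits) / (1 + visits)
          if q + u > best:
              best, best_idx = q + u, i
      return best_idx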

So that's a factor-of-32 gain (4x from the smaller search times 8x from the
smaller network). Because the network is smaller, it should learn much
faster too. Someone on reddit posted a comparison of 20 blocks vs 40
blocks.

With 10 people you can probably get some results in a few months. The
question is, how much Elo have we lost on the way...

Another advantage would be that, as long as you keep all the SGF, you can
bootstrap a bigger network from the data! So, nothing is lost from starting
small. You can "upgrade" if the improvements start to plateau.

On Fri, Oct 20, 2017, 23:32 Álvaro Begué wrote:

> I suggest scaling down the problem until some experience is gained.
>
> You don't need the full-fledged 40-block network to get started. You can
> probably get away with using only 20 blocks and maybe 128 features (from
> 256). That should save you about a factor of 8, plus you can use larger
> mini-batches.
>
> You can also start with 9x9 go. That way games are shorter, and you
> probably don't need 1600 network evaluations per move to do well.
>
> Álvaro.
>
>
> On Fri, Oct 20, 2017 at 1:44 PM, Gian-Carlo Pascutto wrote:
>
>> I reconstructed the full AlphaGo Zero network in Caffe:
>> https://sjeng.org/dl/zero.prototxt
>> [...]

-- 
GCP

Re: [Computer-go] Zero performance

2017-10-20 Thread John Tromp
> You can also start with 9x9 go. That way games are shorter, and you probably
> don't need 1600 network evaluations per move to do well.

Bonus points if you can have it play on GoQuest, where many
of us can enjoy watching its progress, or even challenge it...

regards,
-John

Re: [Computer-go] Zero performance

2017-10-20 Thread fotland
The paper describes 20- and 40-block networks, but the comparison section 
says AlphaGo Zero uses 20 blocks. I think your protobuf describes a 40-block 
network. That's a factor of two.

If you only want pro strength rather than superhuman, you can train for half 
their time.

Your time looks reasonable when calculating the time to generate the 29M games 
at about 10 seconds per move. This is only the time to generate the input data. 
Do you have an estimate of the additional time it takes to do the training? 
It's probably small in comparison, but it might not be.

My plan is to start out with a little supervised learning, since I'm not trying 
to prove a breakthrough. I experimented for a few months last year with 
res-nets for a policy network, and some of the things I discovered there 
probably apply to this network. Those should give perhaps a factor of 5 to 10 
speedup. For a commercial program I'll be happy with 7-dan amateur strength 
after about 6 months of training using my two GPUs and sixteen i7 cores.

David

-----Original Message-----
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Gian-Carlo Pascutto
Sent: Friday, October 20, 2017 10:45 AM
To: computer-go@computer-go.org
Subject: [Computer-go] Zero performance

I reconstructed the full AlphaGo Zero network in Caffe:
https://sjeng.org/dl/zero.prototxt

I did some performance measurements, with what should be state-of-the-art on 
consumer hardware:

GTX 1080 Ti
NVIDIA-Caffe + CUDA 9 + cuDNN 7
batch size = 8

Memory use is about 2 GB. (It's much more for learning; the original minibatch 
size of 32 wouldn't fit on this card!)

Running 2000 iterations takes 93 seconds.

In the AlphaGo paper, they claim 0.4 seconds to do 1600 MCTS simulations, and 
they expand 1 node per visit (if I got that right), so that would be 1600 
network evaluations as well, or 200 of my iterations at batch size 8.

So it would take me ~9.3s to produce a self-play move, compared to 0.4s for 
them.

I would like to extrapolate how long it will take to reproduce the research, 
but I think I'm missing how many GPUs are in each self-play worker (4 TPUs? 
64 GPUs?), or perhaps the average length of the games.

Let's say the latter is around 200 moves. They generated 29 million games for 
the final result, which means it's going to take me about 1700 years to 
replicate this. I initially estimated 7 years based on the reported 64 GPU vs 1 
GPU, but this seems far worse. Did I miss anything in the calculations above, 
or was it really a *pile* of those 64 GPU machines?
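
The extrapolation, written out in Python (all inputs are the estimates
above, so the output is only as good as they are):

  # Single-GPU replication estimate from the measurements above.
  iters_per_sec = 2000 / 93        # measured: 2000 iterations in 93 s
  evals_per_move = 1600            # MCTS simulations per move (paper)
  batch = 8                        # positions evaluated per iteration

  sec_per_move = evals_per_move / batch / iters_per_sec      # ~9.3 s
  games, moves = 29_000_000, 200
  years = games * moves * sec_per_move / (3600 * 24 * 365)
  print(f"{sec_per_move:.1f} s/move, ~{years:.0f} years")    # ~1700 years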

Because the performance on playing seems reasonable (you would be able to 
actually run the MCTS on a consumer machine, and hence end up with a strong 
program), I would be interested in setting up a distributed effort for this. 
But realistically there will be maybe 10 people joining, 80 if we're very lucky 
(looking at Stockfish numbers). That means it'd still take 20 to 170 years.

Someone please tell me I missed a factor of 100 or more somewhere. I'd love to 
be wrong here.

--
GCP

Re: [Computer-go] Zero performance

2017-10-20 Thread Sorin Gherman
Training of AlphaGo Zero has been done on thousands of TPUs, according to
this source:
https://www.reddit.com/r/baduk/comments/777ym4/alphago_zero_learning_from_scratch_deepmind/dokj1uz/?context=3

Maybe that should explain the difference in orders of magnitude that you
noticed?


On Fri, Oct 20, 2017 at 10:44 AM, Gian-Carlo Pascutto wrote:

> I reconstructed the full AlphaGo Zero network in Caffe:
> https://sjeng.org/dl/zero.prototxt
> [...]

Re: [Computer-go] Zero performance

2017-10-20 Thread Álvaro Begué
I suggest scaling down the problem until some experience is gained.

You don't need the full-fledged 40-block network to get started. You can
probably get away with using only 20 blocks and maybe 128 features (from
256). That should save you about a factor of 8, plus you can use larger
mini-batches.
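
As a sanity check on that factor of 8: halving the residual blocks gives
2x, and halving the filters gives roughly 4x, since a 3x3 convolution's
cost scales with in-channels times out-channels. In code:

  # Rough compute ratio, counting only the residual tower's 3x3 convs.
  def tower_cost(blocks, channels):
      return blocks * channels * channels   # cost ~ C_in * C_out per block

  print(tower_cost(40, 256) / tower_cost(20, 128))   # 8.0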

You can also start with 9x9 go. That way games are shorter, and you
probably don't need 1600 network evaluations per move to do well.

Álvaro.


On Fri, Oct 20, 2017 at 1:44 PM, Gian-Carlo Pascutto wrote:

> I reconstructed the full AlphaGo Zero network in Caffe:
> https://sjeng.org/dl/zero.prototxt
> [...]