Re: [Computer-go] AGZ Policy Head

2017-12-29 Thread Brian Sheppard via Computer-go
I agree that having special knowledge for "pass" is not a big compromise, but 
it would not meet the "zero knowledge" goal, no?

-----Original Message-----
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Rémi Coulom
Sent: Friday, December 29, 2017 7:50 AM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] AGZ Policy Head

I also wonder about this. A purely convolutional approach would save a lot of 
weights. The output for pass can be set to be a single bias parameter, 
connected to nothing. Setting pass to a constant might work, too. I don't 
understand the reason for such a complication.
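
The purely convolutional variant described above (per-point logits from a 1x1 
filter, plus a lone bias parameter for pass connected to nothing) can be 
sketched in a few lines of NumPy; the channel count and initialization here 
are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 trunk channels, 19x19 board.
C, H, W = 256, 19, 19
w = rng.standard_normal(C) * 0.01       # one 1x1 filter is just a C-vector
b = 0.0                                 # shared per-point bias
pass_bias = 0.0                         # lone parameter for "pass"

trunk = rng.standard_normal((C, H, W))  # trunk output for one position

# A 1x1 convolution is a dot product over channels at each board point.
point_logits = np.tensordot(w, trunk, axes=(0, 0)) + b        # (19, 19)
policy_logits = np.append(point_logits.ravel(), pass_bias)    # 361 + pass
print(policy_logits.shape)  # (362,)
```

This head has only C+2 parameters after the trunk, versus the hundreds of 
thousands in the fully connected version discussed below.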

----- Original Message -----
From: "Andy" <andy.olsen...@gmail.com>
To: "computer-go" <computer-go@computer-go.org>
Sent: Friday, 29 December 2017 06:47:06
Subject: [Computer-go] AGZ Policy Head



Is there some particular reason AGZ uses two 1x1 filters for the policy head 
instead of one? 


They could also have allowed more, but I guess that would be expensive? I 
calculate that the fully connected layer has 2*361*362 weights, where 2 is the 
number of filters. 


By comparison, the value head has only a single 1x1 filter, but it feeds a 
hidden layer of 256 units. That gives 1*361*256 weights. Why not use two 1x1 
filters here? Maybe since the final output is only a single scalar it's not needed? 
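
A quick arithmetic check of the figures above (assuming the paper's 256 trunk 
channels; counting only weights, not biases):

```python
# Policy head: two 1x1 filters over 256 trunk channels, then a fully
# connected layer from 2*19*19 = 722 inputs to 362 outputs (361 points + pass).
policy_conv = 2 * 256 * 1 * 1     # 1x1 conv weights
policy_fc = (2 * 361) * 362       # the 2*361*362 figure quoted above
print(policy_fc)                  # 261364

# Value head: one 1x1 filter, then a fully connected layer from 361 inputs
# to a 256-unit hidden layer, then 256 -> 1 scalar output.
value_conv = 1 * 256 * 1 * 1
value_fc1 = (1 * 361) * 256       # the 1*361*256 figure quoted above
value_fc2 = 256 * 1
print(value_fc1)                  # 92416
```

So the policy head's fully connected layer is roughly three times larger than 
the value head's first one.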

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] AGZ Policy Head

2017-12-29 Thread David Wu
As far as a purely convolutional approach goes, I think you *can* do better
by adding some global connectivity.

Generally speaking, there should be some value in global connectivity for
things like upweighting the probability of playing ko threats anywhere on
the board when there is an active ko anywhere else on the board. If you
made the whole neural net purely convolutional, then of course with enough
convolutional layers the neural net could still learn to propagate the
"there is an important ko on the board" property everywhere, but it would
take many more layers.

I've actually experimented with this recently in training my own policy net
- for example, one approach is to have a special residual block just before
the policy head:
* Compute a convolution (1x1 or 3x3) of the trunk with C channels for a
small C, result shape 19x19xC.
* Average-pool the results down to 1x1xC.
* Multiply by a CxN matrix to turn that into 1x1xN, where N is the number of
channels in the main trunk of the resnet, broadcast up to 19x19xN, and add it
back into the main trunk (i.e. a skip connection).
Apply your favorite activation function at appropriate points in the above.
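
The steps above can be sketched in NumPy for a single position; the sizes, 
random weights, and choice of ReLU here are purely illustrative, not the 
actual configuration described:

```python
import numpy as np

rng = np.random.default_rng(1)

N, C, H, W = 64, 16, 19, 19   # N trunk channels, small pooled channel count C
trunk = rng.standard_normal((N, H, W))

# Step 1: 1x1 convolution from N trunk channels down to C channels -- an
# (C, N) matrix applied over channels at every board point, then ReLU.
w1 = rng.standard_normal((C, N)) * 0.1
x = np.maximum(0.0, np.tensordot(w1, trunk, axes=(1, 0)))   # (C, 19, 19)

# Step 2: average-pool each channel over the whole board, down to 1x1xC.
pooled = x.mean(axis=(1, 2))                                # (C,)

# Step 3: a CxN matrix lifts the pooled vector back to N channels, which is
# broadcast to 19x19xN and added back into the trunk (the skip connection).
w2 = rng.standard_normal((N, C)) * 0.1
out = trunk + (w2 @ pooled)[:, None, None]                  # (N, 19, 19)
print(out.shape)  # (64, 19, 19)
```

Because the pooled vector is broadcast to every point, information like "there 
is a ko fight somewhere" becomes available board-wide in a single block.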

There are other possible architectures for this block too; I actually did
something a bit more complicated but still pretty similar. Anyways, it
turns out that when I visualize the activations on example game situations,
I find that the neural net actually does use one of the C channels for
"is there a ko fight", which makes it predict ko threats elsewhere on the
board! Some of the other average-pooled channels appear to be used for
things like detecting game phase (how full is the board?), and detecting
who is ahead (perhaps to decide when to play risky or safe - it's
interesting that the neural net has decided this is important given that
it's a pure policy net and is trained to predict only moves, not values).

Anyways, for AGZ's case, it seems weird to have only 2 filters feeding into
the fully connected layer; that seems like too few to encode much useful
logic like this. I'm also mystified by this architecture.


On Fri, Dec 29, 2017 at 7:50 AM, Rémi Coulom  wrote:

> I also wonder about this. A purely convolutional approach would save a lot
> of weights. The output for pass can be set to be a single bias parameter,
> connected to nothing. Setting pass to a constant might work, too. I don't
> understand the reason for such a complication.
>
> ----- Original Message -----
> From: "Andy" 
> To: "computer-go" 
> Sent: Friday, 29 December 2017 06:47:06
> Subject: [Computer-go] AGZ Policy Head
>
>
>
> Is there some particular reason AGZ uses two 1x1 filters for the policy
> head instead of one?
>
>
> They could also have allowed more, but I guess that would be expensive? I
> calculate that the fully connected layer has 2*361*362 weights, where 2 is
> the number of filters.
>
>
> By comparison the value head has only a single 1x1 filter, but it goes to
> a hidden layer of 256. That gives 1*361*256 weights. Why not use two 1x1
> filters here? Maybe since the final output is only a single scalar it's not
> needed?
>
>

Re: [Computer-go] AGZ Policy Head

2017-12-29 Thread Rémi Coulom
I also wonder about this. A purely convolutional approach would save a lot of 
weights. The output for pass can be set to be a single bias parameter, 
connected to nothing. Setting pass to a constant might work, too. I don't 
understand the reason for such a complication.

----- Original Message -----
From: "Andy" 
To: "computer-go" 
Sent: Friday, 29 December 2017 06:47:06
Subject: [Computer-go] AGZ Policy Head



Is there some particular reason AGZ uses two 1x1 filters for the policy head 
instead of one? 


They could also have allowed more, but I guess that would be expensive? I 
calculate that the fully connected layer has 2*361*362 weights, where 2 is the 
number of filters. 


By comparison the value head has only a single 1x1 filter, but it goes to a 
hidden layer of 256. That gives 1*361*256 weights. Why not use two 1x1 filters 
here? Maybe since the final output is only a single scalar it's not needed? 
