Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-12 Thread Xavier Combelle
So I will start creating the software; anyone who wants to use it will be
free to do so, as it will be free software. I have already found someone
who is ready to host the server side.

From a practical point of view, I will use public-key signing to
distribute the Go software (binary or source), so I will ask the authors
to sign it and give me their public keys.
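On the server side, checking a submitted build could be as simple as this
(an untested sketch, assuming gpg is installed and the author's key has
already been imported into the keyring):

import subprocess

def signature_is_valid(artifact_path, signature_path):
    # True if the detached signature matches a key in the local keyring.
    result = subprocess.run(["gpg", "--verify", signature_path, artifact_path],
                            capture_output=True)
    return result.returncode == 0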

Xavier Combelle


Le 12/01/2017 à 11:04, Gian-Carlo Pascutto a écrit :
> On 11-01-17 18:09, Xavier Combelle wrote:
>> Of course it means distributing at least the binary, or the source, so
>> proprietary software authors could be reluctant to share it. But for
>> free software there should not be any problem. If someone is interested
>> in my proposal, I would be pleased to implement it.
> It is obvious that having a 30M dataset of games between strong players
> (i.e. replicating the AlphaGo training set) would be beneficial to the
> community. It is clear that most of us are trying to do the same now,
> that is, somehow trying to learn a value function from the roughly 1.5M
> KGS+Tygem+GoGoD games while trying to control overfitting via various
> measures. (Aya used a small network + dropout. Rn trained multiple outputs
> on a network of unknown size. I wonder why no one tried normal L1/L2
> regularization, but then again I didn't get that working either!)
>
> Software should also not really be a problem: Leela is free, Ray and
> Darkforest are open source. If we can use a pure DCNN player I think
> there are several more options; for example, I've seen several programs
> in Python. You can resolve score disagreements by invoking GNU Go
> --score aftermath.
>
> I think it's an open question, though, *how* the games should be
> generated, i.e.:
>
> * Follow the AlphaGo procedure but with the SL instead of the RL player
> (you can use bigger or smaller networks too; many tradeoffs are possible).
> * Play games with a full MCTS search and a small number of playouts.
> (More bias, much higher quality games.)
> * The author of Aya also described his procedure.
> * Several of the above, mixed :-)
>



___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-12 Thread Gian-Carlo Pascutto
On 11-01-17 18:09, Xavier Combelle wrote:
> Of course it means distributing at least the binary, or the source, so
> proprietary software authors could be reluctant to share it. But for free
> software there should not be any problem. If someone is interested in my
> proposal, I would be pleased to implement it.

It is obvious that having a 30M dataset of games between strong players
(i.e. replicating the AlphaGo training set) would be beneficial to the
community. It is clear that most of us are trying to do the same now,
that is, somehow trying to learn a value function from the roughly 1.5M
KGS+Tygem+GoGoD games while trying to control overfitting via various
measures. (Aya used a small network + dropout. Rn trained multiple outputs
on a network of unknown size. I wonder why no one tried normal L1/L2
regularization, but then again I didn't get that working either!)

Software should also not really be a problem: Leela is free, Ray and
Darkforest are open source. If we can use a pure DCNN player I think
there are several more options; for example, I've seen several programs
in Python. You can resolve score disagreements by invoking GNU Go
--score aftermath.
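For instance (a rough, untested sketch; check the option spelling and the
output format against your GNU Go version):

import subprocess

def gnugo_final_score(sgf_path):
    # Let GNU Go play out the aftermath of a finished game and print the score.
    out = subprocess.run(
        ["gnugo", "--score", "aftermath", "--chinese-rules", "-l", sgf_path],
        capture_output=True, text=True).stdout
    return out.strip()   # e.g. something like "White wins by 2.5 points"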

I think it's an open question, though, *how* the games should be
generated, i.e.:

* Follow the AlphaGo procedure but with the SL instead of the RL player
(you can use bigger or smaller networks too; many tradeoffs are possible;
a rough sketch of this option follows after the list).
* Play games with a full MCTS search and a small number of playouts.
(More bias, much higher quality games.)
* The author of Aya also described his procedure.
* Several of the above, mixed :-)
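For the first option, the per-game sampling would look roughly like this
(schematic Python; the policy and game objects are placeholders for
whatever engine API is actually used):

import random

def generate_value_sample(policy, new_game, max_moves=450):
    # One (position, outcome) training pair per game, following the AlphaGo
    # recipe but with an SL policy throughout.
    game = new_game()
    u = random.randint(1, max_moves)
    for _ in range(u - 1):                            # moves 1 .. U-1
        if game.is_over():
            return None                               # ended early, discard
        game.play(policy.sample_move(game))
    if game.is_over() or not game.legal_moves():
        return None
    game.play(random.choice(game.legal_moves()))      # move U: uniform random
    position = game.copy()                            # the position to label
    while not game.is_over():                         # play out the rest
        game.play(policy.sample_move(game))
    return position, game.winner()                    # label = final outcome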

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Xavier Combelle


Le 11/01/2017 à 16:14, Bo Peng a écrit :
> Hi,
>
>> How do you get the V(s) for those datasets? You play out the endgame
>> with the Monte Carlo playouts?
>>
>> I think one problem with this approach is that errors in the data for
>> V(s) directly correlate to errors in MC playouts. So a large benefit of
>> "mixing" the two (otherwise independent) evaluations is lost.
> Yes, that is a problem for the human-games dataset.
>
> On the other hand, currently the SL part is relatively easy (it seems
> everyone arrives at 50-60% accuracy), and the main challenge of the RL
> part is generating the huge number of self-play games.
>
> In self-play games we have an accurate end-game v(s) / V(s), and v(s) /
> V(s) can use the information in self-play games more efficiently. I
> think this can be helpful.
>
Could a distributed-computation setup such as fishtest for Stockfish help
to generate the huge number of self-play games?

If so, I could create the framework for it. It is conventional programming,
and as such I should be able to do it (as opposed to the computer Go
software itself, which is hard for me for lack of practice).
Of course it means distributing at least the binary, or the source, so
proprietary software authors could be reluctant to share it.
But for free software there should not be any problem.

If someone is interested in my proposal, I would be pleased to implement it.
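To give an idea, the client side could be as simple as this (a rough
sketch; the server URLs and the engine command line are invented for
illustration only):

import subprocess
import requests

SERVER = "https://example.org/selfplay"   # hypothetical job server

def work_loop():
    while True:
        job = requests.get(SERVER + "/job").json()   # which engine, how many games
        subprocess.run(["./engine", "--selfplay",
                        "--games", str(job["games"]),
                        "--out", "games.sgf"], check=True)
        with open("games.sgf", "rb") as f:
            requests.post(SERVER + "/result",
                          files={"sgf": f},
                          data={"job_id": str(job["id"])})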

Xavier

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi,

>How do you get the V(s) for those datasets? You play out the endgame
>with the Monte Carlo playouts?
>
>I think one problem with this approach is that errors in the data for
>V(s) directly correlate to errors in MC playouts. So a large benefit of
>"mixing" the two (otherwise independent) evaluations is lost.

Yes, that is a problem for the human-games dataset.

On the other hand, currently the SL part is relatively easy (it seems
everyone arrives at 50-60% accuracy), and the main challenge of the RL
part is generating the huge number of self-play games.

In self-play games we have an accurate end-game v(s) / V(s), and v(s) /
V(s) can use the information in self-play games more efficiently. I
think this can be helpful.


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Kensuke Matsuzaki
Hi,

> How do you get the V(s) for those datasets? You play out the endgame
> with the Monte Carlo playouts?

Yes, I use the result of 100 playouts from the final position.
Sometimes the result stored in the SGF differs from the result of the
playouts.
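Something like this (the playout function is whatever the engine provides;
this is just the idea):

def value_from_playouts(end_position, run_playout, n=100):
    # run_playout returns 1 if Black wins the playout, 0 otherwise.
    wins = sum(run_playout(end_position) for _ in range(n))
    return wins / float(n)   # Black's result estimate in [0, 1]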

zakki
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi zakki,

> I couldn't get positive experimental results on Ray.
> Rn's network structures for V and W are similar and share parameters;
> only the final convolutional layers are different.
> I trained Rn's network to minimize the MSE of V(s) + W(s).
> It uses only the KGS and GoGoD data sets, no self-play with the RL policy.


Thanks for sharing your results.

Have you tried more stages of training V, in which the second method in my
PDF is also used (i.e. training the value network to fit the "observed
move")? I feel it could improve the "awareness / sharpness" of V.

Bo


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
It's nice to see so many discussions.

Another reason could be that training a good-quality v(s) (or V(s)) may
require a somewhat different network structure from that of W(s).

Usually it is helpful to have an ensemble of different networks, each
constructed from different principles.
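For instance, at evaluation time one could simply average the members (a
trivial sketch, assuming each network maps a position to a scalar value):

def ensemble_value(networks, position):
    # Average the value estimates of independently constructed networks.
    return sum(net(position) for net in networks) / len(networks)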

On 1/11/17, 22:19, "Computer-go on behalf of Gian-Carlo Pascutto"
 wrote:
>
>Combining this with Kensuke's comment, I think it might be worth trying
>to train V(s) and W(s) simultaneously, but with V(s) being the linear
>interpolation depending on move number, not the value function (which
>leaves us without a way to play handicap games and a bunch of other
>benefits).
>
>This could reduce overfitting during training, and if we only use W(s)
>during gameplay we still have the "strong signal" advantage.
>
>-- 
>GCP
>___
>Computer-go mailing list
>Computer-go@computer-go.org
>http://computer-go.org/mailman/listinfo/computer-go


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Gian-Carlo Pascutto
On 10-01-17 23:25, Bo Peng wrote:
> Hi everyone. It occurs to me there might be a more efficient method to
> train the value network directly (without using the policy network).
> 
> You are welcome to check my
> method: http://withablink.com/GoValueFunction.pdf
> 

For Method 1 you state:

"However, because v is an finer function than V (which is already finer
than W), the bias is better controlled than the case of W, and we can
use all states in the game to train our network, instead of just picking
1 state in each game to avoid over-fitting"

This is intuitively true, and I'm sure it will reduce some overfitting
behavior, but empirically the author of Aya reported the opposite, i.e.
training on W/L is superior to a linear interpolation toward the endgame.

It's possible this happens because the V(s) flipping from 0.5 to 0 and 1
more steeply helps the positions where this happens stand out from the
MC noise.

Combining this with Kensuke's comment, I think it might be worth trying
to train V(s) and W(s) simultaneously, but with V(s) being the linear
interpolation depending on move number, not the value function (which
leaves us without a way to play handicap games and a bunch of other
benefits).

This could reduce overfitting during training, and if we only use W(s)
during gameplay we still have the "strong signal" advantage.
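Concretely, the targets I have in mind would be computed roughly like this
(just a sketch of the idea, results taken from Black's point of view):

def training_targets(move_number, total_moves, black_won):
    # W target: the final game result itself.
    w = 1.0 if black_won else 0.0
    # V target: linear interpolation by move number, from 0.5 at the start
    # of the game to the final result at the last move.
    frac = move_number / float(total_moves)
    v = 0.5 + (w - 0.5) * frac
    return v, w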

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Gian-Carlo Pascutto
On 11-01-17 14:33, Kensuke Matsuzaki wrote:
> Hi,
> 
> I couldn't get positive experimental results on Ray.
>
> Rn's network structures for V and W are similar and share parameters;
> only the final convolutional layers are different.
> I trained Rn's network to minimize the MSE of V(s) + W(s).
> It uses only the KGS and GoGoD data sets, no self-play with the RL policy.

How do you get the V(s) for those datasets? You play out the endgame
with the Monte Carlo playouts?

I think one problem with this approach is that errors in the data for
V(s) directly correlate to errors in MC playouts. So a large benefit of
"mixing" the two (otherwise independent) evaluations is lost.

This problem doesn't exist when using raw W/L data from those datasets,
or when using SL/RL playouts. (But note that using the full engine to
produce games *would* suffer from the same correlation. That might be
entirely offset by the higher quality of the data, though.)

> But I have no idea about how to use V(s) or v(s) in MCTS.

V(s) seems potentially useful for handicap games where W(s) is no longer
accurate. I don't see any benefit over W(s) for even games.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Kensuke Matsuzaki
Hi,

I couldn't get positive experimental results on Ray.

Rn's network structures for V and W are similar and share parameters;
only the final convolutional layers are different.
I trained Rn's network to minimize the MSE of V(s) + W(s).
It uses only the KGS and GoGoD data sets, no self-play with the RL policy.
When trained only on W(s) the network overfits, but training V(s) + W(s) at
the same time prevents overfitting.
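Roughly, the structure is like this (a simplified PyTorch-style sketch, not
the actual Rn code; the input planes and channel counts are made up):

import torch
import torch.nn as nn

class SharedValueNet(nn.Module):
    # Shared convolutional trunk; only the final 1x1 conv layers differ
    # between the W head (win/loss) and the V head (ownership-like).
    def __init__(self, in_planes=4, channels=64, blocks=4):
        super().__init__()
        layers = [nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU()]
        for _ in range(blocks):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.w_head = nn.Conv2d(channels, 1, 1)   # scalar win probability
        self.v_head = nn.Conv2d(channels, 1, 1)   # per-point ownership map

    def forward(self, x):
        h = self.trunk(x)
        w = torch.sigmoid(self.w_head(h).mean(dim=(1, 2, 3)))  # W(s) in [0, 1]
        v = torch.tanh(self.v_head(h))                         # v(s) in [-1, 1]
        return w, v

# Training then minimizes mse(w, w_target) + mse(v, v_target) on the same
# batch, which is what seems to act as a regularizer here.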
But I have no idea how to use V(s) or v(s) in MCTS.

Rn.3.0-4c plays with W(s) (winning rate):
http://www.yss-aya.com/19x19/cgos/cross/Rn.3.0-4c.html
3394 Elo

Rn.3.1-4c plays with V(s) (sum of ownership); it is a bit weaker.
# The MCTS part is tuned for W(s) now, so something may be wrong.
http://www.yss-aya.com/cgos/19x19/cross/Rn.3.1-4c.html
3218 Elo

zakki

2017年1月11日(水) 19:49 Bo Peng :

> Hi Remi,
>
> Thanks for sharing your experience.
>
> As I am writing this, it seems there could be a third method: the perfect
> value function should have the minimax property in the obvious way. So we
> can train our value function to satisfy the minimax property as well. In
> fact, we can train it such that a shallow-level MCTS gives as close a
> result as possible to a deeper-level MCTS. This can be regarded as a kind
> of bootstrapping.
>
> I wonder if you have tried this. It seems it might be a natural idea...
>
> Bo
>
> On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom"
> 
> wrote:
>
> >Hi,
> >
> >Thanks for sharing your idea.
> >
> >In my experience it is rarely efficient to train value functions from
> >very short-term data (i.e., the next move). TD(lambda), or training from
> >the final outcome of the game, is often better, because it uses a longer
> >horizon. But of course, it is difficult to tell without experiments
> >whether your idea would work or not. The advantage of your idea is that
> >you can collect a lot of training data more easily.
> >
> >Rémi
> >
> >- Mail original -
> >De: "Bo Peng" 
> >À: computer-go@computer-go.org
> >Envoyé: Mardi 10 Janvier 2017 23:25:19
> >Objet: [Computer-go] Training the value network (a possibly more
> >efficient approach)
> >
> >
> >Hi everyone. It occurs to me there might be a more efficient method to
> >train the value network directly (without using the policy network).
> >
> >
> >You are welcome to check my method:
> >http://withablink.com/GoValueFunction.pdf
> >
> >
> >Let me know if there are any silly mistakes :)
> >
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi Remi,

Thanks for sharing your experience.

As I am writing this, it seems there could be a third method: the perfect
value function should have the minimax property in the obvious way. So we
can train our value function to satisfy the minimax property as well. In
fact, we can train it such that a shallow-level MCTS gives as close a
result as possible to a deeper-level MCTS. This can be regarded as a kind
of bootstrapping.

I wonder if you have tried this. It seems it might be a natural idea...
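Schematically, something like this (the search interface is hypothetical,
and the deep-search value is treated as a fixed regression target):

def bootstrap_targets(network, positions, search, deep_playouts=10000):
    # Use the result of a deeper MCTS as the regression target for V(s),
    # so that the network (and hence a shallow search built on it) is
    # pushed towards agreeing with the deeper search.
    return [(s, search(s, network, playouts=deep_playouts)) for s in positions]

# The value network is then trained by ordinary regression on these
# (position, target) pairs, and the procedure can be iterated.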

Bo

On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom"

wrote:

>Hi,
>
>Thanks for sharing your idea.
>
>In my experience it is rarely efficient to train value functions from
>very short-term data (i.e., the next move). TD(lambda), or training from
>the final outcome of the game, is often better, because it uses a longer
>horizon. But of course, it is difficult to tell without experiments
>whether your idea would work or not. The advantage of your idea is that
>you can collect a lot of training data more easily.
>
>Rémi
>
>- Mail original -
>De: "Bo Peng" 
>À: computer-go@computer-go.org
>Envoyé: Mardi 10 Janvier 2017 23:25:19
>Objet: [Computer-go] Training the value network (a possibly more
>efficient approach)
>
>
>Hi everyone. It occurs to me there might be a more efficient method to
>train the value network directly (without using the policy network).
>
>
>You are welcome to check my method:
>http://withablink.com/GoValueFunction.pdf
>
>
>Let me know if there are any silly mistakes :)
>


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Rémi Coulom
Hi,

Thanks for sharing your idea.

In my experience it is rarely efficient to train value functions from very
short-term data (i.e., the next move). TD(lambda), or training from the final
outcome of the game, is often better, because it uses a longer horizon. But of
course, it is difficult to tell without experiments whether your idea would
work or not. The advantage of your idea is that you can collect a lot of
training data more easily.
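For reference, with gamma = 1 and the game outcome as the only reward, the
lambda-return targets can be computed like this (a generic sketch, not code
from any particular program):

def lambda_return_targets(values, outcome, lam=0.9):
    # values[t]: the current value estimate for the position after move t.
    # outcome:   the final result of the game (e.g. 1.0 = win, 0.0 = loss).
    T = len(values)
    targets = [0.0] * T
    targets[T - 1] = outcome
    for t in range(T - 2, -1, -1):
        # Mix the one-step bootstrap with the longer-horizon return.
        targets[t] = (1.0 - lam) * values[t + 1] + lam * targets[t + 1]
    return targets

# lam = 1.0 reduces to training on the final outcome; lam = 0.0 reduces to
# one-step (next-position) bootstrapping.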

Rémi

- Mail original -
De: "Bo Peng" 
À: computer-go@computer-go.org
Envoyé: Mardi 10 Janvier 2017 23:25:19
Objet: [Computer-go] Training the value network (a possibly more efficient 
approach)


Hi everyone. It occurs to me there might be a more efficient method to train 
the value network directly (without using the policy network). 


You are welcome to check my method: http://withablink.com/GoValueFunction.pdf 


Let me know if there are any silly mistakes :)

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-10 Thread Brian Sheppard
I was writing code along those lines when AlphaGo debuted. When it became
clear that AlphaGo had succeeded, I ceased work.

 

So I don’t know whether this strategy will succeed, but the theoretical merits 
were good enough to encourage me.

 

Best of luck,

Brian

 

From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of Bo 
Peng
Sent: Tuesday, January 10, 2017 5:25 PM
To: computer-go@computer-go.org
Subject: [Computer-go] Training the value network (a possibly more efficient 
approach)

 

Hi everyone. It occurs to me there might be a more efficient method to train 
the value network directly (without using the policy network).

 

You are welcome to check my method: http://withablink.com/GoValueFunction.pdf

 

Let me know if there are any silly mistakes :)

 

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-10 Thread John Tromp
hi Bo,

> Let me know if there are any silly mistakes :)

You say "the perfect policy network can be
derived from the perfect value network (the best next move is the move
that maximises the value for the player, if the value function is
perfect), but not vice versa.", but a perfect policy for both players
can be used to generate a perfect playout which yields the perfect
value...

regards,
-John
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go