Re: [Computer-go] Training the value network (a possibly more efficient approach)
So I will start to create the software; anyone who wants to use it will be free to, as it will be free software. I have already found someone who is ready to host the server side. From a practical point of view, I will use public-key signing to distribute the Go software (binary or source), so I will ask the authors to sign it and give me their public keys.

Xavier Combelle

On 12/01/2017 at 11:04, Gian-Carlo Pascutto wrote:
> On 11-01-17 18:09, Xavier Combelle wrote:
>> Of course it means distributing at least the binary, or the source, so
>> authors of proprietary software could be reluctant to share it. But for
>> free software there should not be any problem. If someone is interested
>> in my proposition, I would be pleased to realize it.
>
> It is obvious that having a 30M dataset of games between strong players
> (i.e. replicating the AlphaGo training set) would be beneficial to the
> community. It is clear that most of us are now trying to do the same
> thing: somehow learn a value function from the roughly ~1.5M
> KGS+Tygem+GoGoD games while trying to control overfitting via various
> measures. (Aya used a small network + dropout. Rn trained multiple
> outputs on a network of unknown size. I wonder why no-one tried normal
> L1/L2 regularization, but then again I didn't get that working either!)
>
> Software should also not really be a problem: Leela is free, Ray and
> Darkforest are open source. If we can use a pure DCNN player I think
> there are several more options; for example, I've seen several programs
> in Python. You can resolve score disagreements by invoking GNU Go
> --score aftermath.
>
> I think it's an open question, though, *how* the games should be
> generated, i.e.:
>
> * Follow the AlphaGo procedure but with the SL instead of the RL player
>   (you can use bigger or smaller networks too; many tradeoffs are possible)
> * Play games with a full MCTS search and a small number of playouts.
>   (More bias, much higher quality games.)
> * The author of Aya also stated his procedure.
> * Several of those mixed :-)

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
Re: [Computer-go] Training the value network (a possibly more efficient approach)
On 11-01-17 18:09, Xavier Combelle wrote:
> Of course it means distributing at least the binary, or the source, so
> authors of proprietary software could be reluctant to share it. But for
> free software there should not be any problem. If someone is interested
> in my proposition, I would be pleased to realize it.

It is obvious that having a 30M dataset of games between strong players (i.e. replicating the AlphaGo training set) would be beneficial to the community. It is clear that most of us are now trying to do the same thing: somehow learn a value function from the roughly ~1.5M KGS+Tygem+GoGoD games while trying to control overfitting via various measures. (Aya used a small network + dropout. Rn trained multiple outputs on a network of unknown size. I wonder why no-one tried normal L1/L2 regularization, but then again I didn't get that working either!)

Software should also not really be a problem: Leela is free, Ray and Darkforest are open source. If we can use a pure DCNN player I think there are several more options; for example, I've seen several programs in Python. You can resolve score disagreements by invoking GNU Go --score aftermath.

I think it's an open question, though, *how* the games should be generated, i.e.:

* Follow the AlphaGo procedure but with the SL instead of the RL player
  (you can use bigger or smaller networks too; many tradeoffs are possible)
* Play games with a full MCTS search and a small number of playouts.
  (More bias, much higher quality games.)
* The author of Aya also stated his procedure.
* Several of those mixed :-)

--
GCP
Re: [Computer-go] Training the value network (a possibly more efficient approach)
On 11/01/2017 at 16:14, Bo Peng wrote:
> Hi,
>
>> How do you get the V(s) for those datasets? You play out the endgame
>> with the Monte Carlo playouts?
>>
>> I think one problem with this approach is that errors in the data for
>> V(s) directly correlate with errors in the MC playouts. So a large
>> benefit of "mixing" the two (otherwise independent) evaluations is lost.
>
> Yes, that is a problem for the human games dataset.
>
> On the other hand, currently the SL part is relatively easy (it seems
> everyone arrives at 50-60% accuracy), and the main challenge of the RL
> part is generating the huge number of self-play games.
>
> In self-play games we have an accurate end-game v(s) / V(s), and v(s) /
> V(s) is able to use the information in self-play games more efficiently.
> I think this can be helpful.

Could a distributed workload, such as fishtest for Stockfish, help to generate a huge number of self-play games? If so, I could create the framework to use it. It is classical programming, and as such I should be able to do it (as opposed to computer Go software, which is hard for me for lack of practice). Of course it means distributing at least the binary, or the source, so authors of proprietary software could be reluctant to share it. But for free software there should not be any problem. If someone is interested in my proposition, I would be pleased to realize it.

Xavier
Re: [Computer-go] Training the value network (a possibly more efficient approach)
Hi,

> How do you get the V(s) for those datasets? You play out the endgame
> with the Monte Carlo playouts?
>
> I think one problem with this approach is that errors in the data for
> V(s) directly correlate with errors in the MC playouts. So a large
> benefit of "mixing" the two (otherwise independent) evaluations is lost.

Yes, that is a problem for the human games dataset.

On the other hand, currently the SL part is relatively easy (it seems everyone arrives at 50-60% accuracy), and the main challenge of the RL part is generating the huge number of self-play games.

In self-play games we have an accurate end-game v(s) / V(s), and v(s) / V(s) is able to use the information in self-play games more efficiently. I think this can be helpful.
Re: [Computer-go] Training the value network (a possibly more efficient approach)
Hi,

> How do you get the V(s) for those datasets? You play out the endgame
> with the Monte Carlo playouts?

Yes, I use the result of 100 playouts from the endgame. Sometimes the result stored in the sgf differs from the result of the playouts.

zakki
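As a rough sketch, estimating V(s) by averaging playout results from the endgame might look like this (illustrative only; `playout_fn` and the toy coin-flip playout are placeholders, not Ray's actual playout code):

```python
import random

def estimate_value(playout_fn, position, n_playouts=100):
    """Estimate V(s) as the fraction of playouts won from `position`.

    `playout_fn(position)` stands in for a Monte Carlo playout that
    returns 1 for a win and 0 for a loss, from a fixed player's view.
    """
    wins = sum(playout_fn(position) for _ in range(n_playouts))
    return wins / n_playouts

# Toy stand-in playout: a biased coin flip instead of a real Go playout.
random.seed(42)
toy_playout = lambda pos: 1 if random.random() < 0.7 else 0
v = estimate_value(toy_playout, position=None, n_playouts=100)
```

With only 100 playouts the estimate is noisy (standard error around 0.05 near v = 0.5), which is one reason the playout result can disagree with the result recorded in the sgf.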
Re: [Computer-go] Training the value network (a possibly more efficient approach)
Hi zakki,

> I couldn't get positive experimental results on Ray.
> Rn's network structures for V and W are similar and share parameters;
> only the final convolutional layers are different.
> I trained Rn's network to minimize the MSE of V(s) + W(s).
> It uses only the KGS and GoGoD data sets, no self-play with the RL policy.

Thanks for sharing your results. Have you tried more stages of training V, in which the second method in my PDF is also used? (I.e., train the value network to fit the "observed move", as I feel it could improve the "awareness / sharpness" of V.)

Bo
Re: [Computer-go] Training the value network (a possibly more efficient approach)
It's nice to see so many discussions.

Another reason could be that training a good-quality v(s) (or V(s)) may require a somewhat different network structure from that of W(s). Usually it is helpful to have an ensemble of different networks, each constructed from different principles.

On 1/11/17, 22:19, "Computer-go on behalf of Gian-Carlo Pascutto" wrote:

> Combining this with Kensuke's comment, I think it might be worth trying
> to train V(s) and W(s) simultaneously, but with V(s) being the linear
> interpolation depending on move number, not the value function (which
> leaves us without a way to play handicap games and a bunch of other
> benefits).
>
> This could reduce overfitting during training, and if we only use W(s)
> during gameplay we still have the "strong signal" advantage.
>
> --
> GCP
Re: [Computer-go] Training the value network (a possibly more efficient approach)
On 10-01-17 23:25, Bo Peng wrote:
> Hi everyone. It occurs to me there might be a more efficient method to
> train the value network directly (without using the policy network).
>
> You are welcome to check my method:
> http://withablink.com/GoValueFunction.pdf

For Method 1 you state: "However, because v is a finer function than V (which is already finer than W), the bias is better controlled than in the case of W, and we can use all states in the game to train our network, instead of just picking 1 state in each game to avoid over-fitting."

This is intuitively true, and I'm sure it will reduce some overfitting behavior, but empirically the author of Aya reported the opposite, i.e. training on W/L is superior to a linear interpolation to the endgame. It's possible this happens because V(s) flipping from 0.5 to 0 or 1 more steeply helps the positions where this happens stand out from the MC noise.

Combining this with Kensuke's comment, I think it might be worth trying to train V(s) and W(s) simultaneously, but with V(s) being the linear interpolation depending on move number, not the value function (which leaves us without a way to play handicap games and a bunch of other benefits).

This could reduce overfitting during training, and if we only use W(s) during gameplay we still have the "strong signal" advantage.

--
GCP
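A minimal sketch of such a move-number interpolation target, assuming it runs linearly from 0.5 (no information) at the first position to the final result at the last (the exact form Aya used may differ, and the names here are my own):

```python
def interpolation_targets(outcome, num_moves):
    """Per-position training targets for V(s): a linear interpolation
    from 0.5 at the first position to the final game outcome at the last.

    `outcome` is 1.0 if the player we score for won, 0.0 otherwise.
    Returns one target per position, for t = 0 .. num_moves - 1.
    """
    return [0.5 + (outcome - 0.5) * (t / (num_moves - 1))
            for t in range(num_moves)]

targets = interpolation_targets(outcome=1.0, num_moves=5)
# Targets rise smoothly from 0.5 toward the final result 1.0.
```

Unlike the raw W/L label, these targets change gradually with move number, which is the "less steep" behavior discussed above.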
Re: [Computer-go] Training the value network (a possibly more efficient approach)
On 11-01-17 14:33, Kensuke Matsuzaki wrote:
> Hi,
>
> I couldn't get positive experimental results on Ray.
>
> Rn's network structures for V and W are similar and share parameters;
> only the final convolutional layers are different.
> I trained Rn's network to minimize the MSE of V(s) + W(s).
> It uses only the KGS and GoGoD data sets, no self-play with the RL policy.

How do you get the V(s) for those datasets? Do you play out the endgame with the Monte Carlo playouts?

I think one problem with this approach is that errors in the data for V(s) directly correlate with errors in the MC playouts. So a large benefit of "mixing" the two (otherwise independent) evaluations is lost. This problem doesn't exist when using raw W/L data from those datasets, or when using SL/RL playouts. (But note that using the full engine to produce games *would* suffer from the same correlation. That might be entirely offset by the higher quality of the data, though.)

> But I have no idea about how to use V(s) or v(s) in MCTS.

V(s) seems potentially useful for handicap games, where W(s) is no longer accurate. I don't see any benefit over W(s) for even games.

--
GCP
Re: [Computer-go] Training the value network (a possibly more efficient approach)
Hi,

I couldn't get positive experimental results on Ray.

Rn's network structures for V and W are similar and share parameters; only the final convolutional layers are different. I trained Rn's network to minimize the MSE of V(s) + W(s). It uses only the KGS and GoGoD data sets, no self-play with the RL policy. When training only W(s), the network overfits, but training V(s) + W(s) at the same time prevents overfitting. But I have no idea about how to use V(s) or v(s) in MCTS.

Rn.3.0-4c plays with W(s): winning rate.
http://www.yss-aya.com/19x19/cgos/cross/Rn.3.0-4c.html 3394 elo

Rn.3.1-4c plays with V(s): sum of ownership. A bit weaker (the MCTS part is tuned for W(s) now, so something may be wrong).
http://www.yss-aya.com/cgos/19x19/cross/Rn.3.1-4c.html 3218 elo

zakki

On Wed, 11 Jan 2017 at 19:49, Bo Peng wrote:
> Hi Remi,
>
> Thanks for sharing your experience.
>
> As I am writing this, it seems there could be a third method: the perfect
> value function should have the minimax property in the obvious way. So we
> can train our value function to satisfy the minimax property as well. In
> fact, we can train it such that a shallow-level MCTS gives as close a
> result as possible to a deeper-level MCTS. This can be regarded as a kind
> of bootstrapping.
>
> Wonder if you have tried this. It seems like a natural idea...
>
> Bo
>
> On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom" wrote:
>
>> Hi,
>>
>> Thanks for sharing your idea.
>>
>> In my experience it is rarely efficient to train value functions from
>> very short-term data (i.e., the next move). TD(lambda), or training from
>> the final outcome of the game, is often better, because it uses a longer
>> horizon. But of course, it is difficult to tell without experiments
>> whether your idea would work or not. The advantage of your idea is that
>> you can collect a lot of training data more easily.
>>
>> Rémi
>>
>> - Original message -
>> From: "Bo Peng"
>> To: computer-go@computer-go.org
>> Sent: Tuesday, 10 January 2017 23:25:19
>> Subject: [Computer-go] Training the value network (a possibly more
>> efficient approach)
>>
>> Hi everyone. It occurs to me there might be a more efficient method to
>> train the value network directly (without using the policy network).
>>
>> You are welcome to check my method:
>> http://withablink.com/GoValueFunction.pdf
>>
>> Let me know if there are any silly mistakes :)
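In miniature, the shared-trunk, two-head objective ("minimize MSE of V(s) + W(s)") can be sketched as below. This is a toy numpy stand-in with made-up layer names, not Rn's actual convolutional network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared trunk: one linear layer standing in for the shared conv stack.
W_shared = rng.normal(size=(8, 4))
# Two separate "final layers", one per head, as in Rn.
w_V = rng.normal(size=4)   # head for V(s): ownership/score-like signal
w_W = rng.normal(size=4)   # head for W(s): win/loss probability

def forward(x):
    h = np.tanh(x @ W_shared)              # shared features
    V = np.tanh(h @ w_V)                   # V head, in [-1, 1]
    W = 1.0 / (1.0 + np.exp(-(h @ w_W)))   # W head (sigmoid -> probability)
    return V, W

def loss(x, v_target, w_target):
    V, W = forward(x)
    # Summed MSE over both heads: minimizing this trains the shared trunk
    # on both signals at once, which is what prevented overfitting above.
    return np.mean((V - v_target) ** 2) + np.mean((W - w_target) ** 2)

x = rng.normal(size=(16, 8))               # 16 toy positions
v_t = rng.uniform(-1, 1, size=16)
w_t = rng.integers(0, 2, size=16).astype(float)
L = loss(x, v_t, w_t)
```

Because the trunk parameters are shared, the gradient of the summed loss acts as a multi-task regularizer on the common features.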
Re: [Computer-go] Training the value network (a possibly more efficient approach)
Hi Remi,

Thanks for sharing your experience.

As I am writing this, it seems there could be a third method: the perfect value function should have the minimax property in the obvious way. So we can train our value function to satisfy the minimax property as well. In fact, we can train it such that a shallow-level MCTS gives as close a result as possible to a deeper-level MCTS. This can be regarded as a kind of bootstrapping.

Wonder if you have tried this. It seems like a natural idea...

Bo

On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom" wrote:

> Hi,
>
> Thanks for sharing your idea.
>
> In my experience it is rarely efficient to train value functions from
> very short-term data (i.e., the next move). TD(lambda), or training from
> the final outcome of the game, is often better, because it uses a longer
> horizon. But of course, it is difficult to tell without experiments
> whether your idea would work or not. The advantage of your idea is that
> you can collect a lot of training data more easily.
>
> Rémi
>
> - Original message -
> From: "Bo Peng"
> To: computer-go@computer-go.org
> Sent: Tuesday, 10 January 2017 23:25:19
> Subject: [Computer-go] Training the value network (a possibly more
> efficient approach)
>
> Hi everyone. It occurs to me there might be a more efficient method to
> train the value network directly (without using the policy network).
>
> You are welcome to check my method:
> http://withablink.com/GoValueFunction.pdf
>
> Let me know if there are any silly mistakes :)
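One way to read the minimax property as a training signal is a one-step negamax bootstrap: V(s) should match the value of the best reply. A toy sketch (my own formulation for illustration; the PDF should be consulted for the actual proposal):

```python
def minimax_consistency_target(child_values):
    """One-step bootstrap target for V(s) under a negamax convention:
    each child value is evaluated from the opponent's perspective, so
    the value of s should equal the maximum over legal moves of minus
    the child's value. A trained V can then be regressed toward this
    target, pulling it toward minimax consistency.
    """
    return max(-v for v in child_values)

# Toy position with three candidate moves; child values are from the
# opponent's point of view, so the best move is the one the opponent
# likes least (the -0.7 child).
target = minimax_consistency_target([-0.2, 0.4, -0.7])
```

The deeper-vs-shallower MCTS version is the same idea with a search result in place of the one-step maximum.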
Re: [Computer-go] Training the value network (a possibly more efficient approach)
Hi,

Thanks for sharing your idea.

In my experience it is rarely efficient to train value functions from very short-term data (i.e., the next move). TD(lambda), or training from the final outcome of the game, is often better, because it uses a longer horizon. But of course, it is difficult to tell without experiments whether your idea would work or not. The advantage of your idea is that you can collect a lot of training data more easily.

Rémi

- Original message -
From: "Bo Peng"
To: computer-go@computer-go.org
Sent: Tuesday, 10 January 2017 23:25:19
Subject: [Computer-go] Training the value network (a possibly more efficient approach)

Hi everyone. It occurs to me there might be a more efficient method to train the value network directly (without using the policy network).

You are welcome to check my method: http://withablink.com/GoValueFunction.pdf

Let me know if there are any silly mistakes :)
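For reference, the TD(lambda) targets contrasted here with next-move training can be sketched with the standard backward recursion (an illustrative sketch assuming an episodic game with no intermediate rewards and no discounting; the naming is mine):

```python
def lambda_returns(values, outcome, lam):
    """TD(lambda) training targets for each position of one game.

    Backward recursion:
        target_T = outcome                      (final result, 1 = win)
        target_t = (1 - lam) * V(s_{t+1}) + lam * target_{t+1}

    `values` are the current network's estimates V(s_1) .. V(s_T).
    lam = 0 gives one-step bootstrapping from the next value estimate;
    lam = 1 gives plain Monte Carlo training on the final outcome.
    """
    G = outcome
    targets = [0.0] * len(values)
    for t in reversed(range(len(values))):
        targets[t] = G
        G = (1 - lam) * values[t] + lam * G
    return targets

# lam = 1 reduces to training every position on the final outcome:
mc = lambda_returns([0.4, 0.6, 0.8], outcome=1.0, lam=1.0)
```

Intermediate lambda values blend the long-horizon outcome signal with the shorter-horizon bootstrapped estimates.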
Re: [Computer-go] Training the value network (a possibly more efficient approach)
I was writing code along those lines when AlphaGo debuted. When it became clear that AlphaGo had succeeded, I ceased work. So I don't know whether this strategy will succeed, but the theoretical merits were good enough to encourage me.

Best of luck,
Brian

From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of Bo Peng
Sent: Tuesday, January 10, 2017 5:25 PM
To: computer-go@computer-go.org
Subject: [Computer-go] Training the value network (a possibly more efficient approach)

Hi everyone. It occurs to me there might be a more efficient method to train the value network directly (without using the policy network).

You are welcome to check my method: http://withablink.com/GoValueFunction.pdf

Let me know if there are any silly mistakes :)
Re: [Computer-go] Training the value network (a possibly more efficient approach)
hi Bo,

> Let me know if there are any silly mistakes :)

You say "the perfect policy network can be derived from the perfect value network (the best next move is the move that maximises the value for the player, if the value function is perfect), but not vice versa." However, a perfect policy for both players can be used to generate a perfect playout, which yields the perfect value...

regards,
-John