Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi zakki, > I couldn't get positive experiment results on Ray. > Rn's network structures for V and W are similar and share parameters; > only the final convolutional layers differ. > I trained Rn's network to minimize the MSE of V(s) + W(s). > It uses only the KGS and GoGoD data sets, no self play

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
It's nice to see so many discussions. Another reason could be that training a good-quality v(s) (or V(s)) may require a network structure different from that of W(s). It is usually helpful to have an ensemble of different networks, each constructed from different principles. On 1/11/17,

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi, > How do you get the V(s) for those datasets? Do you play out the endgame > with the Monte Carlo playouts? > > I think one problem with this approach is that errors in the data for > V(s) directly correlate with errors in the MC playouts. So a large benefit of > "mixing" the two (otherwise independent)
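The "mixing" referred to here is presumably the AlphaGo-style leaf evaluation, which blends the value network's estimate with a rollout result from the same position. A minimal sketch, with the mixing weight left as a free parameter rather than any engine's actual setting:

def mixed_leaf_value(value_net_estimate, rollout_result, mixing_weight=0.5):
    """Blend a value-network estimate with a Monte Carlo rollout result
    from the same leaf position (both in [-1, 1], same perspective)."""
    return (1 - mixing_weight) * value_net_estimate + mixing_weight * rollout_result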

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Gian-Carlo Pascutto
On 10-01-17 23:25, Bo Peng wrote: > Hi everyone. It occurs to me there might be a more efficient method to > train the value network directly (without using the policy network). > > You are welcome to check my > method: http://withablink.com/GoValueFunction.pdf > For Method 1 you state:

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Kensuke Matsuzaki
Hi, > How do you get the V(s) for those datasets? Do you play out the endgame > with the Monte Carlo playouts? Yes, I use the result of 100 playouts from the endgame. Sometimes the result stored in the sgf differs from the result of the playouts. zakki
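A minimal sketch of how such a label could be produced, assuming a hypothetical random_playout(position) that returns +1 for a Black win and -1 for a White win (a placeholder, not Ray's actual playout code):

import random

def random_playout(position):
    """Placeholder for a real Monte Carlo playout from the given endgame position."""
    return random.choice([1, -1])

def endgame_value(position, n_playouts=100):
    """Estimate V(s) for an endgame position as the mean result of n playouts."""
    return sum(random_playout(position) for _ in range(n_playouts)) / n_playouts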

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Kensuke Matsuzaki
Hi, I couldn't get positive experiment results on Ray. Rn's network structures for V and W are similar and share parameters; only the final convolutional layers differ. I trained Rn's network to minimize the MSE of V(s) + W(s). It uses only the KGS and GoGoD data sets, with no self-play with the RL policy.
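A rough sketch of that kind of architecture (a PyTorch-style illustration, not Ray's actual code; the layer widths and input planes are placeholders, and "MSE of V(s) + W(s)" is read here as the sum of the two MSE terms):

import torch
import torch.nn as nn

class SharedValueWinNet(nn.Module):
    """Shared convolutional trunk; only the final conv layers differ
    between the V(s) head and the W(s) head."""
    def __init__(self, in_planes=48, width=64, n_layers=5):
        super().__init__()
        layers = [nn.Conv2d(in_planes, width, 3, padding=1), nn.ReLU()]
        for _ in range(n_layers - 1):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.v_head = nn.Conv2d(width, 1, 1)  # final layer for V(s)
        self.w_head = nn.Conv2d(width, 1, 1)  # final layer for W(s)

    def forward(self, x):
        h = self.trunk(x)
        v = torch.tanh(self.v_head(h).mean(dim=(2, 3)))  # one scalar per position
        w = torch.tanh(self.w_head(h).mean(dim=(2, 3)))
        return v, w

def loss_fn(v_pred, w_pred, v_target, w_target):
    """Sum of the two MSE terms."""
    return nn.functional.mse_loss(v_pred, v_target) + nn.functional.mse_loss(w_pred, w_target)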

Re: [Computer-go] Computer-go - Simultaneous policy and value functions reinforcement learning by MCTS-TD-Lambda ?

2017-01-11 Thread patrick.bardou
Hi, 1) Simultaneous policy and value function reinforcement learning by MCTS-TD-Lambda? What is a good policy network, from a 'Policy & Value - MCTS' (PV-MCTS) point of view (i.e. in the AlphaGo implementation)? Referring to the terminology and results of Silver's paper, the greedy policy using RL

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Gian-Carlo Pascutto
On 11-01-17 14:33, Kensuke Matsuzaki wrote: > Hi, > > I couldn't get positive experiment results on Ray. > > Rn's network structures for V and W are similar and share parameters; > only the final convolutional layers differ. > I trained Rn's network to minimize the MSE of V(s) + W(s). > It uses

Re: [Computer-go] it's alphago

2017-01-11 Thread Darren Cook
> https://games.slashdot.org/story/17/01/04/2022236/googles-alphago-ai-secretively-won-more-than-50-straight-games-against-worlds-top-go-players The five Lee Sedol games last year never felt like they were probing AlphaGo's potential weaknesses, e.g. things like whole-board semeai, complex whole

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Xavier Combelle
On 11/01/2017 at 16:14, Bo Peng wrote: > Hi, >> How do you get the V(s) for those datasets? Do you play out the endgame >> with the Monte Carlo playouts? >> >> I think one problem with this approach is that errors in the data for >> V(s) directly correlate with errors in the MC playouts. So a large

[Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi John, > You say "the perfect policy network can be > derived from the perfect value network (the best next move is the move > that maximises the value for the player, if the value function is > perfect), but not vice versa.", but a perfect policy for both players > can be used to generate a perfect
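A minimal sketch of the derivation in that quote, with the board API (legal_moves, play) and the perfect value function passed in as hypothetical callables rather than taken from any real engine:

def greedy_policy_from_value(state, player, legal_moves, play, value):
    """value(position, player) is assumed to be a perfect value function scored
    from `player`'s fixed perspective. The best move for `player` is then simply
    the move that maximises the value of the resulting position."""
    return max(legal_moves(state), key=lambda move: value(play(state, move), player))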

Re: [Computer-go] Golois5 is KGS 4d

2017-01-11 Thread George Dahl
For those interested, the ICLR 2017 reviews for the paper: https://openreview.net/forum?id=Bk67W4Yxl On Tue, Jan 10, 2017 at 6:46 AM, Detlef Schmicker wrote: > Very interesting, > > but let's wait a few days to get an idea of the strength, > 4d it reached due to

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Rémi Coulom
Hi, Thanks for sharing your idea. In my experience it is rarely efficient to train value functions from very short-term data (i.e., the next move). TD(lambda), or training from the final outcome of the game, is often better, because it uses a longer horizon. But of course, it is difficult to tell
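A generic sketch of the distinction (plain TD(lambda) targets for a game with no intermediate rewards; not tied to any particular engine, and assuming all value estimates and the outcome share one fixed perspective, e.g. Black's):

def td_lambda_targets(values, outcome, lam=0.9):
    """values[t] is the current estimate V(s_t) for each position of one game;
    outcome is the final result (+1 or -1). lam=1.0 reproduces training on the
    final outcome only; lam=0.0 is one-step bootstrapping from the next estimate."""
    targets = [0.0] * len(values)
    g = outcome  # lambda-return from the terminal position
    for t in reversed(range(len(values))):
        bootstrap = values[t + 1] if t + 1 < len(values) else outcome
        g = (1 - lam) * bootstrap + lam * g
        targets[t] = g
    return targets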

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Bo Peng
Hi Remi, Thanks for sharing your experience. As I am writing this, it seems there could be a third method: the perfect value function must have the minimax property in the obvious way, so we can train our value function to satisfy the minimax property as well. In fact, we can train it such that
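One way to read that idea as a training signal (my interpretation of the truncated preview, not necessarily the method in the note): penalise the network whenever V(s) disagrees with the negamax backup over the children of s. A minimal sketch:

def minimax_consistency_loss(v_parent, v_children):
    """v_parent: predicted V(s), from the side-to-move's perspective at s.
    v_children: predicted V(s') for each legal child s', each from the
    side-to-move's perspective at s'. Negamax consistency requires
    V(s) = max over moves of -V(s'); penalise the squared disagreement."""
    backup = max(-v for v in v_children)
    return (v_parent - backup) ** 2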