On 10-01-17 23:25, Bo Peng wrote:
> Hi everyone. It occurs to me there might be a more efficient method to
> train the value network directly (without using the policy network).
>
> You are welcome to check my
> method: http://withablink.com/GoValueFunction.pdf
>
For Method 1 you state: "However, because v is a finer function than V (which is already finer than W), the bias is better controlled than in the case of W, and we can use all states in the game to train our network, instead of just picking 1 state in each game to avoid over-fitting."

This is intuitively true, and I'm sure it will reduce some overfitting behavior, but empirically the author of Aya reported the opposite: training on W/L is superior to training on a linear interpolation to the endgame result. It's possible this happens because V(s) flipping from 0.5 to 0 or 1 more steeply helps the positions where this happens stand out from the Monte Carlo noise.

Combining this with Kensuke's comment, I think it might be worth trying to train V(s) and W(s) simultaneously, with V(s) being the linear interpolation depending on move number rather than the value function (which leaves us without a way to play handicap games, and costs a bunch of other benefits). This could reduce overfitting during training, and if we only use W(s) during gameplay we still keep the "strong signal" advantage.

-- 
GCP
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
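[For readers following along, here is a minimal sketch of the two training targets being compared, and of the joint-training idea. The function names and the sum-of-squared-errors loss are my own illustration, not from any engine or from the thread; `result` is the game outcome from one player's perspective (1 = win, 0 = loss).]

```python
def wl_target(result, move_idx, num_moves):
    """Binary win/loss target W(s): every state in the game gets the
    same label, the final outcome."""
    return float(result)

def interp_target(result, move_idx, num_moves):
    """Linear-interpolation target: 0.5 (unknown) at the first move,
    ramping linearly to the actual outcome at the final move."""
    t = move_idx / num_moves          # fraction of the game played
    return 0.5 + t * (result - 0.5)

def joint_loss(w_pred, v_pred, result, move_idx, num_moves):
    """Train W(s) and V(s) simultaneously: one squared-error term per
    head, summed. Only W(s) would be consulted during play."""
    lw = (w_pred - wl_target(result, move_idx, num_moves)) ** 2
    lv = (v_pred - interp_target(result, move_idx, num_moves)) ** 2
    return lw + lv
```

For example, in a 100-move win, `interp_target(1, 50, 100)` gives 0.75, while `wl_target` gives 1.0 for every state.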