On 10-01-17 23:25, Bo Peng wrote:
> Hi everyone. It occurs to me there might be a more efficient method to
> train the value network directly (without using the policy network).
> 
> You are welcome to check my
> method: http://withablink.com/GoValueFunction.pdf
> 

For Method 1 you state:

"However, because v is an finer function than V (which is already finer
than W), the bias is better controlled than the case of W, and we can
use all states in the game to train our network, instead of just picking
1 state in each game to avoid over-fitting"

This is intuitively true, and I'm sure it reduces some overfitting
behavior, but empirically the author of Aya reported the opposite:
training on the raw W/L outcome is superior to training on a linear
interpolation towards the endgame result.

It's possible this happens because V(s) flipping more steeply from 0.5
towards 0 or 1 makes the positions where the game is actually decided
stand out from the MC noise.
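
To make the two training targets concrete, here is a rough Python sketch
of how both might be built from a single game record. The function name
and the 0.5 starting value for the interpolation are my own assumptions,
not anything from the paper or from Aya's report:

    def make_targets(final_result, total_moves):
        # final_result: 1.0 for a win, 0.0 for a loss, from one side's view.
        # Returns a (W_target, V_target) pair for every state in the game.
        targets = []
        for move_no in range(total_moves + 1):
            w_target = final_result                      # raw win/loss label
            frac = move_no / max(total_moves, 1)         # 0 at the start, 1 at the end
            v_target = 0.5 * (1.0 - frac) + final_result * frac  # interpolated target
            targets.append((w_target, v_target))
        return targets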

Combining this with Kensuke's comment, I think it might be worth trying
to train V(s) and W(s) simultaneously, with V(s) being the linear
interpolation that depends on move number rather than the value function
itself (which leaves us without a way to play handicap games and a bunch
of other benefits).

This could reduce overfitting during training, and if we only use W(s)
during gameplay we still have the "strong signal" advantage.
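
Purely as an illustration of what "train both simultaneously" could mean
in practice, here is a small PyTorch-style sketch with a shared trunk and
two output heads. The architecture, layer sizes and loss weighting are
placeholders of my own, not AlphaGo's or Bo Peng's actual setup:

    import torch
    import torch.nn as nn

    class TwoHeadValueNet(nn.Module):
        def __init__(self, in_planes=8, board=19):
            super().__init__()
            # Shared trunk; a placeholder conv stack, not a real Go architecture.
            self.trunk = nn.Sequential(
                nn.Conv2d(in_planes, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * board * board, 256), nn.ReLU(),
            )
            self.w_head = nn.Linear(256, 1)  # predicts the final win/loss W(s)
            self.v_head = nn.Linear(256, 1)  # predicts the interpolated V(s) target

        def forward(self, x):
            h = self.trunk(x)
            return torch.sigmoid(self.w_head(h)), torch.sigmoid(self.v_head(h))

    def joint_loss(w_pred, v_pred, w_target, v_target):
        # Both heads are trained; only the W head would be queried during play.
        mse = nn.functional.mse_loss
        return mse(w_pred, w_target) + mse(v_pred, v_target)

The V head would only act as a regularizer during training; at play time
it can be ignored, so the engine still sees the sharper W(s) signal.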

-- 
GCP