Re: [Computer-go] UEC cup 2nd day
FineArt won.

(;GM[1]SZ[19] PB[zen] PW[fineart] DT[2017-03-19]RE[W+R]KM[6.5]TM[30]RU[Japanese]PC[UEC, Tokyo]
;B[qd];W[dc];B[pq];W[dp];B[oc];W[po];B[qo];W[qn];B[qp];W[pm]
;B[pj];W[oq];B[pp];W[op];B[oo];W[pn];B[no];W[or];B[pr];W[lq]
;B[lo];W[rn];B[kq];W[kr];B[mr];W[mq];B[lr];W[kp];B[jq];W[lp]
;B[jr];W[jp];B[hq];W[hp];B[gp];W[ho];B[fq];W[ml];B[fn];W[jl]
;B[cn];W[dn];B[dm];W[co];B[bn];W[bo];B[en];W[do];B[ce];W[ed]
;B[mk];W[ll];B[ol];W[om];B[nl];W[nm];B[ic];W[ph];B[og];W[pg]
;B[oh];W[pe];B[ne];W[qe];B[rd];W[qj];B[qk];W[qi];B[pi];W[rk]
;B[gc];W[df];B[cg];W[eh];B[di];W[cm];B[cl];W[bm];B[bl];W[an]
;B[jf];W[dl];B[dk];W[em];B[bd];W[bc];B[cc];W[cb];B[cd];W[fc]
;B[gb];W[gd];B[hd];W[ge];B[hf];W[rq];B[ps];W[pk];B[eb];W[db]
;B[bb];W[ba];B[ab];W[gh];B[fi];W[dg];B[fh];W[fg];B[dh];W[ig]
;B[if];W[gg];B[ii];W[gi];B[mi];W[ei];B[dj];W[ki];B[kj];W[ji]
;B[jj];W[ih];B[dr];W[kg];B[lf];W[nj];B[ni];W[re];B[fj];W[sd]
;B[rc];W[od];B[pd];W[oe];B[nd];W[nc];B[ob];W[rb];B[sc];W[nf]
;B[mf];W[qb];B[pc];W[nb];B[na];W[lc];B[of];W[lb];B[pf];W[rg]
;B[se];W[rf];B[ma];W[sb];B[oa];W[la];B[ld];W[kd];B[kc];W[jc]
;B[ke];W[jd];B[jb];W[kb];B[ja];W[id];B[hc];W[md];B[me];W[mb]
;B[mc];W[ib];B[ia];W[md];B[rh];W[pb];B[pa];W[qh];B[mc];W[qa]
;B[sa];W[md];B[qg];W[le];B[qf];W[ri];B[ld];W[cf];B[bf];W[le]
;B[sh];W[si];B[ld];W[he];B[ie];W[le])

- Original Message -
From: "Hiroshi Yamashita"
To:
Sent: Sunday, March 19, 2017 2:38 PM
Subject: Re: [Computer-go] UEC cup 2nd day

Final is Fine Art vs Zen

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
[Computer-go] Training a "Score Network" in Monte-Carlo Tree Search
Training a policy network is simple, and I have found that a Residual Network with Batch Normalization works very well. However, training a value network is far more challenging: it is very easy to overfit, unless one uses the final territory as another prediction target. Even then, it will have difficulty handling life-and-death, because we don't have the computing resources of Tencent...

Another, separate issue is that calling the value network gives only the winning ratio of one board position. So if one wants to make moves directly with the value network, one has to call it on every board position reachable by a legal move, which is much slower than calling the policy network (which needs just one call).

Recently it occurred to me that training a "score network" may be a better choice than a policy / value network. The output of the score network is very simple: it is the winning ratio of every possible move, the same as Fig. 5a in the Nature paper.

( the pdf version of this document is at http://withablink.com/GoScoreNetwork.pdf )

The score network has four merits:

(1) It can directly replace both the policy network and the value network.

(2) We can do reinforcement learning on it directly, because we can train it to fit the MCTS result. This may be better than training with policy gradient (as in the Nature paper), because convergence to optimal play is guaranteed (since the convergence of MCTS to optimal play is guaranteed).

(3) In fact, one can use it directly to do UCT (MCTS without rollouts), and the self-improving process will be even simpler: one call gives hundreds of child nodes with winning ratios, and we can simply add them to our UCT tree (as if we had done the rollouts) and still use the UCB and selection-expansion-simulation-backpropagation algorithm. One might still need some rollouts near the end of the game (to make sure the score is correct). Some TD(0) might help as well.
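To make merit (3) concrete, here is a minimal sketch of expanding a UCT tree with a score network instead of rollouts: one network call produces a winning ratio for every child, each child is seeded as if one rollout had been performed, and ordinary UCB selection proceeds from there. Everything here is hypothetical scaffolding — `score_network` and `legal_moves` are stand-ins, not an actual engine.

```python
import math

def legal_moves(position):
    # Placeholder: a real engine enumerates the legal Go moves here.
    return ["a", "b", "c"]

def score_network(position):
    # Hypothetical stand-in for the score network: one call returns a
    # predicted winning ratio in [0, 1] for every legal move.
    return {move: 0.5 for move in legal_moves(position)}

class Node:
    def __init__(self, position, prior_win):
        self.position = position
        self.wins = prior_win   # seed with the network's winning ratio,
        self.visits = 1         # as if one rollout had been performed
        self.children = {}

def expand(node):
    # A single network call yields winning ratios for all children at once.
    for move, win in score_network(node.position).items():
        node.children[move] = Node(node.position + move, win)

def ucb_select(node, c=1.4):
    # Standard UCB1 over the seeded children.
    total = sum(ch.visits for ch in node.children.values())
    return max(node.children.values(),
               key=lambda ch: ch.wins / ch.visits
                              + c * math.sqrt(math.log(total) / ch.visits))
```

In a full search loop, `expand` replaces the simulation step of selection-expansion-simulation-backpropagation, and the backed-up value is the network's winning ratio rather than a rollout outcome.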
(4) Although one can do (2) and (3) with a value network, it is easy to overfit, because we are predicting just one single number. The score network is better in this respect.

The training process will be like this:

(1) Initial training. Use your value network / MCTS to compute the training data for the board positions in your SGFs.

(2) Fine-tuning. It might be helpful to then tune the network so that it is more likely to give the correct move in your professional-game SGFs, i.e. to make sure those moves maximize the winning ratio. In other words, we will be training it as if it were a policy network. I believe this will give a better starting point for the self-improving stage. One possible method is as follows: if $\{p_i\}$ are the network outputs and $a$ is the desired action, then we train $p_a$ to be $\max_i \{p_i\}$ (and probably also reduce the values of the other $p_i$ such that some weighted sum of all the $\{p_i\}$ is preserved).

(3) Self-improving. One can even randomly generate board positions and train the network to fit the MCTS result. The correlation of the board positions will hence never be a problem.

Bo
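One way to read the fine-tuning rule in step (2) — raise $p_a$ to $\max_i \{p_i\}$ while preserving a weighted sum of the outputs — is as a target-construction step. The sketch below is one possible instantiation under those assumptions (uniform weights by default, non-negative outputs); it is an illustration, not the author's exact procedure.

```python
import numpy as np

def finetune_target(p, a, weights=None):
    """Build a training target from network outputs p so that the desired
    action a attains the maximum output, while the weighted sum of all
    outputs is preserved by rescaling the other entries."""
    p = np.asarray(p, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    target = p.copy()
    target[a] = p.max()                      # train p_a up to max_i p_i
    others = np.arange(len(p)) != a
    # Budget left for the other entries so the weighted sum is unchanged.
    deficit = max((w * p).sum() - w[a] * target[a], 0.0)
    current = (w[others] * target[others]).sum()
    if current > 0:
        target[others] *= deficit / current  # shrink the rest proportionally
    return target
```

For example, with outputs (0.2, 0.5, 0.3) and desired action 0, the target becomes (0.5, 0.3125, 0.1875): action 0 now attains the maximum and the sum is still 1.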
Re: [Computer-go] Training a "Score Network" in Monte-Carlo Tree Search
A few more words.

*) Pushing this idea to the extreme, one might want to build a "Tree Network" whose output somehow tries to fit the whole Monte-Carlo search tree (including all the win/lose numbers etc.) for the board position. As we know, a deep network can fit anything. The structure of such a network requires some thought, as we certainly shouldn't fit the whole tree directly.

*) To improve the life-and-death knowledge of the network, it might help to build a very aggressive opponent (whose policy is biased towards fighting moves) for self-play. As another example, if your network has problems with ladders / mirror go, it is probably better to build an opponent that is fond of ladder / mirror-go moves and use the resulting MCTS results to train your network (instead of patching your code to do a ladder search).

*) Could we build a distributed training project, like Folding@home / Bitcoin mining? Otherwise individuals / small groups won't have any chance against the large companies.

On 3/20/17, 03:48, "Computer-go on behalf of Bo Peng" wrote: