Re: [Computer-go] UEC cup 2nd day

2017-03-19 Thread Hiroshi Yamashita

FineArt won.

(;GM[1]SZ[19]
PB[zen]
PW[fineart]
DT[2017-03-19]RE[W+R]KM[6.5]TM[30]RU[Japanese]PC[UEC, Tokyo]
;B[qd];W[dc];B[pq];W[dp];B[oc];W[po];B[qo];W[qn];B[qp];W[pm]
;B[pj];W[oq];B[pp];W[op];B[oo];W[pn];B[no];W[or];B[pr];W[lq]
;B[lo];W[rn];B[kq];W[kr];B[mr];W[mq];B[lr];W[kp];B[jq];W[lp]
;B[jr];W[jp];B[hq];W[hp];B[gp];W[ho];B[fq];W[ml];B[fn];W[jl]
;B[cn];W[dn];B[dm];W[co];B[bn];W[bo];B[en];W[do];B[ce];W[ed]
;B[mk];W[ll];B[ol];W[om];B[nl];W[nm];B[ic];W[ph];B[og];W[pg]
;B[oh];W[pe];B[ne];W[qe];B[rd];W[qj];B[qk];W[qi];B[pi];W[rk]
;B[gc];W[df];B[cg];W[eh];B[di];W[cm];B[cl];W[bm];B[bl];W[an]
;B[jf];W[dl];B[dk];W[em];B[bd];W[bc];B[cc];W[cb];B[cd];W[fc]
;B[gb];W[gd];B[hd];W[ge];B[hf];W[rq];B[ps];W[pk];B[eb];W[db]
;B[bb];W[ba];B[ab];W[gh];B[fi];W[dg];B[fh];W[fg];B[dh];W[ig]
;B[if];W[gg];B[ii];W[gi];B[mi];W[ei];B[dj];W[ki];B[kj];W[ji]
;B[jj];W[ih];B[dr];W[kg];B[lf];W[nj];B[ni];W[re];B[fj];W[sd]
;B[rc];W[od];B[pd];W[oe];B[nd];W[nc];B[ob];W[rb];B[sc];W[nf]
;B[mf];W[qb];B[pc];W[nb];B[na];W[lc];B[of];W[lb];B[pf];W[rg]
;B[se];W[rf];B[ma];W[sb];B[oa];W[la];B[ld];W[kd];B[kc];W[jc]
;B[ke];W[jd];B[jb];W[kb];B[ja];W[id];B[hc];W[md];B[me];W[mb]
;B[mc];W[ib];B[ia];W[md];B[rh];W[pb];B[pa];W[qh];B[mc];W[qa]
;B[sa];W[md];B[qg];W[le];B[qf];W[ri];B[ld];W[cf];B[bf];W[le]
;B[sh];W[si];B[ld];W[he];B[ie];W[le])


- Original Message - 
From: "Hiroshi Yamashita" 

To: 
Sent: Sunday, March 19, 2017 2:38 PM
Subject: Re: [Computer-go] UEC cup 2nd day



The final is:

Fine Art vs Zen


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] Training a "Score Network" in Monte-Carlo Tree Search

2017-03-19 Thread Bo Peng
Training a policy network is simple, and I have found that a Residual Network
with Batch Normalization works very well. Training a value network, however,
is far more challenging: it overfits very easily unless one uses the final
territory as an additional prediction target. Even then, it will have
difficulty handling life-and-death, because we won't have the computing
resources of Tencent...

A separate issue is that calling the value network only gives the winning
ratio of a single board position. So if one wants to make moves directly with
the value network, one has to call it once for every board position reachable
by a legal move, which is much slower than calling the policy network (which
needs just one call).
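To make the cost difference concrete, here is a minimal, purely illustrative
sketch: the "nets" are random-number stubs and legal_moves / play are
hypothetical placeholders (not anyone's actual engine API). Selecting a move
with only a value network needs one forward pass per legal move, while a
policy-style output scores every move in a single pass.

import random

def legal_moves(position):            # hypothetical placeholder move generator
    return list(range(19 * 19 + 1))   # all points plus pass

def play(position, move):             # hypothetical placeholder: successor position
    return position + (move,)

def value_net(position):              # stub: winning ratio of ONE position, in [0, 1]
    return random.random()

def policy_net(position):             # stub: one score per legal move, from one call
    return [random.random() for _ in legal_moves(position)]

def move_by_value_net(position):
    # One network call PER legal move: evaluate each successor (which is the
    # opponent's position, hence the 1 - ...) and pick the best for us.
    return max(legal_moves(position),
               key=lambda m: 1.0 - value_net(play(position, m)))

def move_by_policy_net(position):
    # A single network call already scores every move.
    scores = policy_net(position)
    return max(range(len(scores)), key=scores.__getitem__)

print(move_by_value_net(()), move_by_policy_net(()))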

Recently it occurred to me that training a "score network" may be a better
choice than a policy / value network. The output of the score network is
very simple: it is just the winning ratio of every possible move, the same as
Fig. 5a in the Nature paper.

( the pdf version of this document is at
http://withablink.com/GoScoreNetwork.pdf )
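For concreteness, here is a minimal sketch of what such a score network could
look like, written in PyTorch (my choice, not something specified above). The
input planes, channel count and tower depth are arbitrary illustration values;
the only essential point is the head: a sigmoid output with one winning ratio
per move (19x19 points plus pass), rather than a softmax over moves or a
single scalar.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # 3x3 conv + BN residual block, as in the policy network mentioned above.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return torch.relu(x + y)

class ScoreNetwork(nn.Module):
    # Outputs a winning ratio in [0, 1] for each of the 19*19 + 1 moves.
    def __init__(self, in_planes=18, channels=128, blocks=6):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        self.head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * 19 * 19, 19 * 19 + 1))

    def forward(self, planes):
        # Sigmoid rather than softmax: each move gets its own winning ratio,
        # and the outputs are not required to sum to one.
        return torch.sigmoid(self.head(self.tower(self.stem(planes))))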

The score network has four merits:

(1) It can directly replace both the policy network and the value network.

(2) We can do reinforcement learning on it directly, because we can train
it to fit the MCTS result. This may be better than training with the policy
gradient (as in the Nature paper), because convergence to optimal play is
guaranteed (since MCTS itself is guaranteed to converge to optimal play).

(3) In fact, one can use it directly for UCT (MCTS without rollouts), and the
self-improving process becomes even simpler: a single call gives winning
ratios for hundreds of child nodes, so we can add them all to the UCT tree
(as if we had already done the rollouts) and still run the usual UCB
selection-expansion-simulation-backpropagation loop. One might still need
some rollouts when the game is close to the end (to make sure the score is
correct), and some TD(0) might help as well. A minimal sketch of this
expansion is given after this list.

(4) Although one can do (2) and (3) with the value network as well, it
overfits easily because we are predicting only a single number. The score
network is better in this respect.
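As one concrete reading of merit (3), the sketch below shows how a single
score-network call could seed every child of a UCT node at once. The Node
fields and the "count the prior as one virtual rollout" convention are my
assumptions, not part of the proposal itself.

import math

class Node:
    def __init__(self, win_rate_prior=0.5):
        # Seed each node with the network's winning ratio, counted as if a
        # single rollout with that result had already been played.
        self.wins = win_rate_prior
        self.visits = 1
        self.children = {}            # move -> Node

def expand(node, move_win_rates):
    # move_win_rates: {move: winning ratio} from ONE score-network call.
    # All children are added at once; no per-child rollout is needed.
    for move, win_rate in move_win_rates.items():
        node.children[move] = Node(win_rate_prior=win_rate)

def select_child(node, c=1.4):
    # Standard UCB1 over the seeded children.
    total = sum(ch.visits for ch in node.children.values())
    def ucb(ch):
        return ch.wins / ch.visits + c * math.sqrt(math.log(total) / ch.visits)
    return max(node.children.items(), key=lambda kv: ucb(kv[1]))

def backup(path, result):
    # Propagate a result (network evaluation or end-game rollout) up the
    # selected path, flipping the point of view at every ply.
    for node in reversed(path):
        node.visits += 1
        node.wins += result
        result = 1.0 - result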

The training process will be like this:

(1) Initial training. Use your value network / MCTS to compute the
training data for the board positions in your SGFs.

(2) Fine-tuning. It might then be helpful to tune it so that it is more
likely to give the correct move in your professional-game SGFs, i.e. to make
sure those moves maximize the winning ratio. In other words, we train it as
if it were a policy network. I believe this gives a better starting point for
the self-improving stage.

One possible method is this: if $\{p_i\}$ are the network outputs and $a$ is
the desired action, then we train $p_a$ towards $\max_i \{p_i\}$ (and probably
also reduce the values of the other $p_i$ so that some weighted sum of all the
$\{p_i\}$ is preserved); one possible loss along these lines is sketched after
this list.

(3) Self-improving. One can even generate board positions at random and train
the network to fit the MCTS result, so correlation between board positions
will never be a problem.
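The fine-tuning target in step (2) could be stated as a loss along the
following lines. This is only one possible reading: the squared-error form,
the default uniform weights, and the use of a frozen copy of the
pre-fine-tuning network as the reference for the "preserved weighted sum" are
all my assumptions.

import torch

def fine_tune_loss(p, p_ref, a, weights=None):
    # p       : (batch, moves) outputs p_i of the network being fine-tuned
    # p_ref   : (batch, moves) outputs of a frozen pre-fine-tuning copy (no grad)
    # a       : (batch,) index of the professional move in each position
    # weights : optional per-move weights for the preserved sum
    if weights is None:
        weights = torch.ones_like(p)

    # Pull p_a up towards max_i p_i (detached, so the maximum itself is not
    # dragged down to meet p_a).
    target_a = p.max(dim=1).values.detach()
    p_a = p.gather(1, a.unsqueeze(1)).squeeze(1)
    pull_up = (p_a - target_a).pow(2).mean()

    # Keep the weighted sum of all outputs close to what the original network
    # produced, so that raising p_a pushes the other p_i down instead of
    # inflating every output.
    preserve = ((weights * p).sum(dim=1) - (weights * p_ref).sum(dim=1)).pow(2).mean()
    return pull_up + preserve

In use, p_ref would come from a copy of the network saved before fine-tuning
and evaluated under torch.no_grad().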

Bo


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Training a "Score Network" in Monte-Carlo Tree Search

2017-03-19 Thread Bo Peng
A few more words...

*) Pushing this idea to the extreme, one might want to build a "Tree
Network" whose output tries to somehow fit the whole Monte-Carlo search tree
(including all the win/loss counts etc.) for the board position. As we know,
a deep network can fit anything. The structure of the network requires some
thought, as we certainly shouldn't fit the whole tree directly.

*) To improve the life-and-death knowledge of the network, it might help
to use a very aggressive opponent (whose policy is biased towards fighting
moves) in self-play. As another example, if your network has problems with
ladders or mirror Go, it is probably better to make an opponent that is fond
of ladder / mirror-go moves and train your network on the resulting MCTS
results (instead of patching your code to do a ladder search).

*) Could we build a distributed training project like Folding@home or
Bitcoin mining? Otherwise individuals and small groups won't have any
chance against the large companies.


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go