Re: [Computer-go] Value network that doesn't want to learn.

Brian Sheppard via Computer-go Fri, 23 Jun 2017 05:43:57 -0700

>... my value network was trained to tell me the game is balanced at the 
>beginning...


:-)

The best training policy is to select positions that correct errors.

I used the policies below to train a backgammon NN. Together, they reduced the 
expected loss of the network by 50% (cut the error rate in half):

- Select training positions from the program's own games.
        - Can be self-play or versus an opponent.
        - Best is to have a broad panel of opponents.
        - Beneficial to bootstrap with pro games, but after that add ONLY training 
examples from the program's own games.
- Train only on the moves made by the winner of the game.
        - Very important for deterministic games!
        - Note that the winner can be either your program or the opponent.
        - If your program wins, then training reinforces good behavior; if the 
opponent wins, then training corrects bad behavior.
- Per game, you should aim to get only a few training examples (3 in 
backgammon. Maybe 10 in Go?). Use two policies:
        - Select positions where the static evaluation of a position is 
significantly different from a deep search
        - Select positions where the move selected by a deep search did not 
have the highest static evaluation. (And in this case you have two training 
positions, which differ by the move chosen.)
        - Of course, you are selecting examples where you did as badly as 
possible.
- The training value of the position is the result of a deep search.
        - This is equivalent to "temporal difference learning", but accelerated 
by the depth of the search.
        - Periodically refresh the training evaluations as your search/eval 
improve.
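
The selection policies above can be sketched roughly as follows. This is only 
an illustration, not the code I used: the Candidate and Decision records, the 
0.1 threshold, and the rule of keeping the largest errors first are all 
assumptions made for the example. Evaluations are assumed to be from the 
mover's point of view, higher being better.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Candidate:
    position: str        # position reached after one legal move (hypothetical encoding)
    static_eval: float   # network output with no search
    deep_eval: float     # value returned by a deep search


@dataclass
class Decision:
    mover: str           # side that made the move at this point in the game
    position: str        # position before the move
    static_eval: float
    deep_eval: float
    candidates: List[Candidate] = field(default_factory=list)


def select_examples(decisions: List[Decision], winner: str,
                    threshold: float = 0.1,
                    max_examples: int = 10) -> List[Tuple[str, float]]:
    """Return up to max_examples (position, training_value) pairs for one game."""
    scored = []  # entries are (error, position, training_value)
    for d in decisions:
        if d.mover != winner:
            continue  # train only on the winner's moves
        # Policy 1: static evaluation differs significantly from the deep search.
        error = abs(d.static_eval - d.deep_eval)
        if error > threshold:
            scored.append((error, d.position, d.deep_eval))
        # Policy 2: the move preferred by the deep search did not have the
        # highest static evaluation. Both resulting positions become examples.
        if d.candidates:
            by_static = max(d.candidates, key=lambda c: c.static_eval)
            by_deep = max(d.candidates, key=lambda c: c.deep_eval)
            if by_static is not by_deep:
                for c in (by_static, by_deep):
                    scored.append((abs(c.static_eval - c.deep_eval),
                                   c.position, c.deep_eval))
    # Keep only the few worst mistakes; the training value is the deep-search result.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(pos, value) for _, pos, value in scored[:max_examples]]
```

The max_examples cap corresponds to the "only a few examples per game" advice 
(3 in backgammon, maybe 10 in Go). Refreshing the training evaluations would 
mean re-running the deep search on stored positions and replacing the stored 
values.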

These policies actively seek out cases where your evaluation function has some 
weakness, so training is definitely focused on improving results in the 
distribution of positions that your program will actually face.

You will need about 30 training examples for every free parameter in your NN. 
You can do the math on how many games that will take. It is inevitable: you 
will train your NN based on blitz games.
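
To make "the math" concrete, here is a small sketch. The 30 examples per 
parameter and the examples-per-game figures come from the text above; the 
network size of one million parameters is purely a hypothetical number for 
illustration, and games_needed is a made-up helper name.

```python
import math


def games_needed(num_parameters: int, examples_per_game: int,
                 examples_per_parameter: int = 30) -> int:
    """Games required to collect enough training examples for the network."""
    examples = num_parameters * examples_per_parameter
    return math.ceil(examples / examples_per_game)


# Hypothetical Go network with 1,000,000 free parameters, 10 examples per game.
print(games_needed(1_000_000, 10))
```

With those assumed numbers the answer is 3,000,000 games, which is why blitz 
games are the only practical source.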

Good luck!



_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
