Hi,

  I got the first *somewhat* positive results in my attempt to reproduce
AlphaGo Zero - a 25% winrate against GNUGo on the easiest reasonable
task, the 7x7 board. :)  a.k.a.

        "Sometimes beating GNUGo on a tiny board" without human knowledge

(much wow!)

  Normally this would be a pretty weak result, but (A) I wanted to
help calibrate other efforts on larger boards that are possibly still
at the "random" stage, and (B) I'll probably move on to other projects
again soon, so this might be as good as it gets for me.

  I started the project by replacing the MC simulations with a Keras
model in my 550-line educational Go program Michi - it lived in Michi's
`nnet` branch until now, when I separated it into a project of its own:

        https://github.com/rossumai/nochi

Starting from a small base means that the codebase is tiny and should be
easy to follow, though it's not nearly as tidy as Michi is.
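
  For context, the core of that change can be sketched roughly like
this (a hypothetical snippet - `encode_position` and the `node` fields
are made-up names, not Nochi's actual API): instead of scoring a tree
leaf with Monte Carlo playouts, the leaf gets a single forward pass of
the two-headed Keras network.

    import numpy as np

    def evaluate_leaf(model, encode_position, node):
        """Replace the Monte Carlo playout at a tree leaf with one
        network call (sketch, not Nochi's actual code)."""
        x = encode_position(node.pos)          # (7, 7, planes) input tensor
        policy, value = model.predict(x[np.newaxis])  # batch of one
        node.prior = policy[0]     # move priors used when expanding children
        return float(value[0, 0])  # used in place of a playout result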

You can grab the current training state (== a pickled, chronological
archive of the selfplay positions used for replay) and the neural
network weights from the GitHub "Releases" page:

        https://github.com/rossumai/nochi/releases/tag/G171107T013304_000000150
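
  (If you want to poke at these programmatically, something along these
lines should work - the file names below are placeholders rather than
the actual names in the release, and build_model() stands for whatever
reconstructs the training-time architecture:)

    import pickle

    with open("selfplay_positions.pickle", "rb") as f:  # placeholder name
        positions = pickle.load(f)  # chronological list of selfplay positions

    model = build_model()               # must match the training architecture
    model.load_weights("weights.h5")    # placeholder name for Keras weights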

  This is a truly "zero-knowledge" system like AlphaGo Zero - it needs
no supervision, and it contains no Monte Carlo simulations or other
heuristics. But it's not entirely 1:1; I made some tweaks which I thought
might help early convergence:

  * AlphaGo used 19 resnet layers for 19x19, so I used 7 layers for 7x7
    (a rough model sketch follows after this list).
  * The neural network is updated after _every_ game, _twice_, on _all_
    positions plus 64 randomly sampled positions from the entire history;
    this is all done four times - on the original position and the three
    symmetry flips (but I was too lazy to implement 90-degree rotation).
    See the replay sketch after this list.
  * Instead of supplying the last 8 positions as the network input, I
    feed just the last position plus two indicator matrices showing the
    locations of the last and second-to-last moves (see the encoding
    sketch after this list).
  * No symmetry pruning during tree search.
  * The value function is trained with cross-entropy rather than MSE,
    there is no L2 regularization, and plain Adam is used rather than
    hand-tuned SGD (but the annealing is reset from time to time due to
    manual restarts of the script from a checkpoint).
  * No automatic resign threshold, but it is important to play 25% of
    the games without resigning to escape local "optima".
  * 1/Temperature is 2 for the first three moves (see the move-selection
    sketch after this list).
  * Initially I used 1000 "simulations" per move, but by mistake the
    last 1500 games, during which the network improved significantly
    (see below), were run with 2000 simulations per move.  So that
    might matter.
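
  To make the architecture and loss tweaks above concrete, here is a
rough Keras sketch of the kind of network described - the filter count,
head sizes and the single-plane board encoding are my guesses for
illustration, not necessarily what Nochi actually does:

    from keras.layers import (Input, Conv2D, BatchNormalization, Activation,
                              Dense, Flatten, add)
    from keras.models import Model

    N, FILTERS, BLOCKS = 7, 64, 7  # board size, filters (guess), residual blocks

    def residual_block(x):
        y = Conv2D(FILTERS, 3, padding='same')(x)
        y = BatchNormalization()(y)
        y = Activation('relu')(y)
        y = Conv2D(FILTERS, 3, padding='same')(y)
        y = BatchNormalization()(y)
        return Activation('relu')(add([x, y]))

    # Input: last position as one plane + two last-move indicator planes.
    inp = Input(shape=(N, N, 3))
    x = Conv2D(FILTERS, 3, padding='same')(inp)
    x = Activation('relu')(BatchNormalization()(x))
    for _ in range(BLOCKS):
        x = residual_block(x)

    # Policy head: distribution over N*N points plus pass.
    policy = Dense(N * N + 1, activation='softmax',
                   name='policy')(Flatten()(Conv2D(2, 1)(x)))

    # Value head: squashed to (0, 1) so it can be trained with
    # cross-entropy instead of MSE (my reading of the bullet above).
    v = Dense(64, activation='relu')(Flatten()(Conv2D(1, 1)(x)))
    value = Dense(1, activation='sigmoid', name='value')(v)

    model = Model(inp, [policy, value])
    model.compile(optimizer='adam',  # plain Adam, no L2 regularization
                  loss={'policy': 'categorical_crossentropy',
                        'value': 'binary_crossentropy'})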
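
  The last-position-plus-indicators input encoding could look something
like this (again hypothetical - the exact board representation in Nochi
may differ):

    import numpy as np

    def encode_position(board, last_move, second_last_move, N=7):
        """Last board position plus two one-hot planes marking where the
        last and second-to-last moves were played (sketch)."""
        planes = np.zeros((N, N, 3), dtype=np.float32)
        planes[:, :, 0] = board          # e.g. +1 own, -1 opponent, 0 empty
        if last_move is not None:
            planes[last_move[0], last_move[1], 1] = 1.0
        if second_last_move is not None:
            planes[second_last_move[0], second_last_move[1], 2] = 1.0
        return planes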
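
  And the per-game replay update with the three symmetry flips, under
the assumption that a training example is an (input planes, target
policy plane, game result) triple and that the pass probability is
simply padded with zero (both assumptions are mine, for illustration):

    import random
    import numpy as np

    def flips(a):
        """The original array and its three symmetry flips (no 90-degree
        rotations, matching the bullet above)."""
        return [a, np.flipud(a), np.fliplr(a), np.flipud(np.fliplr(a))]

    def update_after_game(model, game_positions, history, n_sampled=64):
        """Two training passes over all positions of the finished game
        plus 64 positions sampled from the whole selfplay history, each
        shown in its four symmetry variants (sketch)."""
        batch = list(game_positions) + random.sample(
            history, min(n_sampled, len(history)))
        X, P, V = [], [], []
        for x, pi, z in batch:
            for xf, pif in zip(flips(x), flips(pi)):
                X.append(xf)
                P.append(np.append(pif.ravel(), 0.0))  # zero-padded pass slot
                V.append(z)
        model.fit(np.array(X), [np.array(P), np.array(V)], epochs=2, verbose=0)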
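
  Finally the temperature handling, assuming "1/Temperature is 2" means
the visit counts are raised to the power 2 before sampling (AlphaGo Zero
style) and that later moves are simply played greedily - both are my
reading, not a statement about Nochi's code:

    import numpy as np

    def pick_move(visit_counts, move_number):
        """Sample the first three moves from visit counts ** (1/T) with
        1/T = 2; afterwards play the most visited move (sketch)."""
        counts = np.asarray(visit_counts, dtype=np.float64)
        if move_number < 3:
            probs = counts ** 2.0
            probs /= probs.sum()
            return int(np.random.choice(len(counts), p=probs))
        return int(np.argmax(counts))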

  This has been running for two weeks, self-playing 8500 games.  A week
ago its moves already looked a bit natural, but it was stuck in various
local optima.  Three days ago it beat GNUGo once across 20 games; now it
wins five times across 20 games - so I'll let it self-play a little
longer, as it might surpass GNUGo quickly at this point.  This late
improvement also coincides with the increased simulation count.

  At the same time, Nochi supports supervised training (with the rest
kept the same), which I'm now experimenting with on 19x19.

  Happy training,

-- 
                                        Petr Baudis, Rossum
        Run before you walk! Fly before you crawl! Keep moving forward!
        If we fail, I'd rather fail really hugely.  -- Moist von Lipwig
