On Fri, Oct 20, 2017 at 08:02:02PM +0000, Gian-Carlo Pascutto wrote:
> On Fri, Oct 20, 2017, 21:48 Petr Baudis <[email protected]> wrote:
>
> > Few open questions I currently have, comments welcome:
> >
> > - there is no input representing the number of captures; is this
> > information somehow implicit or can the learned winrate predictor
> > never truly approximate the true values because of this?
> >
>
> They are using Chinese rules, so prisoners don't matter. There are simply
> fewer stones of one color on the board.
Right! No idea what I was thinking.
> > - what ballpark values for c_{puct} are reasonable?
> >
>
> The original paper has the value they used. But this likely needs tuning. I
> would tune with a supervised network to get started, but you need games for
> that. Does it even matter much early on? The network is random :)
The network actually adapts quite rapidly initially, in my experience.
(That doesn't mean it improves - it adapts within local optima of the few
games it has played so far.)
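For the record, here is a rough Python sketch of where c_{puct} enters the
PUCT selection rule, so it's clear what the constant trades off (the node
fields and the default value are just placeholders - the value itself is
exactly what needs tuning):

import math

def puct_select(node, c_puct=1.5):
    # Pick the child maximizing Q(s,a) + U(s,a), where
    #   U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
    # `children`, `visits`, `value_sum` and `prior` are assumed node fields.
    sqrt_total = math.sqrt(sum(child.visits for child in node.children))

    def score(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = c_puct * child.prior * sqrt_total / (1 + child.visits)
        return q + u

    return max(node.children, key=score)

A large c_{puct} keeps following the priors longer, a small one trusts the
observed values sooner, so the right ballpark probably depends on how good
the network already is.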
> > - why is the Dirichlet noise applied only at the root node, if it's
> > useful?
> >
>
> It's only used to get some randomness in the move selection, no? It's not
> actually useful for anything besides that.
Yes, but why wouldn't you want that randomness in the second or third
move?
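For concreteness, the root-noise step from the paper is just a mix of the
priors with a Dirichlet sample (the paper gives eps = 0.25 and alpha = 0.03
for Go); a minimal sketch, with the node fields as placeholders:

import numpy as np

def add_root_noise(root, epsilon=0.25, alpha=0.03):
    # Mix Dirichlet noise into the root priors only:
    #   P(s,a) <- (1 - eps) * prior_a + eps * eta_a,  with eta ~ Dir(alpha)
    # `children` and `prior` are assumed node fields.
    noise = np.random.dirichlet([alpha] * len(root.children))
    for child, eta in zip(root.children, noise):
        child.prior = (1 - epsilon) * child.prior + epsilon * eta

Doing the same at deeper nodes would be cheap, so the question stands.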
> > - the training process is quite lazy - it's not like the network sees
> > each game immediately and adjusts; it looks at the last 500k games and
> > samples 1000*2048 positions, meaning about 4 positions per game (if
> > I understood this right) - I wonder what would happen if we trained
> > it more aggressively, and what AlphaGo does during the initial 500k
> > games; currently, I'm training on all positions immediately, I guess
> > I should at least shuffle them ;)
> >
>
> I think the laziness may be related to the concern that reinforcement
> methods can easily "forget" things they had learned before. The value
> network training also likes positions from distinct games.
That makes sense. I still hope that with a much more aggressive
training schedule we could train a reasonable Go player, perhaps at the
expense of worse scaling at very high Elo ratings... (At least I feel
optimistic after discovering a stupid bug in my code.)
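For comparison, the lazy scheme above amounts to a sliding window over
recent games with minibatches sampled across many distinct games; a rough
sketch (only the 500k-game window and the 2048-position batch size come
from the description above, the rest is illustrative):

import random
from collections import deque

class ReplayBuffer:
    # Sliding window over the most recent self-play games; each minibatch is
    # drawn across many distinct games, so a single game contributes only a
    # few positions per training step.
    def __init__(self, max_games=500_000):
        self.games = deque(maxlen=max_games)  # each entry: list of positions

    def add_game(self, positions):
        self.games.append(positions)

    def sample_batch(self, batch_size=2048):
        # Pick a random game, then a random position within it; this is only
        # roughly uniform over positions (short games are slightly favoured),
        # but it shows the idea.
        return [random.choice(random.choice(self.games))
                for _ in range(batch_size)]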
--
Petr Baudis, Rossum
Run before you walk! Fly before you crawl! Keep moving forward!
If we fail, I'd rather fail really hugely. -- Moist von Lipwig
_______________________________________________
Computer-go mailing list
[email protected]
http://computer-go.org/mailman/listinfo/computer-go