@Gian-Carlo,
Indeed, according to that paper a multi-labelled value net/head sounds like a good 
way to inject more signal into the network, and hence more reinforcement learning 
signal when learning from scratch.
I was wondering if it could also be beneficial for bootstrapping the policy 
net/head. Since the score of randomly played games is likely to have high 
variance, I suppose that in most positions close to a final position the moves 
will have very similar action values, as estimated by the MCTS search, and hence 
a weak reinforcement signal (all or most of the moves lead to the same outcome, 
either a win or a loss). A multi-labelled value head could produce multi-labelled 
action values, which might have more spread than with a fixed 6.5 komi and so 
allow a better ranking of moves (by playing on the komi / averaging over the 
various possible komi values?).
Put differently, learning to increase the final score might be a good starting 
point for the policy net/head, at least in the bootstrapping phase.
This could be combined with prioritized sampling biased towards positions with a 
low reverse move count from the endgame, as I mentioned in an earlier post and as 
you propose for the first round of learning the win/loss of the final position.
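For what it's worth, a rough sketch of the kind of prioritized sampling I have in 
mind; the exponential decay and its temperature tau are just assumptions of mine, 
not something from the paper:

import numpy as np

def sampling_weights(moves_to_end, tau=30.0):
    # moves_to_end: reverse move count per stored position (0 = final position).
    # Positions close to the end of their game get the largest weight, so the
    # first training rounds mostly see positions whose win/loss (or score) is
    # easy to predict; tau controls how fast the bias falls off.
    w = np.exp(-np.asarray(moves_to_end, dtype=float) / tau)
    return w / w.sum()

# Example: draw a minibatch of indices from a buffer of self-play positions.
rng = np.random.default_rng(0)
moves_to_end = rng.integers(0, 300, size=100_000)  # stand-in buffer metadata
p = sampling_weights(moves_to_end)
batch_idx = rng.choice(len(p), size=256, replace=False, p=p)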
Patrick

-------- Original message --------
From: computer-go-requ...@computer-go.org 
Date: 26/10/2017  16:17  (GMT+01:00) 
To: computer-go@computer-go.org 
Subject: Computer-go Digest, Vol 93, Issue 34 


------------------------------

Message: 2
Date: Thu, 26 Oct 2017 15:17:43 +0200
From: Gian-Carlo Pascutto <g...@sjeng.org>
To: computer-go@computer-go.org
Subject: Re: [Computer-go] AlphaGo Zero

On 25-10-17 16:00, Petr Baudis wrote:
> That makes sense.  I still hope that with a much more aggressive 
> training schedule we could train a reasonable Go player, perhaps at
> the expense of worse scaling at very high elos...  (At least I feel 
> optimistic after discovering a stupid bug in my code.)

By the way, a trivial observation: the initial network is random, so
there's no point in using it for playing the first batch of games. It
won't do anything useful until it has run a learning pass on a bunch of
"win/loss" scored games and it can at least tell who is the likely
winner in the final position (even if it mostly won't be able to make
territory at first).

This suggests that bootstrapping probably wants 500k starting games with
just random moves.

FWIW, it does not seem easy to get the value part of the network to
converge in the dual-res architecture, even when taking the appropriate
steps (1% weighting on error, strong regularizer).

-- 
GCP


------------------------------

Message: 3
Date: Thu, 26 Oct 2017 15:55:23 +0200
From: Roel van Engelen <gosuba...@gmail.com>
To: computer-go@computer-go.org
Subject: Re: [Computer-go] Source code (Was: Reducing network size?
        (Was: AlphaGo Zero))

@Gian-Carlo Pascutto

Since training uses a ridiculous amount of computing power, I wonder if it
would be useful to make certain changes for future research, like training
the value head with multiple komi values <https://arxiv.org/pdf/1705.10701.pdf>

