On Fri, Oct 20, 2017, 21:48 Petr Baudis <pa...@ucw.cz> wrote:

>   Few open questions I currently have, comments welcome:
>   - there is no input representing the number of captures; is this
>     information somehow implicit or can the learned winrate predictor
>     never truly approximate the true values because of this?

They are using Chinese rules (area scoring), so prisoners don't matter.
Captures simply show up as fewer stones of that color on the board.
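
To make that concrete, here is a minimal sketch of area scoring in Python.
The board encoding ("B", "W", "." for empty) and the function name are my
own assumptions for illustration, and it ignores komi and dead-stone
removal. The point is only that the score is computed from the final board
alone, with no prisoner count anywhere in the signature.

```python
def area_score(board):
    """Return (black, white) area for a finished position.

    board is a list of equal-length strings using "B", "W" and "."
    (hypothetical encoding). Area = stones on the board plus empty
    regions that touch stones of only one color.
    """
    rows, cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1  # every stone on the board counts
                continue
            if (r, c) in seen:
                continue
            # Flood-fill the empty region and record which colors border it.
            size, borders, stack = 0, set(), [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols:
                        neighbor = board[ny][nx]
                        if neighbor == ".":
                            if (ny, nx) not in seen:
                                seen.add((ny, nx))
                                stack.append((ny, nx))
                        else:
                            borders.add(neighbor)
            if len(borders) == 1:
                score[borders.pop()] += size  # region owned by one color
    return score["B"], score["W"]


black, white = area_score([".BW",
                           "BBW",
                           ".BW"])
```

Any stones captured earlier in the game are just absent from `board`, which
is all the network input needs to reflect.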

>   - what ballpark values for c_{puct} are reasonable?

The original paper has the value they used, but this likely needs tuning
for your setup. I would tune with a supervised network to get started, but
you need games for that. Does it even matter much early on? The network is
random :)
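
For reference, a minimal sketch of where c_{puct} enters, using the
selection rule as written in the AlphaGo Zero paper: pick the move
maximising Q(s,a) + U(s,a), with
U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
The function name and the flat-list representation are assumptions of mine,
not anyone's actual implementation.

```python
import math


def puct_select(priors, visit_counts, q_values, c_puct):
    """Return the index of the child maximising Q + U.

    priors, visit_counts and q_values are parallel lists over the
    children of one node (hypothetical layout, for illustration only).
    """
    sqrt_total = math.sqrt(sum(visit_counts))

    def score(a):
        # Exploration bonus: large for high-prior, rarely visited moves.
        u = c_puct * priors[a] * sqrt_total / (1 + visit_counts[a])
        return q_values[a] + u

    # max() returns the first index on ties, e.g. when nothing is visited yet.
    return max(range(len(priors)), key=score)


# Same node statistics, two different constants.
priors = [0.6, 0.4]
visits = [10, 0]
q_vals = [0.5, 0.0]
greedy = puct_select(priors, visits, q_vals, c_puct=0.1)
exploring = puct_select(priors, visits, q_vals, c_puct=5.0)
```

With the small constant the search keeps following the move with the better
Q; with the large one it switches to the unvisited move. With a random
network Q and P are both close to noise, so the exact value probably
matters less early on.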

>   - why is the dirichlet noise applied only at the root node, if it's
>     useful?

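It's only used to get some randomness into the move selection, no? The move
that actually gets played is chosen at the root, so that is the only place
the noise needs to go. It's not actually useful for anything besides that.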
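
A minimal sketch of the root-only mixing, assuming the form given in the
AlphaGo Zero paper: P(s,a) = (1 - eps) * p_a + eps * eta_a with
eta ~ Dir(0.03) and eps = 0.25. The function names are hypothetical, and
the Dirichlet sample is built from normalised gamma draws so that only the
standard library is needed.

```python
import random


def sample_dirichlet(n, alpha, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over n outcomes."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
        total = sum(draws)
        if total > 0.0:  # retry in the rare case every draw underflows to 0
            return [d / total for d in draws]


def add_root_noise(priors, eps=0.25, alpha=0.03, rng=random):
    """Mix Dirichlet noise into the root priors only.

    Non-root nodes keep the raw network priors; this is applied once
    per search, to the root's children.
    """
    noise = sample_dirichlet(len(priors), alpha, rng)
    return [(1.0 - eps) * p + eps * n for p, n in zip(priors, noise)]


rng = random.Random(0)
root_priors = [0.5, 0.3, 0.2]
noisy_priors = add_root_noise(root_priors, rng=rng)
```

With alpha this small, almost all of the noise mass lands on one or two
moves, so each search gives a random move a real chance of being explored.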

>   - the training process is quite lazy - it's not like the network sees
>     each game immediately and adjusts, it looks at last 500k games and
>     samples 1000*2048 positions, meaning about 4 positions per game (if
>     I understood this right) - I wonder what would happen if we trained
>     it more aggressively, and what AlphaGo does during the initial 500k
>     games; currently, I'm training on all positions immediately, I guess
>     I should at least shuffle them ;)

I think the laziness may be related to the concern that reinforcement
learning methods can easily "forget" things they had learned before. The
value network training also likes positions from distinct games, since
positions from the same game are strongly correlated.
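
A minimal sketch of that kind of lazy sampling, under my own assumptions:
the class and method names are made up, and it picks a game first and then
a position within it, which is a simplification (not the same as sampling
uniformly over all positions) that happens to spread a batch across
distinct games.

```python
import random
from collections import deque


class ReplayWindow:
    """Keep only the most recent max_games games and sample from them."""

    def __init__(self, max_games):
        # deque(maxlen=...) silently drops the oldest game when full.
        self.games = deque(maxlen=max_games)

    def add_game(self, positions):
        positions = list(positions)
        if positions:  # skip empty games so sampling never fails
            self.games.append(positions)

    def sample(self, batch_size, rng=random):
        batch = []
        for _ in range(batch_size):
            game = self.games[rng.randrange(len(self.games))]
            batch.append(game[rng.randrange(len(game))])
        return batch


# Arithmetic from the question: 1000 batches of 2048 over 500k games.
positions_per_game = 1000 * 2048 / 500_000

# Tiny usage example: a window of 3 games, fed 5 games of 10 positions each.
window = ReplayWindow(max_games=3)
for game_id in range(5):
    window.add_game((game_id, move) for move in range(10))
batch = window.sample(50, rng=random.Random(0))
```

So "about 4 positions per game" checks out (4.096), and the two oldest
games in the example can no longer appear in any batch.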


Computer-go mailing list
