Patrick, for what it's worth, I think almost no-one will have seen your
email, because laposte.net claims it's forged. Either your email server
or laposte.net's is misconfigured.

> Referring to Silver's paper terminology and results, a greedy policy
> using the RL Policy Network beat a greedy policy using the SL Policy
> Network, but PV-MCTS performed better when used with the SL Policy
> Network than with the RL Policy Network. The authors hypothesized
> that this is "presumably because humans select a diverse beam of
> promising moves, whereas RL optimizes for the single best move".

I've always found this to be a rather strange argument. If the breadth
of the selection is an issue, it can be addressed by tuning the UCT
parameters and the prior differently; it doesn't need to be baked into
the DCNN itself.
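
For illustration, here is a rough Python sketch of prior-biased
selection done in the search rather than in the network (in the spirit
of a PUCT-style formula; the names Node, c_puct and select_child are
made up for this example, not taken from any particular engine):

    import math

    class Node:
        def __init__(self, prior):
            self.prior = prior      # DCNN probability of the move leading here
            self.visits = 0
            self.value_sum = 0.0
            self.children = {}      # move -> Node

        def q(self):
            return self.value_sum / self.visits if self.visits else 0.0

    def select_child(node, c_puct=1.5):
        """Pick the child maximising Q plus a prior-weighted exploration bonus.

        A larger c_puct (or a flatter prior) widens the selection; a
        smaller one narrows it.  That knob lives in the search code, so
        the DCNN itself never has to be retrained to change the breadth.
        """
        total = math.sqrt(node.visits + 1)
        return max(
            node.children.items(),
            key=lambda kv: kv[1].q()
            + c_puct * kv[1].prior * total / (1 + kv[1].visits),
        )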

Someone on the list made a different argument: when there are several
good shape moves and one move that tactically resolves the situation,
SL may prefer the shape moves. But SL has poor tactical awareness, so
resolving the situation might actually be the better choice for it,
and this is what RL learns to favor strongly. Compare this with
playouts (which also have little tactical awareness) strongly favoring
moves that settle the local situation. I find this a more persuasive
argument.

> Thus, one quality of a policy function to be used to bias the search
> in an MCTS is a good balance between 'sharpness' (being selective)
> and 'open-mindedness' (giving a chance to some low-value moves which
> could turn out to be important; avoiding blind spots).

Because of the above I disagree with this: it is a matter of tuning
the UCT parameters. The goal of the DCNN should be to give as
objective a judgment as possible of the likelihood that a move is
best.
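
As a sketch of what that separation could look like (the temperature
and uniform_mix parameters below are purely illustrative, not taken
from any published engine): leave the network output alone as the
objective estimate, and if you want more open-mindedness, flatten it
at search time before using it as a prior.

    import numpy as np

    def reshape_priors(policy, temperature=1.0, uniform_mix=0.0):
        """Flatten or sharpen a DCNN policy output at search time.

        policy       -- raw move probabilities from the network (sums to 1)
        temperature  -- >1 flattens (more open-minded), <1 sharpens
        uniform_mix  -- fraction of probability mass spread evenly over moves
        """
        p = np.asarray(policy, dtype=float) ** (1.0 / temperature)
        p /= p.sum()
        return (1.0 - uniform_mix) * p + uniform_mix / len(p)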

> Could someone direct me to literature exploring this idea, or
> explaining why it doesn't work in practice?

I think simply no-one has tried it yet, at least publicly. There are
many other ideas to explore.

> I'm wondering if someone has ever considered using a gradient of
> temperature in the softmax layer of the policy network, with the
> temperature parameter varying with depth in the tree, so that the
> search is broader in the first levels and becomes narrower in the
> deepest levels (ultimately, it would turn the search into a rollout
> to the end of the game for the deepest nodes).

Don't typical UCT implementations already do this? If you use priors
and scale them down with the number of visits a node has had, you get
the described effect. Or, going the other way, progressive widening
has the same effect.
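
Here is a sketch of the progressive-widening variant (the schedule
k = base * visits**alpha is just one common choice, and the parameter
values are made up): shallow nodes have many visits and so consider
many moves, deep nodes have few visits and stay narrow, which gives
exactly the depth-dependent broadening you describe, for free.

    def eligible_moves(moves_by_prior, visits, base=2, alpha=0.4):
        """Return the moves the search may consider at this node.

        moves_by_prior -- moves sorted by descending DCNN prior
        visits         -- number of times this node has been visited
        """
        k = max(1, int(base * (visits + 1) ** alpha))
        return moves_by_prior[:k]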

You seem to be thinking that all of this fudging of probabilities has
to be done at the DCNN level, but why not do it in the MCTS/UCT search
directly? It has more information, after all.

-- 
GCP
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
