Patrick, for what it's worth, I think almost no-one will have seen your email, because laposte.net claims it is forged. Either your email server or laposte.net's is misconfigured.
> Referring to the terminology and results of Silver's paper, a greedy
> policy using the RL policy network beat a greedy policy using the SL
> policy network, but PV-MCTS performed better with the SL policy
> network than with the RL policy network. The authors hypothesized
> that this is "presumably because humans select a diverse beam of
> promising moves, whereas RL optimizes for the single best move".

I've always found this a rather strange argument. If the width of the
selection is the issue, that can be addressed by tuning the UCT
parameters and the prior differently; it doesn't need to be tuned in
the DCNN itself.

Someone on the list made a different argument: when there are several
good shape moves and one move that tactically resolves the situation,
SL may prefer the shape moves. But SL has poor tactical awareness, so
resolving the situation outright may serve it better, and this is what
RL learns to strongly favor. Compare this with playouts (which also
have little tactical awareness of their own) strongly favoring
settling the local situation. I find this a more persuasive argument.

> Thus, one quality of a policy function to be used to bias the search
> in MCTS is a good balance between 'sharpness' (being selective) and
> 'open-mindedness' (giving a chance to some low-value moves that
> could turn out to be important; avoiding blind spots).

Because of the above, I disagree: this is a matter of tuning the UCT
parameters. The goal of the DCNN should be to give as objective a
judgment as possible of how likely each move is to be best.

> Could someone direct me to literature exploring this idea, or
> explaining why it doesn't work in practice?

I think simply no-one has tried it yet, at least publicly. There are
many other ideas to explore.

> I'm wondering if someone has ever considered using a gradient of
> temperature in the softmax layer of the policy network, with the
> temperature parameter varying with depth in the tree, so that the
> search is broader at the first levels and becomes narrower at the
> deepest levels (ultimately, it would turn the search into a rollout
> to the end of the game for the deepest nodes).

Don't typical UCT implementations already do this? If you use priors
and scale them down with the number of visits a node has had, you get
the described effect. Or, going the other way, progressive widening
has the same effect (rough sketches of both mechanisms, and of the
temperature idea itself, follow below). You seem to be thinking that
all of this fudging of probabilities has to be done at the DCNN level,
but why not do it directly in the MCTS/UCT search? It has more
information, after all.

--
GCP
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
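For concreteness, a minimal sketch of the prior-scaling effect described
above, using a PUCT-style selection rule; the Node fields and the c_puct
constant are illustrative assumptions, not code from any particular engine:

import math

class Node:
    """Toy tree node: policy-network prior, visit count, value sum."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the DCNN policy
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # accumulated backed-up values W(s, a)

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(children, parent_visits, c_puct=1.5):
    """Pick the child maximizing Q + c * P * sqrt(N_parent) / (1 + n_child).

    While a child has few visits the prior term dominates, so the search
    follows the policy network broadly; as visits accumulate the prior's
    influence fades and the observed value Q takes over.  This is the
    "scale the priors down with the amount of visits" effect.
    """
    def score(child):
        u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
        return child.q() + u
    return max(children, key=score)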
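Progressive widening, mentioned as the other route to the same effect, can
be sketched as follows; the k = ceil(c * N**alpha) schedule is a common
textbook form and an assumption here, not something taken from the thread:

import math

def widened_moves(moves_with_priors, parent_visits, c=1.0, alpha=0.5):
    """Only the top-k moves (ranked by policy prior) are searchable, with
    k = ceil(c * N**alpha) growing in the parent's visit count N.

    Nodes near the root collect many visits and therefore consider many
    moves, while deep nodes with few visits stay narrow -- broad at the
    top of the tree and sharp at the bottom, driven by the search itself
    rather than by the network.
    """
    k = max(1, math.ceil(c * parent_visits ** alpha))
    ranked = sorted(moves_with_priors, key=lambda mp: mp[1], reverse=True)
    return ranked[:k]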
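Finally, for comparison, the temperature-gradient idea itself can also be
applied outside the network: re-softmax the policy logits with a temperature
that shrinks with tree depth. The exponential-decay schedule and its
constants below are illustrative assumptions:

import numpy as np

def priors_at_depth(logits, depth, t_root=2.0, t_min=0.25, decay=0.8):
    """Softmax of the policy logits at temperature T(depth).

    T(depth) = max(t_min, t_root * decay**depth): a high temperature near
    the root spreads probability over many moves; as depth grows, T falls
    and the distribution sharpens towards the single best move, approaching
    a near-deterministic rollout policy at the deepest nodes.
    """
    t = max(t_min, t_root * decay ** depth)
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()                      # for numerical stability
    p = np.exp(z)
    return p / p.sum()

# e.g. priors_at_depth(logits, depth=0) is much flatter than at depth=12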