Re: [Computer-go] PUCT formula

2018-03-09 Thread Brian Sheppard via Computer-go
Thanks for the explanation. I agree that there is no actual consistency in 
exploration terms across historical papers.

I confirmed that the PUCT formulas across the AG, AGZ, and AZ papers are all 
consistent. That is unlikely to be an error. So now I am wondering whether the 
faster decay is useful for bootstrapping the learner. A short search (800 
trials) can run out of trials if it sticks with its first thought for too long.

So... following GCP's advice, I will adopt the AGZ formula as published. And 
not worry much about it, as MM advises.  :-)

Many thanks to all.

-Original Message-
From: Computer-go <computer-go-boun...@computer-go.org> On Behalf Of Martin 
Mueller
Sent: Friday, March 9, 2018 5:13 PM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] PUCT formula

I talked to Chenjun just now so this is what we both remember.
The PUCB formula as published in Chris Rosin’s paper actually has an additive 
knowledge term, and it looks nothing like the two different PUCT variants tried 
in AlphaGo and our paper.

Chenjun tried an additive term as in Chris’ paper first, and it did not work 
well for him. Then he tried the “UCT-like” PUCT as in our paper, with the decay 
term under the square root. This worked well for him. He never tried the 
AlphaGo formula with the faster decay.

Beyond the published papers, my memory is that many people tried many different 
combinations of knowledge terms and of decay formulas over time. I have never 
read any systematic comparison, or any principled argument for which type of 
decay is “best” for which type of knowledge, and why. It must depend on the 
quality of knowledge, amongst many other things.

There are also earlier MoGo papers that combined many different evaluation 
terms into one formula and tested them empirically.

Martin

> However, the formula in the AGZ paper doesn't look like any "UCT variant". 
> Formula from paper: Cpuct * P(s,a) * sqrt(Sum(N(s,b))) / (1 + N(s,a)) Note 
> that there is no logarithmic term, and the division by N+1 falls outside the 
> sqrt. For comparison, a normal UCT term looks like sqrt(ln(sum(N(s,b))) / (1 
> + N))
> 
> Since I asked my question, I found that other people have also noticed a 
> discrepancy. I saw a post on a DeepChem board about this subject. I also 
> found a paper 
> (https://webdocs.cs.ualberta.ca/~mmueller/ps/2016/2016-Integrating-Factorization-Ranked.pdf)
>  by our old friends Chenjun Xiao and Martin Muller:
> 
> "We apply a variant of PUCT [11] formula which is used in AlphaGo [12] to 
> integrate FBT knowledge in MCTS. " But then the formula that they give 
> differs: argmax((Q(s,a) + Cpuct * P(s,a) * sqrt( lg(N(s)) / (1 + N(s,a)))
> 
> I am guessing that Chenjun and Martin decided (or knew) that the AGZ paper 
> was incorrect and modified the equation accordingly.
> 
> Anyone remember anything about this?
> 
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] PUCT formula

2018-03-09 Thread Martin Mueller
I talked to Chenjun just now so this is what we both remember.
The PUCB formula as published in Chris Rosin’s paper actually has an additive 
knowledge term, and it looks nothing like the two different PUCT variants tried 
in AlphaGo and our paper.

Chenjun tried an additive term as in Chris’ paper first, and it did not work 
well for him. Then he tried the “UCT-like” PUCT as in our paper, with the decay 
term under the square root. This worked well for him. He never tried the 
AlphaGo formula with the faster decay.

Beyond the published papers, my memory is that many people tried many different 
combinations of knowledge terms and of decay formulas over time. I have never 
read any systematic comparison, or any principled argument for which type of 
decay is “best” for which type of knowledge, and why. It must depend on the 
quality of knowledge, amongst many other things.

There are also earlier MoGo papers that combined many different evaluation 
terms into one formula and tested them empirically.

Martin

> However, the formula in the AGZ paper doesn't look like any "UCT variant". 
> Formula from paper: Cpuct * P(s,a) * sqrt(Sum(N(s,b))) / (1 + N(s,a)) Note 
> that there is no logarithmic term, and the division by N+1 falls outside the 
> sqrt. For comparison, a normal UCT term looks like sqrt(ln(sum(N(s,b))) / (1 
> + N))
> 
> Since I asked my question, I found that other people have also noticed a 
> discrepancy. I saw a post on a DeepChem board about this subject. I also 
> found a paper 
> (https://webdocs.cs.ualberta.ca/~mmueller/ps/2016/2016-Integrating-Factorization-Ranked.pdf)
>  by our old friends Chenjun Xiao and Martin Muller:
> 
> "We apply a variant of PUCT [11] formula which is used in AlphaGo [12] to 
> integrate FBT knowledge in MCTS. " But then the formula that they give 
> differs: argmax((Q(s,a) + Cpuct * P(s,a) * sqrt( lg(N(s)) / (1 + N(s,a)))
> 
> I am guessing that Chenjun and Martin decided (or knew) that the AGZ paper 
> was incorrect and modified the equation accordingly.
> 
> Anyone remember anything about this?
> 
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] PUCT formula

2018-03-09 Thread Gian-Carlo Pascutto
On 09-03-18 18:03, Brian Sheppard via Computer-go wrote:

> I am guessing that Chenjun and Martin decided (or knew) that the AGZ
> paper was incorrect and modified the equation accordingly.
> 

I doubt it's just the paper that was incorrect, given that the formula
has been given without log already in the original Alpha Go Lee Sedol paper.

Of course it would be funny if it was a mistake and just got copy pasted.

I never tried "fixing" the formula, I just tried the original (with
priors), and what they gave, and strength was rather similar.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] PUCT formula

2018-03-09 Thread Brian Sheppard via Computer-go
However, the formula in the AGZ paper doesn't look like any "UCT variant". 
Formula from paper: Cpuct * P(s,a) * sqrt(Sum(N(s,b))) / (1 + N(s,a)) Note that 
there is no logarithmic term, and the division by N+1 falls outside the sqrt. 
For comparison, a normal UCT term looks like sqrt(ln(sum(N(s,b))) / (1 + N))

Since I asked my question, I found that other people have also noticed a 
discrepancy. I saw a post on a DeepChem board about this subject. I also found 
a paper 
(https://webdocs.cs.ualberta.ca/~mmueller/ps/2016/2016-Integrating-Factorization-Ranked.pdf)
 by our old friends Chenjun Xiao and Martin Muller: 

"We apply a variant of PUCT [11] formula which is used in AlphaGo [12] to 
integrate FBT knowledge in MCTS. " But then the formula that they give 
differs: argmax((Q(s,a) + Cpuct * P(s,a) * sqrt( lg(N(s)) / (1 + N(s,a)))

I am guessing that Chenjun and Martin decided (or knew) that the AGZ paper was 
incorrect and modified the equation accordingly.

Anyone remember anything about this?

-Original Message-
From: Computer-go <computer-go-boun...@computer-go.org> On Behalf Of Gian-Carlo 
Pascutto
Sent: Friday, March 9, 2018 4:48 AM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] PUCT formula

On 08-03-18 18:47, Brian Sheppard via Computer-go wrote:
> I recall that someone investigated this question, but I don’t recall 
> the result. What is the formula that AGZ actually uses?

The one mentioned in their paper, I assume.

I investigated both that and the original from the referenced paper, but after 
tuning I saw little meaningful strength difference.

One thing of note is that (IIRC) the AGZ formula keeps scaling the exploration 
term by the policy prior forever. In the original formula, it is a diminishing 
term.

--
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] PUCT formula

2018-03-09 Thread Gian-Carlo Pascutto
On 08-03-18 18:47, Brian Sheppard via Computer-go wrote:
> I recall that someone investigated this question, but I don’t recall the
> result. What is the formula that AGZ actually uses?

The one mentioned in their paper, I assume.

I investigated both that and the original from the referenced paper, but
after tuning I saw little meaningful strength difference.

One thing of note is that (IIRC) the AGZ formula keeps scaling the
exploration term by the policy prior forever. In the original formula,
it is a diminishing term.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] PUCT formula

2018-03-08 Thread Brian Sheppard via Computer-go
In the AGZ paper, there is a formula for what they call “a variant of the PUCT 
algorithm”, and they cite a paper from Christopher Rosin: 
http://gauss.ececs.uc.edu/Workshops/isaim2010/papers/rosin.pdf

 

But that paper has a formula that he calls the PUCB formula, which incorporates 
the priors in a different way.

 

And there is something called the PUCT algorithm, from our old friend Olivier 
Teytaud (et al): https://hal.inria.fr/hal-00835352/document/, but it is not 
about incorporating prior probabilities. It is about progressive widening in a 
provably consistent way.

 

I recall that someone investigated this question, but I don’t recall the 
result. What is the formula that AGZ actually uses?

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go