I am very enthusiastic about CLOP tuning. I overcame some roadblocks along
the way that I want to share with you.

You want CLOP to optimize strength, but it is actually optimizing Strength +
Luck + Avoidance + Exploitation. Using CLOP effectively requires mitigating
the last three factors.

BTW, I imagine that "CLOP" could be any fully automated parameter tuning
solution. That is, nothing here is really specific to CLOP. It just happens
that CLOP is the first fully automated parameter tuning system that I have
gotten to work.


LUCK
----
Rémi has diagnosed the Luck factor: the win rate of the optimal setting is
probably overestimated. This is not a big deal, provided that tuning runs go
long enough for the average and optimum win rates to be close together
(e.g., within a few rating points).

If you change parameters to the ones recommended by CLOP, then the next CLOP
run might claim that your program is weaker than before. This is just the
luck vanishing, so I am not tempted to revert parameter settings in such
cases.

A more subtle point is that CLOP does not measure the parameter
combination that it recommends. The recommendation is a weighted average of
points that it has measured, which converges (we think) to the optimum when
the win rate is a smooth function of the parameters. Actual performance can
therefore differ from the projection, especially if the win rate is not a
smooth function.
For example, Pebbles tunes against 2 opponents using 105 starting positions
that are played with each color, making 420 initial situations. Both
opponents and Pebbles are pseudo-random, so there is a great deal of variety
from each initial situation. Still, you can imagine that the win rate is not
entirely a smooth function of parameters. Maybe if you change a parameter by
a small amount, then 10 similar initial situations will switch from wins to
losses. 
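
As a toy illustration (entirely made-up numbers, and not CLOP's actual
algorithm): a recommendation formed as a weighted average of the
well-scoring sample points lands close to, but not exactly on, the best
measured point.

    # Toy illustration with made-up numbers (not CLOP's actual algorithm).
    samples  = [1.0, 2.0, 3.0, 4.0, 5.0]        # sampled parameter values
    win_rate = [0.50, 0.51, 0.56, 0.56, 0.52]   # measured win rates

    weights = [max(w - 0.50, 0.0) for w in win_rate]  # favour good samples
    recommended = sum(p * w for p, w in zip(samples, weights)) / sum(weights)
    best_measured = samples[win_rate.index(max(win_rate))]

    print(recommended, best_measured)   # ~3.6 vs 3.0: close, not identical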

That is just a reflection of how tuning is performed, so I accept the
recommended changes; doing anything else would drive me crazy. I simply
trust that additional tuning will probably improve results.

Basically: do sufficiently long runs (maybe 20K to 30K games when tuning 2
or 3 parameters?) until Average < Optimal < Average + K rating points (maybe
K = 5?). And then blindly accept the new parameters without worrying about
it.
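
For concreteness, here is a minimal sketch of that stopping check (my own
illustration in Python; K = 5 and "rating points" on the usual Elo scale
are assumptions, not something CLOP prescribes):

    import math

    def elo(win_rate):
        # Rating difference (Elo scale) implied by a win rate against a
        # fixed opponent pool.
        return 400.0 * math.log10(win_rate / (1.0 - win_rate))

    def done_tuning(average, optimal, k=5.0):
        # Stop when the CLOP "optimal" estimate beats the average win rate,
        # but by no more than k rating points.
        return average < optimal and elo(optimal) - elo(average) < k

    # Made-up example: 55.0% average, 55.6% estimated optimum.
    print(done_tuning(0.550, 0.556))   # True: the gap is about 4 rating points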


AVOIDANCE
---------
"Avoidance" means: your program has weaknesses, and CLOP can tune parameters
so that weaknesses are less likely to trigger.

Avoidance makes your program stronger to some extent. That is, by avoiding
your weaknesses you can play better. But there are obvious limitations: you
cannot expect opponents to cooperate, yet your search engine will use your
own play to model the opponent.

Pebbles has about 60 parameters, and I tuned them in groups of 2 or 3
parameters over about a dozen runs. Pebbles' win rate rose from 47% to 55%.
I was supremely happy, because that much improvement would be a good year's
work, and CLOP did it in just a month.

But then I integrated some bug fixes, and the win rate dropped to 48%. What
went wrong?

For several weeks I was convinced that I had broken something, but I was
unable to find anything. I verified every change using diffs, and I restored
parameters, but was unable to make the win rate return to 56%.

What I think happened is that tuning had tweaked Pebbles to such an extent
that it was now very sensitive to perturbations. There are 420 starting
situations and ~60 parameters, so tuning each parameter only has to
"switch" a few wins to push the win rate really high. I had fixed a few
bugs that I discovered while the tuning was going on, and that was enough
of a perturbation to make the win rate plummet.

Figuring this out took a long time, so I now have a "don't make yourself
crazy" policy here, too. I fix bugs as I find them, and integrate bug fixes
into the tuning version ASAP.

Now if the win rate drops suddenly, then there is an excellent chance that
my last change was incorrect. And because the change is recent, the problem
is easy to find.

BTW, tuning makes bugs easier to find. Your program is usually operating
with carefully selected parameters. The tuning process creates an altogether
different distribution of positions, which tends to expose logical errors.
The result is that I have an endless supply of bugs to fix.


EXPLOITATION
------------
"Exploitation" means: tuning will create situations that the opposition
handles badly. There is the same potential for good and bad as with
Avoidance, but the potential is reduced by having multiple training
opponents.

Pebbles trains against two opponents, and I will add others as Windows
builds become available. Is it true that Pachi was tuned against Fuego,
which was tuned against Mogo, which was tuned against GnuGo? If so, then
game theory suggests that tuning against all of them will make Pebbles less
susceptible to Exploitation and Avoidance defects. 
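
Here is a minimal sketch of how one might rotate among several training
opponents in the script that the tuner calls to play each game (my own
illustration: the command-line interface shown, the opponent commands, and
the play_game helper are assumptions, not Pebbles' actual setup):

    #!/usr/bin/env python
    # Hypothetical tuning script that rotates among training opponents,
    # so no single opponent can be over-exploited by the tuning.
    # Assumed interface: argv = [processor, seed, name1, value1, ...],
    # with the game result ("W", "L" or "D") printed to stdout.
    import sys

    OPPONENTS = ["gnugo --mode gtp", "fuego", "pachi"]  # placeholder commands

    def play_game(seed, opponent_cmd, params):
        # Placeholder: launch the engine with 'params', play one game
        # against 'opponent_cmd' using 'seed' to pick the starting
        # situation, and return "W", "L" or "D".
        raise NotImplementedError

    def main():
        seed = int(sys.argv[2])
        params = dict(zip(sys.argv[3::2], sys.argv[4::2]))
        # The seed also selects the opponent, so games are spread evenly.
        opponent = OPPONENTS[seed % len(OPPONENTS)]
        print(play_game(seed, opponent, params))

    if __name__ == "__main__":
        main()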

Using some self-play games in tuning should also help reduce both Avoidance
and Exploitation. I have not tested that because using self-play in CLOP
requires some additional features.


RESULTS
-------
Pebbles is winning almost 57% of its tuning games at present, just 2 months
after initiating CLOP tuning.

For comparison: for about 2 years I basically did not tune parameters at
all. Then I spent about 6 months trying to tune parameters using semi-manual
methods, which raised the win rate from 43% to 47%.

You can see why I am enthusiastic about this new way of working: 1/3 of the
time, and 3 times the benefit.

Note that finding bugs is a large part of the benefit. Not just finding, but
fixing and then tuning. All three processes are aided (i.e., faster and more
effective) by integrating them into an automated tuning loop.

Hope this helps,
Brian

-----Original Message-----
From: computer-go-boun...@dvandva.org On Behalf Of Rémi Coulom
Sent: Tuesday, January 03, 2012 10:19 AM
To: computer-go@dvandva.org
Subject: [Computer-go] win rate bias and CLOP

It is important to understand that CLOP claims very little in terms of win
rate. That is to say, the win rate estimates it reports are all biased. The
win rate over all samples underestimates the real win rate. Win rates near
the maximum (central and weighted) tend to be over-estimated.

CLOP finds the location in parameter space that has the highest win rate. It
may be the highest because it is the best, but also because it is the most
lucky. That's why it is necessarily biased toward optimistic values.

If the win rate over all samples is an improvement, then you can be sure you
have an improvement. Otherwise you cannot be sure unless you actually play a
lot of games with the suggested parameters.

Rémi

On 3 January 2012, at 14:09, Ingo Althöfer wrote:

> Hi David,
> 
> David Fotland on CLOP-optimization:
>> I tried it, but got no benefit so far. It claimed to find better
>> settings for most parameters, but when I used them the program wasn't
>> any stronger.
> 
> Interesting. Did it have similar strength, or did it even become weaker?
> How often did the move proposals by your "older" ManyFaces and the
> CLOP-MF differ?
> 
> Ingo.