This is in response to a few posts about the "self-test" effect in Elo
rating tests.

I'll start by claiming right up front that, for certain types of
programs, I don't believe this is something we have to worry unduly
about.  I'll explain why I feel that way in a moment.

One general observation is that if you test 2 programs like we are
doing, even though they are different programs with different authors,
one has to admit that they have much in common.  They are both based on
Monte Carlo simulations, and they both have a global tree search
component that is critical to their success.  Although the details
differ and one implementation is superior to the other, they are the
same basic type of program.  So one might conclude this is a big
self-test experiment and not fully valid for that reason.

In the current test, Mogo stands alone at the top; FatMan fails to
give it any competition at the upper levels, for whatever reason.  So
one might also conclude this is a big Mogo self-test of scalability.

It's natural to ask the question, "if Mogo continues to show
improvement with increasing power, is it really stronger, or is it just
stronger against lower-powered versions of itself?"  Another way to
ask this question is, "does the apparent improvement hold against other
programs?"  Or at least, does it hold to the same extent?

It's possible to take a given program such as Gnugo and build a
program designed solely to beat it.  In fact, some have claimed or
proposed that by tuning their program against Gnugo, they have
succeeded in making it play really well against Gnugo, but have
improved it very little against other programs.  I think David Doshay
has made this assertion, or something very similar to it, since his
program uses Gnugo as a plausible-move generator and evaluation
function.

This is a real effect which I don't question.  I have done this
myself.  It happens when you tune your improvement against a specific
program's weaknesses (and/or strengths).  You can make a program
actually play weaker in general, but stronger against a specific
opponent, by specially tuning it a certain way.  You can do this by
making it ignore things the other program cannot take advantage of.  I
used to do that myself in chess: just for fun, I could beat a certain
weak player every time in less than 10 moves, but to do it I had to
actually play moves that were dubious at best.

But this is not the same as building a program that is scalable using
sound principles.  It's one thing to tune a program in a very specific
way to do some things better (usually by sacrificing one of its
strengths to some extent), but it's quite another to build a program
with a general mechanism that improves its play in all areas without
sacrificing any skill in another area.

A thought experiment.  Suppose a version of Mogo at a very high level
setting can beat a lower level setting of the same Mogo 99% of the time.
Is it likely this improvement is artificial and wouldn't apply to
its games against other programs?  No.  This is a real improvement,
because it's not based on making the program weaker in some area to take
advantage of a specific flaw.  Is it possible that it doesn't scale
quite as much against other programs?  Yes, that is likely.  When a
program thinks exactly like another program, just deeper, it's almost
as if it can read the other program's mind.  In computer chess, if you
look 2 ply deeper against an identical program, you not only see exactly
what it "will see" on the next move, you also see an additional move
deeper.  This gives an advantage because if you overlook something
important, so did the other program!  You won't miss something it can
beat you over the head with.
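The 99% figure above can be translated into a rating gap with the
standard Elo logistic model.  A minimal sketch (the 80% cross-play
figure is purely an illustrative assumption, not a measurement from
the test discussed here):

```python
import math

def elo_diff(win_rate: float) -> float:
    """Elo difference implied by a win rate, under the standard
    logistic model: P(win) = 1 / (1 + 10 ** (-diff / 400))."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# A 99% self-play win rate implies an enormous apparent rating gap:
print(round(elo_diff(0.99)))  # 798

# If the same version scored, say, only 80% against an unrelated
# program (hypothetical number), the cross-play gap would be smaller:
print(round(elo_diff(0.80)))  # 241
```

The point of the sketch is just that a given self-play win rate pins
down an apparent Elo gap, and the question in the post is how much of
that gap survives against independent opponents.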

Nevertheless, you can only stretch this so far.  If your superiority
with a properly scalable program is substantial, it will show up as
a substantial improvement against any other program too.  If your
superiority against a specific opponent is based on specific trickery
(tuning) to beat that specific opponent, then it may not translate to
other opponents.  It's like a fighter who leaves himself open to a
left hook because he knows his opponent doesn't have one: a strategy
that is unsound against another opponent, but not against the one he
faces now.

So my assertion is that scalability based on sound principles is more or
less universal, with perhaps a small amount of self-play distortion, but
nothing to get too excited about.

- Don
      



_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
