I don't want any changes. I was simply struck by the scale of the variation,
and thought that this audience would benefit from seeing the details.

 

My offline testing does control the openings and the identities of the
opponents, and ensures that each player plays each side of each opening. I
am able to reliably detect 1% changes with about 1 day of elapsed time. This
is good enough for measuring progress.
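
For context, a minimal sketch of the standard power calculation, assuming
"1%" refers to winning rate and that games are independent (a rough
illustration, not the actual test harness):

    import math

    def games_needed(p=0.5, delta=0.01, z=1.96):
        """Independent games needed before a winning-rate shift of `delta`
        exceeds z standard errors of the observed mean."""
        sigma = math.sqrt(p * (1.0 - p))  # std dev of a single game result
        return (z * sigma / delta) ** 2

    print(round(games_needed()))  # ~9604 games near a 50% winning rate

Playing both colors of each opening pairs the games, which reduces variance
and is part of why such a design converges quickly.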

 

Brian

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Don Dailey
Sent: Monday, August 29, 2011 10:41 AM
To: [email protected]
Subject: Re: [Computer-go] Rating stability study

 

I have no doubt there are non-transitive rating relationships between
programs, but it's difficult to improve on that problem without more control
over scheduling. It is far worse on servers where you get to pick and choose
your opponents. On the chess servers human players have been able to jack
their ratings up to ridiculous levels by picking and choosing who they play.

 

There are several potential improvements, but they all have some undesirable
issues associated with them. I have considered some of them in the past.

 

One is to choose the openings for the programs. This eliminates one source
of intransitivity, but I don't think anyone wants the server to choose their
openings for them. In the chess world rating lists are computed this way and
it works well - it ensures that programs play a wide variety of different
games and that games don't repeat.
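
As a sketch of that scheme (hypothetical code, not something CGOS does
today): give every pairing every opening twice, once with each color, so the
opening choice cancels out of the head-to-head score. In Python:

    import itertools

    def schedule(players, openings):
        """Every ordered (black, white) pair meets in every opening, so each
        program plays both colors of each opening against each opponent."""
        return [(black, white, op)
                for black, white in itertools.permutations(players, 2)
                for op in openings]

    games = schedule(["Pebbles", "Fuego", "Mogo"], range(50))
    print(len(games))  # 3 pairings x 2 colors x 50 openings = 300 games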

 

Another idea is to reduce the influence on the rating of multiple games
against the same opponent. If your rating is based on 90% of the games being
played against a single opponent, the other 10% should get much more weight.
This is difficult to manage, however, and does not guard against an opponent
simply being renamed, or against playing several opponents who play in a
very similar way. And also, it's just odd.
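
A sketch of one such down-weighting rule (purely hypothetical), which also
shows the renaming loophole: weight each game inversely to how many games
were played against that opponent, so every distinct opponent carries equal
total weight.

    from collections import Counter

    def opponent_weights(opponents):
        """Weight each game by 1/(games vs that opponent), so one dominant
        pairing cannot dominate the rating."""
        counts = Counter(opponents)
        return [1.0 / counts[name] for name in opponents]

    # 9 games vs A and 1 game vs B end up with equal total influence:
    print(opponent_weights(["A"] * 9 + ["B"]))
    # ...but renaming "A" to "A2" halfway through would double its weight.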

 

Really what you want in general is to play the widest variety of opponents
possible - but that is usually difficult to arrange.  

 

There is no general solution, because the Elo formula (and every ranking
formula that I know of) is based on assumptions that are not correct: that
playing strength can be represented by a single number, and that ratings
are transitive.
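
For reference, the Elo prediction depends only on the rating difference,
which is exactly what builds transitivity into the model:

    def elo_expected(r_a, r_b):
        """Elo's predicted score for A against B; depends on r_a - r_b only."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    # If A is 100 points above B and B is 100 above C, the model forces A to
    # be 200 above C -- a cycle simply cannot be represented:
    print(elo_expected(2400, 2300))  # ~0.64
    print(elo_expected(2400, 2200))  # ~0.76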

 

Don

On Mon, Aug 29, 2011 at 10:14 AM, Brian Sheppard <[email protected]> wrote:

Pebbles' rating on 9x9 CGOS swung wildly last week, from 2310 up to 2400 and
down to 2340 or so. The magnitude of the swing seemed odd to me, so I
investigated.

It turns out that Pebbles' rating depends heavily on whom it is playing.

Versions of Valkyria and Mogo had the following winning percentages against
Pebbles:

       Valkyria3.5.9_P4Bx (2641)       138/180 = 76.7%
         (some games against Valkyria were played before last week)
       _Mogo3MC90K,p (2393)            8/20    = 40.0%
       _Mogo3MC30K,p (2274)            21/86   = 24.4%

Pebbles' combined performance rating against these programs was 2447, on 286
games.

In contrast, Fuego and Aya won very high percentages against Pebbles:

       Fuego-1502M-1c (2621)   14/16   = 87.5%
       Fuego-1502-1c (2602)    26/28   = 92.9%
       Fuego-0.4.1,p (2492)    21/24   = 87.5%
       Aya727j_10k (2291)      32/53   = 60.4%

The combined performance rating of Pebbles in these games is just 2207, on
121 games.
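
For anyone who wants to reproduce these numbers: a performance rating is the
rating at which the Elo-expected score against the listed opponents equals
the actual score. A bisection sketch in Python (the exact method the server
uses may differ slightly):

    def expected(r, opp):
        return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

    def performance_rating(results, lo=1000.0, hi=3500.0):
        """results = [(opponent rating, wins, games)]; bisect for the rating
        whose Elo-expected total score equals the actual score."""
        actual = sum(wins for _, wins, _ in results)
        for _ in range(60):
            mid = (lo + hi) / 2.0
            exp = sum(g * expected(mid, opp) for opp, _, g in results)
            lo, hi = (mid, hi) if exp < actual else (lo, mid)
        return (lo + hi) / 2.0

    # Pebbles' wins are the complements of the tables above:
    group1 = [(2641, 42, 180), (2393, 12, 20), (2274, 65, 86)]
    group2 = [(2621, 2, 16), (2602, 2, 28), (2492, 3, 24), (2291, 21, 53)]
    print(round(performance_rating(group1)))  # ~2448, close to the 2447 above
    print(round(performance_rating(group2)))  # ~2208, close to the 2207 above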

There is essentially zero probability that a 2207-level player will perform
at 2447 for 286 games, or that a 2447-level player will perform at 2207 for
121 games. But note that there is hard-to-quantify selection bias in the way
these ratings are calculated.

Pebbles' run from 2310 to 2400 coincided with a period when no versions of
Fuego or Aya were running, but two Mogos and one Valkyria were. Then four
versions of Fuego plus an Aya logged in, and Pebbles' rating dropped fast.

A few years ago I had the impression that Pebbles did relatively well
against Fuego, but could not win a game against Mogo. So this relationship
has changed over time.

I haven't tuned Pebbles to play well against any specific opponents. I doubt
that anyone is targeting Pebbles.

It is clear that some programs frequently play specific openings. If Pebbles
plays an opening badly, then it will lose a lot of games. The result could
also arise from differences in understanding that push the game in a
specific direction.

One notable non-transitive relationship is that

       _Mogo3MC90K,p won 70% against Aya727j_10k
         (only 20 games, but the result is consistent with the ratings)
       Aya727j_10k won 60% against Pebbles
       Pebbles won 75% against _Mogo3MC90K,p
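
A quick way to see that no single set of Elo ratings can fit this cycle:
each winning percentage implies a rating gap of 400*log10(p/(1-p)), and
around a cycle those gaps must sum to zero. Using the percentages above:

    import math

    def elo_gap(p):
        """Rating gap implied by winning fraction p under the Elo model."""
        return 400.0 * math.log10(p / (1.0 - p))

    # Mogo -> Aya -> Pebbles -> Mogo: the gaps should cancel, but instead:
    print(sum(elo_gap(p) for p in (0.70, 0.60, 0.75)))  # ~ +408 Elo, not 0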

My conclusion is that I am glad that I stopped using CGOS ratings as a way
to measure short-term progress. At a minimum I would have to re-weight games
to represent a consistent opposition profile.
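
One possible reading of "re-weight games" (a sketch only, not an existing
CGOS feature): fix a target mix of opponents and scale each game so the
weighted opposition matches that mix:

    from collections import Counter

    def reweight(games, target_share):
        """Scale each game so every opponent contributes its target share of
        the total weight, however often it actually logged in."""
        counts = Counter(opp for opp, _ in games)
        return [(opp, result, target_share[opp] / counts[opp])
                for opp, result in games]

    # Force Fuego and Mogo to count equally even if one played 4x as often:
    games = [("Fuego", 0)] * 40 + [("Mogo", 1)] * 10
    weighted = reweight(games, {"Fuego": 0.5, "Mogo": 0.5})
    print(sum(w for _, _, w in weighted))  # 1.0, split 50/50 by opponent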

Brian


_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go

 
