I have no doubt there are non-transitive rating relationships between
programs, but it's difficult to improve on that problem without more
control over scheduling. It is far worse on servers where you get to pick
and choose your opponents. On the chess servers, human players have been
able to jack their ratings up to ridiculous levels by picking and choosing
whom they play.

There are several potential improvements, but they all have some
undesirable issues associated with them. I have considered some of them in
the past.

One is to choose the openings for the programs. This eliminates one source
of intransitivity, but I don't think anyone wants the server to choose
their openings for them. In the chess world, rating lists are computed
this way and it works well: it ensures that programs play a wide variety
of different games and that games don't repeat.

Another idea is to reduce the influence of multiple games against the same
opponent on the rating. If 90% of your rated games are against a single
opponent, the other 10% should get much more weight. This is difficult to
manage, however, and it does not guarantee that an opponent has not simply
been renamed, or that you are playing several opponents who all play in a
very similar way. And also, it's just odd.
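As a sketch of how such down-weighting might work (this is my own
illustration, not anything CGOS actually does - the `reweighted_win_rate`
helper and the toy game list are hypothetical):

```python
from collections import Counter

def reweighted_win_rate(games):
    """Weight each game by 1 / (games vs. that opponent), so every
    distinct opponent contributes equally to the estimate.
    `games` is a list of (opponent_name, result) with result 0 or 1."""
    counts = Counter(opp for opp, _ in games)
    total_weight = 0.0
    weighted_wins = 0.0
    for opp, result in games:
        w = 1.0 / counts[opp]   # a game vs. a frequent opponent counts less
        total_weight += w
        weighted_wins += w * result
    return weighted_wins / total_weight

# 10 games vs. "A" (9 wins) dominate 1 loss vs. "B":
games = [("A", 1)] * 9 + [("A", 0)] + [("B", 0)]
# Raw win rate is 9/11 ~ 0.82; reweighted it is (0.9 + 0.0) / 2 = 0.45,
# because the lone game against "B" now carries as much weight as all
# ten games against "A" combined.
print(reweighted_win_rate(games))
```

As the example shows, a single renamed or similar opponent can still
dominate half the estimate, which is the weakness described above.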

Really, what you want in general is to play the widest possible variety of
opponents - but that is usually difficult to arrange.

There is no general solution, because the Elo formula (and every rating
formula I know of) is based on an assumption that is not correct: that
playing strength can be represented by a single number, and hence that
ratings are transitive.
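To make that assumption concrete: Elo maps a rating difference to an
expected score, so the rating gaps implied by any cycle of observed win
rates should sum to roughly zero. Plugging in the 70%/60%/75% cycle Brian
reports below shows how far off that is (a small illustrative sketch; the
helper names are mine):

```python
import math

def elo_expected(r_a, r_b):
    """Expected score of player A vs. player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def rating_gap(win_rate):
    """Rating difference implied by an observed win rate
    (the inverse of elo_expected)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# Mogo beats Aya 70%, Aya beats Pebbles 60%, Pebbles beats Mogo 75%.
# If strength were one number, these gaps would sum to about zero.
cycle = rating_gap(0.70) + rating_gap(0.60) + rating_gap(0.75)
print(round(cycle))  # roughly +400 Elo of "excess" that no single
                     # rating per program can absorb
```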

Don

On Mon, Aug 29, 2011 at 10:14 AM, Brian Sheppard <[email protected]> wrote:

> Pebbles' rating on 9x9 CGOS swung wildly last week, from 2310 up to 2400
> and down to 2340 or so. The magnitude of the swing seemed odd to me, so I
> investigated.
>
> It turns out that Pebbles' rating depends heavily on whom it is playing.
>
> Versions of Valkyria and Mogo had the following winning percentages against
> Pebbles:
>
>        Valkyria3.5.9_P4Bx (2641)       138/180 = 76.7% (Some games against
> Valkyria were played before last week)
>        _Mogo3MC90K,p (2393)            8/20 = 40.0%
>        _Mogo3MC30K,p (2274)            21/86 = 24.4%
>
> Pebbles' combined performance rating against these programs was 2447, on
> 286 games.
>
> In contrast, Fuego and Aya won very high percentages against Pebbles:
>
>        Fuego-1502M-1c (2621)   14/16   = 87.5%
>        Fuego-1502-1c (2602)    26/28   = 92.9%
>        Fuego-0.4.1,p (2492)    21/24   = 87.5%
>        Aya727j_10k (2291)      32/53   = 60.4%
>
> The combined performance rating of Pebbles in these games is just 2207, on
> 121 games.
>
> There is zero probability that a 2207 player will play at 2447 level for
> 286 games, or that a 2447 player will play at 2207 level for 121 games.
> But note that there is hard-to-quantify selection bias in the way these
> ratings are calculated.
>
> Pebbles' run from 2310 to 2400 coincided with a period when no versions
> of Fuego or Aya were running, but two Mogos and one Valkyria were. Then
> four versions of Fuego plus an Aya logged in, and Pebbles' rating dropped
> fast.
>
> A few years ago I had the impression that Pebbles did relatively well
> against Fuego, but could not win a game against Mogo. So this relationship
> has changed over time.
>
> I haven't tuned Pebbles to play well against any specific opponents. I
> doubt that anyone is targeting Pebbles.
>
> It is clear that some programs frequently play specific openings. If
> Pebbles plays an opening badly, then it would lose a lot. The result
> could also arise from differences in understanding that push the game in
> a specific direction.
>
> One notable non-transitive relationship is that
>
>        _Mogo3MC90K,p won 70% against Aya727j_10k (note: only 20 games, but
> result is consistent with rating)
>        Aya727j_10k won 60% against Pebbles
>        Pebbles won 75% against _Mogo3MC90K,p
>
> My conclusion is that I am glad that I stopped using CGOS ratings as a
> way to measure short-term progress. At a minimum I would have to
> re-weight games to represent a consistent opposition profile.
>
> Brian
>
>
> _______________________________________________
> Computer-go mailing list
> [email protected]
> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
>