It's hard to automate XG's side in this hypothetical match-up.

I've tried using the "computer use" features of AI models to push buttons
in the XG app automatically. It doesn't work: the model needs to know what
position is displayed on the screen, and these models are still really bad
at counting checkers in an image of a board, so they can't reliably
identify the position. Plus, it's super slow, and you need tens of
thousands of games to get enough statistical accuracy to identify a real
difference.
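
To put a rough number on "tens of thousands", here is a back-of-envelope
sketch. It assumes the per-game result has standard deviation sigma and asks
how many games it takes for z standard errors of the mean to shrink to the
edge you want to detect. The function name and both input values are my own
placeholders, not measured figures; plug in your own.

```python
import math

def games_needed(sigma, edge, z=1.96):
    """Games needed so that z standard errors of the mean equal `edge`.

    sigma: assumed standard deviation of points won per game (placeholder)
    edge:  smallest average points-per-game difference you want to detect
    z:     normal quantile; 1.96 corresponds to a two-sided 95% interval
    """
    if sigma <= 0 or edge <= 0:
        raise ValueError("sigma and edge must be positive")
    return math.ceil((z * sigma / edge) ** 2)

# Placeholder inputs, chosen only to illustrate the scale of the problem.
print(games_needed(sigma=3.0, edge=0.05))
```

The required count grows with the square of sigma/edge, so halving the edge
you want to detect quadruples the number of games.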

Instead, here's what I settled on. Let's call the bot you're testing (e.g.
gnubg) "bot".

   - Automatically play hundreds of bot-vs-bot games at whatever eval level
   you want to test (e.g. 3-ply).
   - For each game, write out a transcript to a file that XG can import,
   e.g. in the format Backgammon Galaxy uses for its exports. It's just a
   text file, so it's easy to construct automatically.
   - Now the manual XG step: in the XG desktop app, use the Batch Analyze
   feature to import all those hundreds of game files and analyze each one,
   writing out a .xg file next to each game transcript text file. (Maybe you
   can get the AI to automate this step using computer use? Anyway, it takes
   a human only a minute to start this off.)
   - Parse the XG analysis from each .xg file, then add up the total
   errors and decisions that XG finds to get an aggregate XG PR for your bot.
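
The last step can be sketched as follows. This assumes the per-game totals
have already been parsed out of the .xg files (the parsing is not shown),
and it assumes the usual convention that PR is 500 times the average equity
error per decision. GameAnalysis, PR_SCALE and aggregate_pr are names I made
up for illustration, and the sample numbers are invented.

```python
from dataclasses import dataclass

@dataclass
class GameAnalysis:
    """Totals taken from one analyzed game (hypothetical container)."""
    total_error: float  # summed equity lost over the bot's decisions
    decisions: int      # number of decisions XG counted for the bot

# Assumption: PR = 500 * (total equity error / number of decisions).
PR_SCALE = 500.0

def aggregate_pr(games):
    """Pool errors and decisions across games, then convert to a single PR."""
    total_error = sum(g.total_error for g in games)
    decisions = sum(g.decisions for g in games)
    if decisions == 0:
        raise ValueError("no decisions to score")
    return PR_SCALE * total_error / decisions

# Invented sample data for three games.
games = [
    GameAnalysis(total_error=0.210, decisions=42),
    GameAnalysis(total_error=0.095, decisions=31),
    GameAnalysis(total_error=0.300, decisions=55),
]
print(f"aggregate PR: {aggregate_pr(games):.2f}")
```

Pooling the totals before dividing weights every decision equally. Averaging
the per-game PRs instead would give short games too much weight.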

This will have XG score your bot's play, allowing you to quote an XG PR for
it. But that assumes XG is "correct": that you can use its decisions as
ground truth. That's not a good assumption if your bot is about as good as
XG.

So, the next step is to identify the positions where your bot and XG
differ and roll them out. Use the rollout results as the ground truth,
compare both your bot and XG against them, and then you can see which
bot is better.

Which bot performs the rollout? Presumably your own bot, not XG, since I
don't know how to automate a string of XG rollouts. That might bias the
results slightly in favor of your bot. In practice, however, I've seen most
strong bots agree on most decisions in a proper rollout, so I don't think
this introduces a significant bias.

The comparison results are stats from those rollouts: the fraction your bot
got right, the fraction XG got right, the fraction neither got right, and
average equity errors (vs the best rollout decision) for both bots.
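
As a minimal sketch of those stats: assume each disputed position has been
reduced to a mapping from candidate play to rollout equity, plus the play
each bot chose. RolledOutPosition and compare are hypothetical names, the
plays are just labels, and the equities are invented.

```python
from dataclasses import dataclass

@dataclass
class RolledOutPosition:
    """One position where the two bots chose different plays (hypothetical)."""
    rollout_equity: dict  # candidate play -> equity from the rollout
    bot_play: str         # play chosen by the bot under test
    xg_play: str          # play chosen by XG

def compare(positions):
    """Score both bots against the best rollout play in each position."""
    n = len(positions)
    if n == 0:
        raise ValueError("no positions to compare")
    bot_right = xg_right = neither = 0
    bot_error = xg_error = 0.0
    for p in positions:
        best = max(p.rollout_equity.values())
        b = best - p.rollout_equity[p.bot_play]  # equity the bot gave up
        x = best - p.rollout_equity[p.xg_play]   # equity XG gave up
        bot_error += b
        xg_error += x
        bot_right += (b == 0.0)
        xg_right += (x == 0.0)
        neither += (b > 0.0 and x > 0.0)
    return {
        "bot_right": bot_right / n,
        "xg_right": xg_right / n,
        "neither_right": neither / n,
        "bot_avg_error": bot_error / n,
        "xg_avg_error": xg_error / n,
    }

# Invented sample data: three disputed positions.
positions = [
    RolledOutPosition({"A": 0.10, "B": 0.06}, bot_play="A", xg_play="B"),
    RolledOutPosition({"A": -0.20, "B": -0.15}, bot_play="A", xg_play="B"),
    RolledOutPosition({"A": 0.30, "B": 0.28, "C": 0.33}, bot_play="A", xg_play="B"),
]
for name, value in compare(positions).items():
    print(f"{name}: {value:.3f}")
```

This treats the rollout equities as exact. In a real run you would also want
to look at each rollout's confidence interval before calling a play "right".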

It's not as clean a comparison as "play 100k money games head to head and
identify the average edge," but it's the closest I've come up with.

Does anyone have a better, practical, and credible way to compare against
XG?


On Wed, May 27, 2026 at 12:45 AM MK <[email protected]> wrote:

> A recent PRgammon discussion in bgonline, (see:
> https://www.bgonline.org/forums/webbbs_config.pl?read=219985), nudged me
> to post on a subject that
> has been on my mind for many years.
>
> In how the strengths of bots have been compared, by only looking at the
> positions they play
> differently and letting one of the bots (the one already assumed to be
> stronger) decide which is a
> better play, I see the same problem with PRgammon.
>
> This approach totally ignores that, if played out, the rest of the game
> will unfold differently
> after that point on. If a play is the better play because it results in
> more wins and if GNUBG ends
> up winning the game after its "inferior play" (according to XG) then it
> will have made the better
> play indeed.
>
> I play against both bots and over the years I observed that I do visibly
> better against XG than
> GNUBG. Either GNUBG cheats better than XG ;) or there may be other
> explanations for it such as that
> playing "style" does indeed matter, (as it was mentioned regarding
> PRgammon).
>
> If we accept that equally strong players can have different styles and if
> "styles" can define
> certain "strategies" of playing the same positions consistently
> differently, then we have to accept
> that there is more than one best play, at least in different types of
> games, (i.e. backgames)
> and also in games between players of different "style vs another style".
>
> With all this, I personally believe that GNUBG is stronger than XG and
> that this would be proven if
> the bots were made to play a large number of actual games and matches, and
> comparing wins/losses
> instead of PR's.
>
> I think this can be done using GNUBG's Python scripting to make
> decisions/moves for both bots.
> Assuming key strokes can be passed to XG to make it play, its before and
> after position ID's can be
> read by GNUBG and its play duplicated. I haven't really looked into this.
> Has anyone else thought
> about this and has any ideas/suggestions on how to go about it? If done, I
> think the result of such
> a comparison can/will be a "game changer".
>
> MK
>
>
>
