It's hard to automate XG's side in this hypothetical match-up. I've tried using the "computer use" features of AI models to automatically push buttons in the XG app. It doesn't work, because the AI needs to know what position is displayed on the screen, and current models are still really bad at counting checkers in an image of a board, so they can't reliably identify the position. Plus, it's super slow, and you need tens of thousands of games to get enough statistical accuracy to identify a real difference.

Instead, here's what I settled on. Let's call the bot you're testing (eg gnubg) "bot".

- Automatically play hundreds of bot-vs-bot games at whatever eval level you want to test out (eg 3-ply).
- For each game, write out a transcript to a file that XG can import. eg the format Backgammon Galaxy uses for its exports. Just a text file - easy to construct automatically.
- Now the manual XG step: in the XG desktop app, use the Batch Analyze feature to import all those hundreds of game files and analyze each one, writing out a .xg file next to each game transcript text file. (Maybe you can get the AI to automate this step using computer use? Anyways, it takes only a minute for a human to start this off.)
- Parse the XG analytics from each .xg file, then add up the total errors and decisions that XG finds to get an aggregate XG PR for your bot.

This will have XG score your bot's play, allowing you to quote an XG PR for it. But that assumes XG is "correct": that you can use its decisions as ground truth. That's not a good assumption if your bot is about as good as XG.

So, the next step is to identify the positions where your bot and XG differ, then roll them out. Then use the rollout results as the ground truth and compare both your bot and XG against them. Then you can see which bot is better.

Which bot performs the rollout? Presumably your own bot, not XG, since I don't know how to automate a string of XG rollouts. That might bias the results slightly in favor of your bot. In practice, however, I've seen most strong bots agree on most decisions in a proper rollout, so I don't think this introduces a significant bias.

The comparison results are stats from those rollouts: the fraction your bot got right, the fraction XG got right, the fraction neither got right, and average equity errors (vs the best rollout decision) for both bots.

It's not as clean a comparison as "play 100k money games head to head and identify the average edge," but it's the closest I've come up with.
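
To make the bookkeeping concrete, here is a minimal Python sketch of the two calculations described above: the aggregate PR and the rollout comparison stats. It is illustrative only, not my actual scripts. The `Disagreement` record, the function names, and the sample numbers are all made up for the example, and the PR formula (500 times total equity lost divided by the number of decisions counted) is my understanding of how XG defines PR, so treat it as an assumption. Parsing the .xg files and running the rollouts are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Disagreement:
    """One position where the bot and XG chose different plays (hypothetical record)."""
    rollout_equity: Dict[str, float]  # candidate play -> rolled-out equity
    bot_play: str
    xg_play: str


def aggregate_pr(total_error: float, decisions: int) -> float:
    """Aggregate PR, ASSUMING PR = 500 * (total equity lost / decisions counted)."""
    if decisions <= 0:
        raise ValueError("need at least one decision")
    return 500.0 * total_error / decisions


def compare_to_rollouts(positions: List[Disagreement], tol: float = 1e-9) -> Dict[str, float]:
    """Score both bots against the best rolled-out play in each position."""
    if not positions:
        raise ValueError("no positions to compare")
    n = len(positions)
    bot_right = xg_right = neither = 0
    bot_loss = xg_loss = 0.0
    for p in positions:
        best = max(p.rollout_equity.values())          # rollout is the ground truth
        bot_err = best - p.rollout_equity[p.bot_play]  # equity given up by the bot
        xg_err = best - p.rollout_equity[p.xg_play]    # equity given up by XG
        bot_right += bot_err <= tol
        xg_right += xg_err <= tol
        neither += bot_err > tol and xg_err > tol
        bot_loss += bot_err
        xg_loss += xg_err
    return {
        "bot_right": bot_right / n,
        "xg_right": xg_right / n,
        "neither_right": neither / n,
        "bot_avg_error": bot_loss / n,
        "xg_avg_error": xg_loss / n,
    }


if __name__ == "__main__":
    # Made-up equities, purely to show the shape of the input.
    sample = [
        Disagreement({"24/18 13/11": 0.10, "13/7": 0.05}, "24/18 13/11", "13/7"),
        Disagreement({"8/5 6/5": 0.26, "24/21 13/11": 0.20}, "24/21 13/11", "8/5 6/5"),
    ]
    print(compare_to_rollouts(sample))
    print(aggregate_pr(total_error=1.2, decisions=100))
```

The rollout equities are keyed by play so that a position where neither bot found the best rolled-out play can still be scored: both bots are charged the gap to the best candidate, not the gap to each other.
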

Does anyone have a better, practical, and credible way to compare against XG?

On Wed, May 27, 2026 at 12:45 AM MK <[email protected]> wrote:

> A recent PRgammon discussion in bgonline, (see:
> https://www.bgonline.org/forums/webbbs_config.pl?read=219985), nudged me
> to post on a subject that has been on my mind for many years.
>
> In how the strengths of bots have been compared, by only looking at the
> positions they play differently and letting one of the bots (the one
> already assumed to be stronger) decide which is a better play, I see the
> same problem with PRgammon.
>
> This approach totally ignores that, if played out, the rest of the game
> will enfold differently after that point on. If a play is the better play
> because it results in more wins and if GNUBG ends up winning the game
> after its "inferior play" (according to XG) then it will have made the
> better play indeed.
>
> I play against both bots and over the years I observed that I do visibly
> better against XG than GNUBG. Either GNUBG cheats better than XG ;) or
> there may be other explanations for it such as that playing "style" does
> indeed matter, (as it was mentioned regarding PRgammon).
>
> If we accept that equally strong players can have different styles and if
> "styles" can define certain "strategies" of playing the same positions
> consistently differently, then we have to accept that there are more than
> just one best play at least in different types of games, (i.e. backgames)
> and also in games between players of different "style vs another style".
>
> With all this, I personally believe that GNUBG is stronger than XG and
> that this would be proven if the bots were made to play a large number of
> actual games and matches, and comparing wins/losses instead of PR's.
>
> I think this can be done using GNUBG's Python scripting to make
> decisions/moves for both bots. Assuming key strokes can be passed to XG
> to make it play, its before and after position ID's can be read by GNUBG
> and its play duplicated. I haven't really looked into this. Has anyone
> else thought about this and has any ideas/suggestions on how to go about
> it? If done, I think the result of such a comparison can/will be a "game
> changer".
>
> MK
