Dear all,

First of all, thank you for the wonderful engine. I am the developer of 
OpenGammon and I use GnuBG to analyse matches (and positions) on the 
platform. I am currently working on a way to distribute the analysis, so that 
users can support OG by 'donating' compute time. This is done through a 
Docker image that runs an analysis client.

When a client asks for an analysis task, the server sends it the match file 
(SGF) and a configuration. Upon completion the client returns the SGF (with 
the analysis added), and the server does the rest of the processing.
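
For context, the client side boils down to a loop like the following (a 
simplified sketch; the endpoint names, fields and gnubg command sequence are 
illustrative, not the real API):

    import subprocess
    import requests

    SERVER = "https://example.org/api"   # placeholder, not the real endpoint

    def run_one_task():
        # Ask the server for work: it returns the SGF match file plus settings.
        task = requests.get(f"{SERVER}/task").json()
        with open("match.sgf", "w") as f:
            f.write(task["sgf"])

        # Let gnubg analyse the match non-interactively
        # (the exact command sequence here is illustrative).
        commands = "load match match.sgf\nanalyse match\nsave match analysed.sgf\n"
        subprocess.run(["gnubg", "--tty", "--quiet"],
                       input=commands, text=True, check=True)

        # Return the analysed SGF; the server does the rest of the processing.
        with open("analysed.sgf") as f:
            requests.post(f"{SERVER}/task/{task['id']}", data=f.read())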

Since the clients might not be 100% trustworthy, I implemented a system where 
clients can be served tasks that have already been analysed, to check whether 
their analysis is 'correct'. Simply put, this system takes the known analysis, 
compares it to the submitted analysis, and reports whether they are 
equivalent. For floats I naturally compare with an epsilon (abs(a - b) < 
epsilon) to allow for small differences due to floating-point operations. I 
initially set this epsilon to 0.001, which seemed plenty to absorb ordinary 
floating-point noise.
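
For concreteness, the check boils down to something like this (a simplified 
sketch; the real code walks the SGF analysis records, but the float 
comparison is the relevant part, and the names here are just illustrative):

    def floats_match(expected, actual, epsilon=0.001):
        # Absolute-difference check for a single analysis value.
        return abs(expected - actual) < epsilon

    def analyses_equivalent(known, submitted, epsilon=0.001):
        # 'known' and 'submitted' are flat lists of evaluation numbers
        # (equities, cube decisions, ...) extracted from the two SGF files.
        if len(known) != len(submitted):
            return False
        return all(floats_match(a, b, epsilon)
                   for a, b in zip(known, submitted))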

The problem I encounter is that if the client and server run on different CPU 
vendors/architectures (AMD, Intel, ARM), the results can vary wildly. I ran 
some tests, and between AMD and ARM even an epsilon of 0.020 still leaves a 
lot of (>50%) analyses flagged as 'corrupt'. I checked the gnubg versions, 
neural nets, bearoff databases, etc., and everything seems to be identical 
between the server and the client.

Locally everything works as expected (i.e. running the client and server on 
the same machine). I run the alpha server on a Raspberry Pi 5, and if I run 
the client on my MacBook the analysis works as expected. The production 
server runs on an AMD Ryzen CPU; with an Intel-based client verification 
fails, while with an AMD-based client (through AWS) everything works as 
expected.
 
As you can imagine, this is surprising behaviour. I would love to hear your 
thoughts, and if you need more information I am more than happy to help with 
debugging. I have a fair amount of experience with HPC, floating-point 
computation and SIMD. If you asked me, I would first look into the SIMD code 
paths, as they are the most likely place for slight differences between the 
implementations. It might also just be a configuration issue, but I am 
starting to doubt that.
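
To illustrate the kind of effect I have in mind: floating-point addition is 
not associative, so a dot product accumulated in a different order (4-wide 
SSE vs 8-wide AVX vs NEON lanes, or with FMA contraction) can legitimately 
differ in the last bits. A toy Python example:

    # The same three numbers, grouped differently, differ in the last bit.
    print((0.1 + 0.2) + 0.3)                        # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))                        # 0.6
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

Whether such last-bit differences can really compound through the neural net 
evaluation into deviations larger than 0.020 is something I have not 
verified; it is just where I would start looking.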

All the best,

Eran

