Dear all,

First of all, thank you for the wonderful engine. I am the developer of OpenGammon, and I am using GnuBG for the analysis of matches (and positions) on the platform. I am currently working on a way to distribute the analysis, so that users can support OG by 'donating' compute time. This is done through a Docker image that runs an analysis client.
When a client asks for an analysis task, the server serves it the match file (SGF) and a configuration. Upon completion the client returns the SGF (with analysis), and the server does the rest of the processing. Since the clients might not be 100% trustworthy, I implemented a system where clients can be served tasks that have already been analysed, to check whether their analysis is 'correct'. Simply put, this system takes the known analysis, compares it to the submitted analysis, and reports whether they are equivalent. For floats I naturally compare with an epsilon (abs(a - b) < epsilon) to allow for small differences due to floating-point operations. I initially set this epsilon to 0.001, which seemed to leave plenty of headroom for floating-point issues.

The problem I encounter is that if the client and server run on different CPU architectures (AMD/Intel/ARM), the results can differ wildly. I ran some tests, and between AMD and ARM even an epsilon of 0.020 still results in a lot (>50%) of 'corrupt' analyses. I checked the gnubg versions, neural nets, bearoff databases, etc., and everything appears to be identical between the server and the client. Locally (i.e. running the client and server on the same machine) everything works as expected. I run the alpha server on a Raspberry Pi 5, and if I run the client on my MacBook the analysis works as expected. The production server runs on an AMD Ryzen CPU; if I use an Intel-based client it fails, whereas an AMD-based client (through AWS) works as expected.

As you can imagine, this is surprising behaviour. I would love to hear your thoughts, and if you need more information I am more than happy to help with debugging. I have a fair amount of experience with HPC, floating-point computations, SIMD, etc. If you were to ask me, I would first look into SIMD, as that is the most likely place for slight differences between the different implementations. It might also just be a configuration issue, but I am starting to doubt that.

All the best,
Eran
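
P.S. To make the verification step concrete, here is a minimal Python sketch of the comparison I described above. The function names, the flat list of equities, and the EPSILON constant are illustrative only, not the actual OpenGammon code:

    # Illustrative sketch of the server-side equivalence check (not the real code).

    EPSILON = 0.001  # initial tolerance; even 0.020 still flags >50% across AMD/ARM

    def floats_equivalent(a: float, b: float, eps: float = EPSILON) -> bool:
        # Absolute-difference comparison, as described above: abs(a - b) < epsilon
        return abs(a - b) < eps

    def analyses_equivalent(known: list[float],
                            submitted: list[float],
                            eps: float = EPSILON) -> bool:
        # Compare the known (already analysed) values against a client's submission.
        if len(known) != len(submitted):
            return False
        return all(floats_equivalent(a, b, eps) for a, b in zip(known, submitted))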
