On Tue, Jan 28, 2025 at 11:45:11AM +0200, Eran Lambooij wrote:
> [...] For floats I naturally use an epsilon
> to compare (abs(a-b) < epsilon) to allow for small differences due to
> floating point operations. I initially set this epsilon to 0.001,
> which seemed to be plenty to allow for floating point problems.
>
> The problem I encounter is that if the client and server run on
> different CPU architectures (AMD/Intel/ARM) the result can vary
> wildly. I ran some tests, and between AMD and ARM using an epsilon of
> 0.020 results in a lot of (>50%) 'corrupt' analyses. I checked the
> gnubg versions, neural nets, bearoff databases, etc and everything
> seems to be the same between the server and the client.
Could you explain which a and b you compare? Only the final "Error
total EMG" values? Or more of them, up to the equity of every best
move for instance, declaring the analysis corrupt if any of them
differ too much?
The factors you mention below, and some others, will indeed lead to
different results. In the case of the final error total I would guess
that a discrepancy of more than 0.001 is very common but one of more
than 0.020 quite rare.
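For what it's worth, a single fixed epsilon is fragile for this kind
of check; combining an absolute and a relative tolerance is usually
more robust. A minimal sketch (the function name and parameters are
mine, not anything in gnubg):

#include <math.h>
#include <stdbool.h>

/* Hypothetical helper, not gnubg code: accept either a small
   absolute difference or a small difference relative to the larger
   magnitude, so bigger equities get proportionally more slack. */
static bool
equities_match(float a, float b, float abs_eps, float rel_eps)
{
    float diff = fabsf(a - b);

    return diff <= abs_eps
        || diff <= rel_eps * fmaxf(fabsf(a), fabsf(b));
}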
> Locally everything works as expected (so running the client and server
> on the same machine). I run the alpha server on a Raspberry Pi 5,
> and if I run the client on my MacBook the analysis works as
> expected. The production server runs on an AMD Ryzen CPU, and if I
> use an Intel-based client it fails; if I use an AMD-based client
> (through AWS) everything
> works as expected.
>
> As you can imagine this is surprising behaviour. I would love to hear
> your thoughts, and if you need more information I am more than happy
> to help with debugging. I have quite some knowledge of HPC, working
> with floating-point computations as well as SIMD, etc. If you asked
> me, I would first look into SIMD, as that is the most likely source
> of slight differences between the implementations. It might also
> just be a configuration issue, but I am starting to doubt that.
Scalar code vs. 4-way SIMD (SSE2 or NEON) vs. 8-way AVX will cause
small differences: the additions in the reductions are associated in
a different order, so the rounding differs.
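A toy illustration, not gnubg code: summing the same values with 1, 4
or 8 partial accumulators, as scalar, SSE2/NEON and AVX reductions
effectively do, groups the additions differently and can change the
last bits of the result.

#include <stdio.h>

/* Sum n floats using 'lanes' partial accumulators, mimicking how a
   'lanes'-wide SIMD reduction associates the additions. */
static float
sum_lanes(const float *x, int n, int lanes)
{
    float acc[8] = { 0 }, total = 0.0f;
    int i;

    for (i = 0; i < n; i++)
        acc[i % lanes] += x[i];
    for (i = 0; i < lanes; i++)
        total += acc[i];
    return total;
}

int
main(void)
{
    float x[16];
    int i;

    for (i = 0; i < 16; i++)
        x[i] = 1.0f / (float) (i + 1);

    printf("scalar: %.9g\n", sum_lanes(x, 16, 1));
    printf("4-way:  %.9g\n", sum_lanes(x, 16, 4));
    printf("8-way:  %.9g\n", sum_lanes(x, 16, 8));
    return 0;
}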
Another factor is that by default gnubg is built with the -ffast-math
option. You could try building it without that option and see how
much it helps accuracy. It would be slower, but not by much.
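I haven't checked where exactly the build adds the flag, so treat
this as the usual first attempt; with gcc the last of
-ffast-math/-fno-fast-math on the command line wins, but if configure
appends -ffast-math after the user CFLAGS you would have to patch
configure.ac instead:

./configure CFLAGS="-O2 -fno-fast-math"
make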
In the case of ARM vs. x86 you may want to look at this block of code
in lib/neuralnetsse.c, add one more "rec = vmulq_f32(...);"
refinement step and see if it helps. If you're familiar with
numerical computation on these platforms you may even already know
the answer to the question in the comment.
/* TODO: Check how many Newton-Raphson iterations are needed to match
   x86 rcp and div accuracy */
#ifdef __FAST_MATH__
    /* low-precision reciprocal estimate (about 8 bits) plus one
       Newton-Raphson refinement step */
    rec = vrecpeq_f32(x1);
    return vmulq_f32(vrecpsq_f32(x1, rec), rec);
#else
    /* estimate plus two Newton-Raphson refinement steps */
    rec = vrecpeq_f32(x1);
    rec = vmulq_f32(vrecpsq_f32(x1, rec), rec);
    return vmulq_f32(vrecpsq_f32(x1, rec), rec);
#endif
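Concretely, since default builds define __FAST_MATH__, the experiment
would be to give that branch a second refinement step (untested):

rec = vrecpeq_f32(x1);
rec = vmulq_f32(vrecpsq_f32(x1, rec), rec);   /* added iteration */
return vmulq_f32(vrecpsq_f32(x1, rec), rec);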
Another source of differences is that the libc qsort() function is
used for sorting moves. It may differ slightly from OS to OS and is
not a stable sort anyway.
This may lead to different choices of moves when equities are exactly
equal (in late bearoff, or earlier if moves are followed by a
double/pass).
It might be useful to use another, stable, algorithm; it may even be
faster if it is better suited to the short arrays of moves it would
be applied to.
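For example, a stable insertion sort along these lines (the
moverecord type and rEquity field here are placeholders, not gnubg's
actual move structure) keeps equal-equity moves in a deterministic
order and is typically at least as fast as qsort() on short arrays:

typedef struct {
    float rEquity;
    /* ... other move fields ... */
} moverecord;

/* Sort moves by descending equity. The strict '<' in the inner loop
   never moves an element past an equal one, so the sort is stable. */
static void
sort_moves(moverecord *moves, int n)
{
    int i, j;

    for (i = 1; i < n; i++) {
        moverecord key = moves[i];

        for (j = i - 1; j >= 0 && moves[j].rEquity < key.rEquity; j--)
            moves[j + 1] = moves[j];
        moves[j + 1] = key;
    }
}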