Very interesting. Could it be that mujoco is packaged correctly in Guix, but itself selects different routines at run time depending on the CPU it is running on? The alternative would be that it is not packaged "correctly" (or not packaged at all!) and is compiled with different flags on different architectures, but I think that would have shown up when you investigated the diff.
Etienne

On Wed, May 14, 2025 at 8:45 AM Timothee Mathieu <timothee.math...@inria.fr> wrote:

> Hello,
>
> After a lot of experimentation and discussion with colleagues, I found
> the culprit! It seems to be AVX-512. Apparently, the physics behind my
> simulator uses AVX (cf
> https://mujoco.readthedocs.io/en/stable/programming/index.html).
> The result of my script is different on a computer that has AVX-512
> compared to one that does not have it (as verified through lscpu).
>
> I am not super familiar with such low-level instructions, but I verified
> that on three separate AVX-512 computers I got the same result and on five
> separate non-AVX-512 computers I got the other result.
>
> I am not sure I understand everything about AVX. I tried to tune the
> compilation for a CPU without AVX with
> https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/
> in order to get reproducible results, but it did not work, maybe because
> only a few of the dependency packages are tunable. Is there a way to force
> everything to use AVX and not AVX-512? I understand that AVX-512 is meant
> to be faster, but in my case, before being faster, I want to see whether
> it is possible to be reproducible.
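
As an aside on the lscpu check above: on Linux/x86 the same information appears on the "flags" line of /proc/cpuinfo, so the two groups of machines can be told apart from a script. The following is only a minimal Python sketch under that assumption; the helper names cpu_flags and has_avx512 are made up for illustration and do not come from any tool mentioned in this thread.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels list features under "flags"; other architectures differ.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512(flags: set[str]) -> bool:
    """True if any AVX-512 subset (avx512f, avx512bw, ...) is advertised."""
    return any(flag.startswith("avx512") for flag in flags)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: Linux on x86
    flags = cpu_flags(path.read_text()) if path.exists() else set()
    print("AVX:", "avx" in flags)
    print("AVX2:", "avx2" in flags)
    print("AVX-512:", has_avx512(flags))
```

Recording this output next to each experiment's result would make it easy to confirm that the two result groups line up with the presence of AVX-512.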
>
> Thanks,
> Timothée
>
>
> ----- Original Message -----
> > From: "Timothee Mathieu" <timothee.math...@inria.fr>
> > To: "Andreas Enge" <andr...@enge.fr>
> > Cc: "Ludovic Courtès" <ludovic.cour...@inria.fr>, "Steve George" <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>, "help-guix" <help-guix@gnu.org>
> > Sent: Wednesday, May 7, 2025 09:34:44
> > Subject: Re: Reproducibility of guix shell container across different host OS
>
> > I checked and I am now convinced that the fault lies in the physics
> > simulator: I tried other, simpler reinforcement learning environments and
> > everything was reproducible, so it is not due to the neural network part
> > (which is already impressive, I guess, as neural network libraries tend
> > to be quite a mess reproducibility-wise).
> >
> > So it seems that something weird is going on with mujoco, the physics
> > simulator for which we made a package. And it seems to be the interaction
> > between mujoco and the neural network from pytorch, because using random
> > actions seems reproducible.
> > I guess this could be due to floating point rounding error, although the
> > difference seems too large to be rounding error. The computation is quite
> > long, so maybe the errors amplify, but I am a bit doubtful about this
> > because I found complete reproducibility between my laptop and some
> > powerful servers with very different hardware. Wouldn't the results
> > differ across very different hardware if the problem was rounding error?
> >
> > Is there a way to check whether this is due to floating point rounding
> > error? I tried to use Float64 instead of Float32, and the results are
> > still non-reproducible (although it changes the value a little bit, on
> > the scale of 10^{-5}).
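
On the question of how to check for rounding effects: floating point addition is not associative, so the same numbers summed in a different grouping can give a different result. A vectorized reduction keeps one partial sum per SIMD lane, and a 512-bit register holds twice as many lanes as a 256-bit one, which changes the grouping. The Python sketch below only illustrates that mechanism. lanewise_sum is a hypothetical helper, the input is deliberately adversarial, and nothing here is taken from how mujoco or pytorch actually reduce.

```python
import math


def lanewise_sum(values, lanes):
    """Sum `values` with one partial sum per lane, then add up the partials.

    This mimics the grouping of a SIMD reduction; real implementations may
    combine the partial sums differently (for example pairwise).
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Adversarial input: adding 1.0 to 1e16 is lost to rounding in double precision.
values = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0]

s4 = lanewise_sum(values, 4)   # 4 doubles fit in a 256-bit register
s8 = lanewise_sum(values, 8)   # 8 doubles fit in a 512-bit register
exact = math.fsum(values)      # correctly rounded, independent of order

print(s4)     # 6.0
print(s8)     # 3.0
print(exact)  # 6.0
```

Real data is rarely this extreme, but in a long simulation where each step feeds the next, a last-bit difference of this kind can grow. That would also be consistent with identical results on quite different machines that happen to take the same code path.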
> >
> > Thanks,
> > Timothée
> >
> > ----- Original Message -----
> >> From: "Andreas Enge" <andr...@enge.fr>
> >> To: "Ludovic Courtès" <ludovic.cour...@inria.fr>
> >> Cc: "Timothee Mathieu" <timothee.math...@inria.fr>, "Steve George" <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>, "help-guix" <help-guix@gnu.org>
> >> Sent: Tuesday, May 6, 2025 10:30:12
> >> Subject: Re: Reproducibility of guix shell container across different host OS
> >
> >> On Tue, May 06, 2025 at 09:26:51AM +0200, Ludovic Courtès wrote:
> >>> Do you have evidence that the problem is a leak like this? Or could it
> >>> be that the Python code being run is non-deterministic?
> >>> If you run ‘guix shell -CN --no-cwd coreutils’, you can see with ‘ls’
> >>> etc. that nothing leaks from the host OS (apart, of course, from the
> >>> kernel).
> >>
> >> Or maybe the hardware "leaks"? Are the two machines exactly identical;
> >> in particular, do they have the exact same processor? Since the
> >> differences involve floating point computations, I would not be
> >> surprised if the precise processor architecture made a difference.
> >>
> >> Someone mentioned the IEEE-754 standard in the thread, which mandates
> >> that basic arithmetic operations follow precise, deterministic
> >> semantics, but not necessarily trigonometric functions.
> >>
> >> Also, if I remember well, special flags are required to make GCC emit
> >> IEEE-conforming code; otherwise the old, but faster, x86 80-bit extended
> >> precision built into the processor is used. I have seen a case where
> >> *printing* a variable changed its value, because this meant it would be
> >> moved from an 80-bit processor register to a 64-bit memory location.
> >> In other words, something like the following code:
> >>   double x = ...;
> >>   if (x != some value) {
> >>     printf ("%f", x);
> >>     if (x != some value) // the same value as above, of course
> >>       printf ("0");
> >>     else
> >>       printf ("1");
> >>   }
> >> would print x, followed by "1"...
> >>
> >> See this thread:
> >>   https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00277.html
> >> and commit 098bd280f82350073e8280e37d56a14162eed09c.
> >>
> >> If you want deterministic, reproducible floating point computations,
> >> I am afraid you would need to use the (comparatively slow in low
> >> precision) GNU MPFR and GNU MPC libraries; or use interval arithmetic
> >> from FLINT and replace exact comparisons by looking at intersections
> >> of intervals.
> >>
> >> Andreas
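
To make the last suggestion concrete, here is a toy Python sketch of replacing an exact comparison by an interval intersection test. It is not FLINT's API. The Interval class and the point helper are hypothetical, only addition is implemented, and each bound is simply pushed outward by one float with math.nextafter (Python 3.9 or later), which is enough to enclose the exact sum of the two stored values. The sketch also ignores the error made when a decimal literal such as 0.1 is converted to binary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        # Round-to-nearest is off by at most half an ulp, so stepping each
        # bound outward by one float encloses the exact sum.
        return Interval(
            math.nextafter(self.lo + other.lo, -math.inf),
            math.nextafter(self.hi + other.hi, math.inf),
        )

    def overlaps(self, other: "Interval") -> bool:
        """True if the two intervals share at least one point."""
        return self.lo <= other.hi and other.lo <= self.hi


def point(x: float) -> Interval:
    """Degenerate interval holding exactly the float x."""
    return Interval(x, x)


a = point(0.1) + point(0.2)
b = point(0.3)

print(0.1 + 0.2 == 0.3)  # False: the exact comparison depends on rounding
print(a.overlaps(b))     # True: the intervals intersect
```

The comparison then asks whether two results could be the same number, which does not depend on which way a particular machine rounded.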