Hello,

At last we managed to solve the problem!  The solution is maybe a bit hacky,
but for now it works.
After further investigation, it seems that the problem was not in the
simulator itself but in the fact that it simulates contacts, which are very
sensitive to even a small difference in the input actions.  I discovered that
PyTorch (and maybe other dependencies) has a reproducibility problem on the
order of 1e-5 when running on AVX-512 compared to AVX2.

I first tried to solve the problem by disabling AVX-512 at the level of
PyTorch, but it did not work.  The PyTorch developers said it may be because
some components dispatch computation to MKL-DNN, so I tried to disable
AVX-512 in MKL, and still the results were not reproducible; I also tried to
deactivate it in OpenMPI, without success.  I finally concluded that there
was a problem with AVX-512 somewhere in the dependency graph, but I gave up
identifying where, as this seems very complicated.

Instead, I found a tool, https://github.com/twosigma/libvirtcpuid/, which
allows me to mask AVX-512 from the process, and this worked!  I was able to
use it, by grafting a modified glibc in the `guix shell` command, to disable
AVX-512 and get the exact same result on both AVX-512 and non-AVX-512
computers, without much of an overhead (there is no VM; the only difference
seems to be a slight acceleration when using AVX-512, as expected).

I guess all of this should be a cautionary tale that it is sometimes
necessary to look carefully at the CPU flags in order to get reproducibility
with `guix shell`.  Ideally, I think that when we want something to be
reproducible, we may also want to communicate the CPU flags in addition to
the manifest and channels files (or at least test that changing the flags
does not change the results).

To get reproducible results, we packaged libvirtcpuid with Guix, but in a
pretty hacky way that works only when called through `guix shell -CF`, in
order to recover an FHS filesystem.  It would be great if someday a feature
to mask some CPU flags made its way into `guix shell` to improve
reproducibility, but I guess my case of having a big difference due to
AVX-512 is an edge case that does not happen often (?).  I have appended
below the quoted mail a rough sketch of the settings I tried and of the
`guix shell` invocation we ended up with, in case it is useful to someone.

Best,
Timothée

----- Mail original -----
> De: "Ludovic Courtès" <ludovic.cour...@inria.fr>
> À: "Timothee Mathieu" <timothee.math...@inria.fr>
> Cc: "Etienne B. Roesch" <etienne.roe...@gmail.com>, "Andreas Enge"
> <andr...@enge.fr>, "Steve George"
> <st...@futurile.net>, "Cayetano Santos" <csant...@inventati.org>, "help-guix"
> <help-guix@gnu.org>
> Envoyé: Mercredi 28 Mai 2025 16:14:27
> Objet: Re: Reproducibility of guix shell container across different host OS
> Hi,
>
> Timothee Mathieu <timothee.math...@inria.fr> writes:
>
>> We finally managed to prove that the problem was with avx-512 by using
>> qemu we can enable/disable avx-512 and do the computation with exactly
>> the same guix pack and recover that this gives different results. The
>> qemu avx-512 results match bitwise the results from laptop on Ubuntu
>> that have avx-512 and conversely that the qemu without avx-512 have
>> the same results as the Arch laptop that also does not have AVX-512.
>
> Are you saying that the same binaries in the same pack use AVX-512 when
> available and don’t use it otherwise?
>
> This is the “ideal” load-time adjustment¹ but then you could run into
> the kind of numerical issue that you experience.  It’s a problem that I
> would discuss with the authors of the library, perhaps starting with
> mujoco itself.
>
> Interesting case anyway!
>
> Ludo’.
>
> ¹ Discussed in
> <https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/>
> and used by libraries like glibc, OpenBLAS, and more.
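
P.S. As mentioned above, here are some rough sketches, in case they are
useful to someone.

The kind of knobs I experimented with before resorting to libvirtcpuid
looked roughly like the following (reconstructed from memory, so take the
exact values as illustrative rather than a precise record of what I ran):

    # Ask PyTorch's ATen CPU dispatcher to stay on AVX2 kernels.
    export ATEN_CPU_CAPABILITY=avx2
    # Cap the instruction sets MKL dispatches to, and request its
    # "conditional numerical reproducibility" AVX2 code path.
    export MKL_ENABLE_INSTRUCTIONS=AVX2
    export MKL_CBWR=AVX2

In my case none of this was enough; apparently something else in the
dependency graph still dispatched to AVX-512 code paths.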
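
The libvirtcpuid setup, very roughly (the package name glibc-virtcpuid, the
-L directory, the manifest and the script name below are placeholders, not
the exact ones we use; the real package definition is more involved and
lives in a local directory of package definitions):

    # Graft a glibc variant patched to load libvirtcpuid, so that CPUID no
    # longer advertises AVX-512 inside the container; -L points guix at the
    # directory containing that local package definition.
    guix shell -CF -L ./local-packages -m manifest.scm \
         --with-graft=glibc=glibc-virtcpuid \
         -- python3 run_experiment.py

Since there is no VM involved, only CPUID virtualization in user space, the
overhead is negligible.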
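
And regarding the suggestion to communicate the CPU flags along with the
manifest and channels files: even recording the output of something as
simple as

    grep -m1 '^flags' /proc/cpuinfo

next to those files would make this kind of discrepancy much easier to spot
after the fact.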