Hi Timothee,

On Fri, Jun 13, 2025 at 06:38:17PM +0200, Timothee Mathieu wrote:
> Hello,
>
> At last we managed to solve the problem! The solution is maybe a bit hacky
> but for now, it works.
>
> After further investigations, it seems that the problem was not from the
> simulator itself but with the fact that it simulates contact which are very
> sensitive to even a small difference in the input actions. I discovered that
> pytorch (and maybe other dependencies) has a reproducibility problem of order
> 1e-5 when on AVX512 compared to AVX2. I first tried to solve the problem by
> disabling AVX512 at the level of pytorch, but it did not work. The dev of
> pytorch said that it may be because some components dispatch computation to
> MKL-DNN, I tried to disable AVX512 on MKL, and still the results were not
> reproducible, I also tried to deactivate in openmpi without success.
> I finally concluded that there was a problem with AVX512 somewhere in the
> dependencies graph but I gave up identifying where, as this seems very
> complicated.
>
> Instead, I found a tool https://github.com/twosigma/libvirtcpuid/ which
> allows me to mask avx512 from the process and this worked! I was able to use
> it to modify glibc with a graft in the guix shell command to disable AVX512
> in a guix shell command and get the exact same result on both AVX512 and
> non-AVX512 computers without much of an overhead (there is no vm, the only
> difference seems to be a slight acceleration when using AVX512 as expected).
>
> I guess all of this should be a cautionary tale that sometimes it may be
> needed to look carefully at the cpu flags in order to get reproducibility
> with guix shell. Ideally I think when we want to have something reproducible,
> we may want to also communicate the cpu flags in addition to the manifest and
> channels file (or at least test that changing the flags does not change the
> results).
>
> To get reproducible results, we packaged libvirtcpuid with guix but in a
> pretty hacky way that works only if called through `guix shell -CF` in order
> to recover a FHS filesystem. It would be great if someday a feature to mask
> some CPU flag made its way to guix shell in order to improve reproducibility
> but I guess my case of having a big difference due to AVX512 is a limit case
> that does not happen often (?).

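On communicating the CPU flags alongside the manifest and channels file: a minimal sketch of how one might record which AVX512 flags a machine exposes, assuming Linux on x86 where /proc/cpuinfo has a "flags" line. The function names are mine and purely illustrative. I believe this reports what the kernel sees on the host, so I would not expect it to reflect masking done at the CPUID level by a tool like libvirtcpuid.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags from the first 'flags' line of /proc/cpuinfo-style text.

    Assumption: x86 Linux layout, where the key is literally 'flags'.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_flags(flags: set[str]) -> list[str]:
    """Return the sorted subset of flags whose name starts with 'avx512'."""
    return sorted(flag for flag in flags if flag.startswith("avx512"))


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    # Fall back to empty text on systems without /proc/cpuinfo.
    text = path.read_text() if path.exists() else ""
    found = avx512_flags(parse_cpu_flags(text))
    print("AVX512 flags:", " ".join(found) if found else "none")
```

Saving that output next to the manifest and channels file on each machine would at least make a difference like yours visible up front.
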
Well, congratulations on getting to the bottom of it and achieving your goal. It sounds like a complex problem, and it illustrates how difficult reproducibility is!

Steve / Futurile