Hi Timothee,

On Fri, Jun 13, 2025 at 06:38:17PM +0200, Timothee Mathieu wrote:
> Hello,
> 
> At last we managed to solve the problem! The solution is maybe a bit hacky 
> but for now, it works.
> 
> After further investigation, it seems that the problem came not from the
> simulator itself but from the fact that it simulates contacts, which are very
> sensitive to even a small difference in the input actions. I discovered that
> pytorch (and maybe other dependencies) has a reproducibility problem of order
> 1e-5 when running on AVX512 compared to AVX2. I first tried to solve the
> problem by disabling AVX512 at the level of pytorch, but it did not work. The
> dev of pytorch said that it may be because some components dispatch
> computation to MKL-DNN. I tried to disable AVX512 in MKL, and the results
> were still not reproducible. I also tried to deactivate it in openmpi,
> without success.
> I finally concluded that there was a problem with AVX512 somewhere in the
> dependency graph, but I gave up identifying where, as this seems very
> complicated.
> 
> Instead, I found a tool https://github.com/twosigma/libvirtcpuid/ which
> allows me to mask AVX512 from the process, and this worked! I was able to use
> it to modify glibc with a graft so that AVX512 is disabled inside a guix
> shell command, and I get the exact same results on both AVX512 and
> non-AVX512 computers without much overhead (there is no VM; the only
> difference seems to be a slight speedup when using AVX512, as expected).
> 
> I guess all of this should be a cautionary tale that sometimes it may be
> necessary to look carefully at the CPU flags in order to get reproducibility
> with guix shell. Ideally I think when we want to have something reproducible, 
> we may want to also communicate the cpu flags in addition to the manifest and 
> channels file (or at least test that changing the flags does not change the 
> results). 
> 
> To get reproducible results, we packaged libvirtcpuid with guix but in a
> pretty hacky way that works only if called through `guix shell -CF` in order
> to recover an FHS filesystem. It would be great if someday a feature to mask
> some CPU flags made its way to guix shell in order to improve reproducibility
> but I guess my case of having a big difference due to AVX512 is a corner case
> that does not happen often (?).

Well, congratulations on getting to the bottom of it and achieving your goal.

Sounds like it's a complex problem, and it illustrates how difficult
reproducibility is!
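
On your idea of communicating the CPU flags alongside the manifest and
channels file, something like the following might be a starting point. It is
only a minimal sketch, assuming a Linux x86 machine where /proc/cpuinfo has a
"flags" line; the function names are made up for illustration. It reads what
the kernel reports, so it may not reflect masking applied to a single process
by a tool like libvirtcpuid.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Assumption: the feature list is on a line labelled "flags" (x86 Linux).
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_flags(flags: set[str]) -> list[str]:
    """Return the sorted subset of flags that name AVX-512 features."""
    return sorted(f for f in flags if f.startswith("avx512"))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    flags = parse_cpu_flags(text)
    print("avx2:", "avx2" in flags)
    print("avx512:", avx512_flags(flags) or "none")
```

Saving that output next to the results would at least show whether two runs
were made on machines with different vector extensions.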

Steve / Futurile
