Hi Easybuilders,

I am trying to get the newest PyTorch to work with the foss/2023a toolchain.  
So far, the CPU-only version seems to work (PR #19184), and I am trying to get 
the CUDA version to work.

In the test suite, many tests fail with 
  fi_info: error while loading shared libraries: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory

It looks like it is trying to load an Omni-Path driver.  However, our few GPU 
nodes have neither Omni-Path nor InfiniBand, but they do share the EasyBuild 
module tree with the CPU nodes, and of course OpenMPI is built with Omni-Path 
support.  This does not normally cause trouble for MPI jobs that stay within a 
single node, which is our use case on the GPU nodes.
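
For anyone who wants to check the same thing on their own nodes, here is a 
minimal Python sketch that lists the shared libraries ldd cannot resolve for a 
binary such as fi_info.  The helper names missing_libs and check_binary are 
made up for illustration, and it assumes ldd is available and the relevant 
modules are already loaded.

```python
import shutil
import subprocess


def missing_libs(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        name, sep, target = line.partition("=>")
        if sep and target.strip() == "not found":
            missing.append(name.strip())
    return missing


def check_binary(name="fi_info"):
    """Run ldd on a binary found in PATH and list its unresolved libraries."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(name + " is not in PATH")
    # check=False: ldd may exit non-zero, but its output is still useful.
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libs(result.stdout)
```

Calling check_binary("fi_info") on a GPU node should show whether 
libpsm_infinipath.so.1 is the only library that fails to resolve there.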

I suspect the problem is due to PyTorch depending on NCCL and magma, and both 
of these depend on UCX-CUDA.  Does anyone have a suggestion for how to handle 
this?  Can one build a version of NCCL and magma without UCX-CUDA?  Can one 
disable it with an environment variable?  Or something else?

We would rather avoid having a completely different module tree for these few 
nodes, but if it is necessary then we will have to.

Best regards

Jakob

