Hi EasyBuilders,

I am trying to get the newest PyTorch to work with the foss/2023a toolchain. So far, the CPU-only version seems to work (PR #19184), and I am now trying to get the CUDA version to work.

In the test suite, many tests fail with

    fi_info: error while loading shared libraries: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory

It looks like it is trying to load an OmniPath driver. However, our few GPU nodes have neither OmniPath nor InfiniBand, but they do share the EasyBuild modules with the CPU nodes, and of course OpenMPI is built with OmniPath support. That does not normally cause trouble for MPI jobs that stay within a single node, which is our use case on the GPU nodes.

I suspect the problem is that PyTorch depends on NCCL and magma, and both of these depend on UCX-CUDA.

Does anyone have a suggestion for how to handle this? Can one build a version of NCCL and magma without UCX-CUDA? Can one disable it with an environment variable? Or something else?

We would rather avoid having a completely different module tree for these few nodes, but if it is necessary we will do so.

Best regards
Jakob

