Regarding the CUDA-enabled OpenMPI problem, see PR https://github.com/easybuilders/easybuild-easyconfigs/pull/14496
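
To check whether a particular OpenMPI installation was built with CUDA support, one option is to look at the mpi_built_with_cuda_support parameter reported by ompi_info. Below is a minimal Python sketch of that check. It assumes ompi_info from the loaded OpenMPI module is on PATH; the helper names parse_cuda_support and openmpi_has_cuda are made up for illustration and are not part of EasyBuild or PyTorch.

```python
import shutil
import subprocess

# Parameter name that ompi_info reports for CUDA-aware builds.
PARAM = "mpi_built_with_cuda_support"


def parse_cuda_support(ompi_info_output):
    """Scan `ompi_info --parsable --all` output for the CUDA support flag.

    Returns True or False if the parameter value is reported, None if the
    parameter does not appear in the output at all.
    """
    for line in ompi_info_output.splitlines():
        fields = line.strip().split(":")
        # The value line looks like:
        #   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
        if PARAM in fields and "value" in fields:
            return fields[-1].strip().lower() == "true"
    return None


def openmpi_has_cuda():
    """Run ompi_info (hypothetical helper) and report CUDA support.

    Returns None if ompi_info is not on PATH or the parameter is missing.
    """
    exe = shutil.which("ompi_info")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--parsable", "--all"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_cuda_support(result.stdout)
```

With the OpenMPI module loaded, calling openmpi_has_cuda() should return False for a build like the one the CMake warning in the quoted log complains about. This only reports how OpenMPI was built; it does not say anything about the tensorpipe ibv error.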

On 12/21/21 1:34 PM, Loris Bennett wrote:
> Hi,
>
> I am running
>
> eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot
> --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm
> --tmpdir=/scratch/eb-build
>
> on a GPU node. The build step succeeds but the tests fail with the error
>
> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid
> argument
>
> See below for full extract from the log file.
>
> There is a PyTorch issue
>
> https://github.com/pytorch/tensorpipe/issues/413
>
> which seems related and we do indeed have an Omnipath fabric.
>
> On the other hand, in the EB log file it says at some point:
>
> -- MPI libraries:
> /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
> CMake Warning at cmake/Dependencies.cmake:1081 (message):
> OpenMPI found, but it is not built with CUDA support.
>
> Could that be related? Is a CUDA-enabled OpenMPI needed? Or do we just
> need to skip the test?
>
> Cheers,
>
> Loris
>
> ============================= test session starts
> ==============================
> platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 --
> /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
> cachedir: .pytest_cache
> hypothesis profile 'default' ->
> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
> torch: 1.10.0
> rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile:
> pytest.ini
> plugins: hypothesis-6.13.1
> collecting ... collected 13 items
>
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR [
> 7%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [
> 15%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [
> 23%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [
> 30%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR [
> 38%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [
> 46%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [
> 53%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [
> 61%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR
> [ 69%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
> ERROR [ 76%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
> ERROR [ 84%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
> ERROR [ 92%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR
> [100%]
>
> ==================================== ERRORS
> ====================================
> _____________________ ERROR at setup of test_1to3[never-3]
> _____________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 229, in _init_rpc_backend
> rpc_agent = backend_registry.init_backend(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py",
> line 106, in init_backend
> return backend.value.init_backend_handler(*args, **kwargs)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py",
> line 309, in _tensorpipe_init_backend_handler
> api._init_rpc_states(agent)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py",
> line 114, in _init_rpc_states
> _set_and_start_rpc_agent(agent)
> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid
> argument
> ___________________ ERROR at setup of test_1to3[never-1:2]
> ____________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ____________________ ERROR at setup of test_1to3[never-2:1]
> ____________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ___________________ ERROR at setup of test_1to3[never-1:1:1]
> ___________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ____________________ ERROR at setup of test_1to3[always-3]
> _____________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ___________________ ERROR at setup of test_1to3[always-1:2]
> ____________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ___________________ ERROR at setup of test_1to3[always-2:1]
> ____________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> __________________ ERROR at setup of test_1to3[always-1:1:1]
> ___________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> __________________ ERROR at setup of test_1to3[except_last-3]
> __________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> _________________ ERROR at setup of test_1to3[except_last-1:2]
> _________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> _________________ ERROR at setup of test_1to3[except_last-2:1]
> _________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ________________ ERROR at setup of test_1to3[except_last-1:1:1]
> ________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> _______________________ ERROR at setup of test_none_skip
> _______________________
> Traceback (most recent call last):
> File
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
> line 44, in setup_rpc
> dist.rpc.init_rpc(
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 195, in init_rpc
> _init_rpc_backend(backend, store, name, rank, world_size,
> rpc_backend_options)
> File
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
> line 226, in _init_rpc_backend
> raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> =========================== short test summary info
> ============================
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] -
> Runt...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] -
> Ru...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] -
> Ru...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] -
> ...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] -
> Run...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] -
> R...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] -
> R...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
> ERROR
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip -
> RuntimeE...
> ============================== 13 errors in 0.17s
> ==============================
> distributed/pipeline/sync/skip/test_gpipe failed!
> Running distributed/pipeline/sync/skip/test_inspect_skip_layout ...
> [2021-12-21 09:34:23.699450]
> Executing
> ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python',
> '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py',
> '-v'] ... [2021-12-21 09:34:23.699498]
>

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

