Hi,

I am running

  eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build

on a GPU node. The build step succeeds, but the tests fail with the error

  RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument

See below for the full extract from the log file.

There is a PyTorch issue

  https://github.com/pytorch/tensorpipe/issues/413

which seems related, and we do indeed have an Omnipath fabric.

On the other hand, at some point the EB log file says:

  -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
  CMake Warning at cmake/Dependencies.cmake:1081 (message):
    OpenMPI found, but it is not built with CUDA support.

Could that be related? Is a CUDA-enabled OpenMPI needed? Or do we just need to skip the test?

Cheers,

Loris

============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
torch: 1.10.0
rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile: pytest.ini
plugins: hypothesis-6.13.1
collecting ...
collected 13 items

distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR [ 7%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 15%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 23%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [ 30%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR [ 38%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 46%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 53%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [ 61%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR [ 69%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] ERROR [ 76%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] ERROR [ 84%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] ERROR [ 92%]
distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR [100%]

==================================== ERRORS ====================================
_____________________ ERROR at setup of test_1to3[never-3] _____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 229, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 106, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 309, in _tensorpipe_init_backend_handler
    api._init_rpc_states(agent)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 114, in _init_rpc_states
    _set_and_start_rpc_agent(agent)
RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
___________________ ERROR at setup of test_1to3[never-1:2] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
____________________ ERROR at setup of test_1to3[never-2:1] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
___________________ ERROR at setup of test_1to3[never-1:1:1] ___________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
____________________ ERROR at setup of test_1to3[always-3] _____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
___________________ ERROR at setup of test_1to3[always-1:2] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
___________________ ERROR at setup of test_1to3[always-2:1] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
__________________ ERROR at setup of test_1to3[always-1:1:1] ___________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
__________________ ERROR at setup of test_1to3[except_last-3] __________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_________________ ERROR at setup of test_1to3[except_last-1:2] _________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_________________ ERROR at setup of test_1to3[except_last-2:1] _________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
________________ ERROR at setup of test_1to3[except_last-1:1:1] ________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_______________________ ERROR at setup of test_none_skip _______________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
=========================== short test summary info ============================
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - Runt...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - Ru...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - Ru...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] - ...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - Run...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - R...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - R...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - RuntimeE...
============================== 13 errors in 0.17s ==============================
distributed/pipeline/sync/skip/test_gpipe failed!
Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... [2021-12-21 09:34:23.699450]
Executing ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python', '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', '-v'] ... [2021-12-21 09:34:23.699498]

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
Email [email protected]

