It should not have complained about CUDA-enabled OpenMPI, but the failure is a separate problem.
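One way to see that from the quoted log: only the first of the 13 setup errors is the tensorpipe ibv.h one. The other 12 report "RPC is already initialized", which reads like fallout from the first failed init_rpc call rather than independent failures (that is my interpretation of the log, not something confirmed in this thread). Below is a minimal Python sketch that tallies the final exception line per test in a pytest log. The function names and the SAMPLE excerpt are made up for illustration, and it only looks at the first line of a wrapped message.

```python
import re
from collections import Counter

# pytest's per-test error header, e.g.
# "____ ERROR at setup of test_1to3[never-3] ____"
HEADER = re.compile(r"ERROR at setup of (\S+)")
# The unindented final line of a traceback, e.g. "RuntimeError: ..."
EXC = re.compile(r"^(\w+(?:Error|Exception)): (.*)$")


def strip_quote(line):
    """Remove leading email quote markers such as '>>> '."""
    return re.sub(r"^(?:>+\s?)+", "", line).rstrip()


def first_error_per_test(log_text):
    """Map each test name to the first exception line found after its header."""
    errors = {}
    current = None
    for raw in log_text.splitlines():
        line = strip_quote(raw)
        header = HEADER.search(line)
        if header:
            current = header.group(1)
            continue
        exc = EXC.match(line)
        if exc and current is not None and current not in errors:
            errors[current] = f"{exc.group(1)}: {exc.group(2)}"
    return errors


def summarize(log_text):
    """Count how many tests ended with each distinct exception line."""
    return Counter(first_error_per_test(log_text).values())


# Hypothetical, shortened excerpt in the shape of the quoted log.
SAMPLE = '''\
>>> ____ ERROR at setup of test_1to3[never-3] ____
>>> Traceback (most recent call last):
>>>     _set_and_start_rpc_agent(agent)
>>> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid
>>> argument
>>> ____ ERROR at setup of test_1to3[never-1:2] ____
>>> Traceback (most recent call last):
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
'''

if __name__ == "__main__":
    for message, count in summarize(SAMPLE).most_common():
        print(count, message)
```

Run against the full log, a tally like this makes it easier to focus on the one distinct failure instead of the repeated follow-on errors.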
On 12/22/21 3:09 PM, Loris Bennett wrote:
> Hi Åke,
>
> Thanks for the pointer. However, when I tried
>
>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
>     --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
>     --tmpdir=/scratch/eb-build --from-pr 14496
>
> it failed in the same way.
>
> Cheers,
>
> Loris
>
> Åke Sandgren <[email protected]> writes:
>
>> Regarding the CUDA-enabled OpenMPI problem, see PR
>> https://github.com/easybuilders/easybuild-easyconfigs/pull/14496
>>
>> On 12/21/21 1:34 PM, Loris Bennett wrote:
>>> Hi,
>>>
>>> I am running
>>>
>>>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
>>>     --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
>>>     --tmpdir=/scratch/eb-build
>>>
>>> on a GPU node. The build step succeeds but the tests fail with the error
>>>
>>>   RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>>>
>>> See below for the full extract from the log file.
>>>
>>> There is a PyTorch issue
>>>
>>>   https://github.com/pytorch/tensorpipe/issues/413
>>>
>>> which seems related, and we do indeed have an Omnipath fabric.
>>>
>>> On the other hand, in the EB log file it says at some point:
>>>
>>>   -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
>>>   CMake Warning at cmake/Dependencies.cmake:1081 (message):
>>>     OpenMPI found, but it is not built with CUDA support.
>>>
>>> Could that be related? Is a CUDA-enabled OpenMPI needed? Or do we just
>>> need to skip the test?
>>>
>>> Cheers,
>>>
>>> Loris
>>>
>>> ============================= test session starts ==============================
>>> platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
>>> cachedir: .pytest_cache
>>> hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
>>> torch: 1.10.0
>>> rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile: pytest.ini
>>> plugins: hypothesis-6.13.1
>>> collecting ... collected 13 items
>>>
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR [  7%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 15%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 23%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [ 30%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR [ 38%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 46%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 53%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [ 61%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR [ 69%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] ERROR [ 76%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] ERROR [ 84%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] ERROR [ 92%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR [100%]
>>>
>>> ==================================== ERRORS ====================================
>>> _____________________ ERROR at setup of test_1to3[never-3] _____________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 229, in _init_rpc_backend
>>>     rpc_agent = backend_registry.init_backend(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 106, in init_backend
>>>     return backend.value.init_backend_handler(*args, **kwargs)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 309, in _tensorpipe_init_backend_handler
>>>     api._init_rpc_states(agent)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 114, in _init_rpc_states
>>>     _set_and_start_rpc_agent(agent)
>>> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>>> ___________________ ERROR at setup of test_1to3[never-1:2] ____________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ____________________ ERROR at setup of test_1to3[never-2:1] ____________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ___________________ ERROR at setup of test_1to3[never-1:1:1] ___________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ____________________ ERROR at setup of test_1to3[always-3] _____________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ___________________ ERROR at setup of test_1to3[always-1:2] ____________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ___________________ ERROR at setup of test_1to3[always-2:1] ____________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> __________________ ERROR at setup of test_1to3[always-1:1:1] ___________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> __________________ ERROR at setup of test_1to3[except_last-3] __________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> _________________ ERROR at setup of test_1to3[except_last-1:2] _________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> _________________ ERROR at setup of test_1to3[except_last-2:1] _________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ________________ ERROR at setup of test_1to3[except_last-1:1:1] ________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> _______________________ ERROR at setup of test_none_skip _______________________
>>> Traceback (most recent call last):
>>>   File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
>>>   File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> =========================== short test summary info ============================
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - Runt...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - Ru...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - Ru...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] - ...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - Run...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - R...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - R...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - RuntimeE...
>>> ============================== 13 errors in 0.17s ==============================
>>> distributed/pipeline/sync/skip/test_gpipe failed!
>>> Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... [2021-12-21 09:34:23.699450]
>>> Executing ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python', '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', '-v'] ... [2021-12-21 09:34:23.699498]
>>>
>>>

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

