Regarding the CUDA-enabled OpenMPI problem, see PR
https://github.com/easybuilders/easybuild-easyconfigs/pull/14496
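
The CMake warning quoted below ("OpenMPI found, but it is not built with CUDA support") can be cross-checked against the installed OpenMPI itself. This is a minimal sketch, not part of the PR: it assumes `ompi_info` is on the PATH and that it reports the `mpi_built_with_cuda_support` parameter in its parsable output, as the Open MPI FAQ describes. The function names are hypothetical.

```python
import subprocess

# Parameter suffix as printed by `ompi_info --parsable --all`
# (assumption: format follows the Open MPI FAQ).
CUDA_PARAM = "mpi_built_with_cuda_support:value:"


def built_with_cuda(ompi_info_output):
    """Return True if parsable ompi_info output reports CUDA support."""
    for line in ompi_info_output.splitlines():
        if CUDA_PARAM in line:
            # The value is the last colon-separated field: "true" or "false".
            return line.rsplit(":", 1)[-1].strip() == "true"
    # Parameter not reported at all: treat the build as not CUDA-aware.
    return False


def check_installed_openmpi():
    """Hypothetical helper: run ompi_info from PATH and inspect its output."""
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True,
        text=True,
        check=True,
    )
    return built_with_cuda(result.stdout)
```

Calling `check_installed_openmpi()` after loading the OpenMPI module should show whether that build is CUDA-aware. The parsing is kept in a separate function so it can be tried on saved `ompi_info` output from another node.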

On 12/21/21 1:34 PM, Loris Bennett wrote:
> Hi,
> 
> I am running 
> 
>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build
> 
> on a GPU node.  The build step succeeds but the tests fail with the error
> 
>   RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
> 
> See below for full extract from the log file.
> 
> There is a PyTorch issue 
> 
>   https://github.com/pytorch/tensorpipe/issues/413
> 
> which seems related and we do indeed have an Omnipath fabric.
> 
> On the other hand, in the EB log file it says at some point:
> 
>   -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
>   CMake Warning at cmake/Dependencies.cmake:1081 (message):
>     OpenMPI found, but it is not built with CUDA support.
> 
> Could that be related?  Is a CUDA-enabled OpenMPI needed?  Or do we just
> need to skip the test?
> 
> Cheers,
> 
> Loris
> 
> ============================= test session starts 
> ==============================
> platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- 
> /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
> cachedir: .pytest_cache
> hypothesis profile 'default' -> 
> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
> torch: 1.10.0
> rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile: 
> pytest.ini
> plugins: hypothesis-6.13.1
> collecting ... collected 13 items
> 
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR   [  
> 7%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 
> 15%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 
> 23%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [ 
> 30%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR  [ 
> 38%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 
> 46%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 
> 53%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [ 
> 61%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR 
> [ 69%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] 
> ERROR [ 76%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] 
> ERROR [ 84%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] 
> ERROR [ 92%]
> distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR       
> [100%]
> 
> ==================================== ERRORS 
> ====================================
> _____________________ ERROR at setup of test_1to3[never-3] 
> _____________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 229, in _init_rpc_backend
>     rpc_agent = backend_registry.init_backend(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py",
>  line 106, in init_backend
>     return backend.value.init_backend_handler(*args, **kwargs)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py",
>  line 309, in _tensorpipe_init_backend_handler
>     api._init_rpc_states(agent)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py",
>  line 114, in _init_rpc_states
>     _set_and_start_rpc_agent(agent)
> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid 
> argument
> ___________________ ERROR at setup of test_1to3[never-1:2] 
> ____________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ____________________ ERROR at setup of test_1to3[never-2:1] 
> ____________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ___________________ ERROR at setup of test_1to3[never-1:1:1] 
> ___________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ____________________ ERROR at setup of test_1to3[always-3] 
> _____________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ___________________ ERROR at setup of test_1to3[always-1:2] 
> ____________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ___________________ ERROR at setup of test_1to3[always-2:1] 
> ____________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> __________________ ERROR at setup of test_1to3[always-1:1:1] 
> ___________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> __________________ ERROR at setup of test_1to3[except_last-3] 
> __________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> _________________ ERROR at setup of test_1to3[except_last-1:2] 
> _________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> _________________ ERROR at setup of test_1to3[except_last-2:1] 
> _________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> ________________ ERROR at setup of test_1to3[except_last-1:1:1] 
> ________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> _______________________ ERROR at setup of test_none_skip 
> _______________________
> Traceback (most recent call last):
>   File 
> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>  line 44, in setup_rpc
>     dist.rpc.init_rpc(
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 195, in init_rpc
>     _init_rpc_backend(backend, store, name, rank, world_size, 
> rpc_backend_options)
>   File 
> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>  line 226, in _init_rpc_backend
>     raise RuntimeError("RPC is already initialized")
> RuntimeError: RPC is already initialized
> =========================== short test summary info 
> ============================
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - 
> Runt...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - 
> Ru...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - 
> Ru...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] - 
> ...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - 
> Run...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - 
> R...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - 
> R...
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
> ERROR 
> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - 
> RuntimeE...
> ============================== 13 errors in 0.17s 
> ==============================
> distributed/pipeline/sync/skip/test_gpipe failed!
> Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... 
> [2021-12-21 09:34:23.699450]
> Executing 
> ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python', 
> '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', 
> '-v'] ... [2021-12-21 09:34:23.699498]
> 
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected]   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
