It should not have complained about CUDA-enabled OpenMPI, but the test
failure is a separate problem.
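
To see what the loaded OpenMPI module itself reports, independently of
the CMake warning, one option is to look for the
mpi_built_with_cuda_support flag in the ompi_info output. A minimal
sketch, assuming ompi_info is on the PATH after loading the module; the
function names are made up for illustration:

```python
import subprocess
from typing import Optional

# Key that appears in `ompi_info --parsable --all` output, e.g.
# mca:mpi:base:param:mpi_built_with_cuda_support:value:true
KEY = "mpi_built_with_cuda_support:value:"


def cuda_support_from_ompi_info(text: str) -> Optional[bool]:
    """Return True/False if the flag is reported, None if it is absent."""
    for line in text.splitlines():
        if KEY in line:
            # The value is the last colon-separated field on the line.
            return line.rsplit(":", 1)[1].strip().lower() == "true"
    return None


def query_ompi_info() -> Optional[bool]:
    """Run ompi_info from the currently loaded OpenMPI (hypothetical helper)."""
    out = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    return cuda_support_from_ompi_info(out)
```

Calling query_ompi_info() on the build node should return False for an
OpenMPI built without CUDA support, and None if the flag is not listed
at all.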

On 12/22/21 3:09 PM, Loris Bennett wrote:
> Hi Åke,
> 
> Thanks for the pointer.  However, when I tried
> 
>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
>     --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
>     --tmpdir=/scratch/eb-build --from-pr 14496
> 
> it failed in the same way.
> 
> Cheers,
> 
> Loris
> 
> Åke Sandgren <[email protected]> writes:
> 
>> Regarding the CUDA-enabled OpenMPI problem, see PR
>> https://github.com/easybuilders/easybuild-easyconfigs/pull/14496
>>
>> On 12/21/21 1:34 PM, Loris Bennett wrote:
>>> Hi,
>>>
>>> I am running 
>>>
>>>   eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
>>>     --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
>>>     --tmpdir=/scratch/eb-build
>>>
>>> on a GPU node.  The build step succeeds but the tests fail with the error
>>>
>>>   RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
>>>
>>> See below for full extract from the log file.
>>>
>>> There is a PyTorch issue 
>>>
>>>   https://github.com/pytorch/tensorpipe/issues/413
>>>
>>> which seems related and we do indeed have an Omnipath fabric.
>>>
>>> On the other hand, in the EB log file it says at some point:
>>>
>>>   -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
>>>   CMake Warning at cmake/Dependencies.cmake:1081 (message):
>>>     OpenMPI found, but it is not built with CUDA support.
>>>
>>> Could that be related?  Is a CUDA-enabled OpenMPI needed?  Or do we just
>>> need to skip the test?
>>>
>>> Cheers,
>>>
>>> Loris
>>>
>>> ============================= test session starts 
>>> ==============================
>>> platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- 
>>> /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
>>> cachedir: .pytest_cache
>>> hypothesis profile 'default' -> 
>>> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
>>> torch: 1.10.0
>>> rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, 
>>> configfile: pytest.ini
>>> plugins: hypothesis-6.13.1
>>> collecting ... collected 13 items
>>>
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR   [  
>>> 7%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 
>>> 15%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 
>>> 23%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR 
>>> [ 30%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR  [ 
>>> 38%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 
>>> 46%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 
>>> 53%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR 
>>> [ 61%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] 
>>> ERROR [ 69%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] 
>>> ERROR [ 76%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] 
>>> ERROR [ 84%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] 
>>> ERROR [ 92%]
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR       
>>> [100%]
>>>
>>> ==================================== ERRORS 
>>> ====================================
>>> _____________________ ERROR at setup of test_1to3[never-3] 
>>> _____________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 229, in _init_rpc_backend
>>>     rpc_agent = backend_registry.init_backend(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py",
>>>  line 106, in init_backend
>>>     return backend.value.init_backend_handler(*args, **kwargs)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py",
>>>  line 309, in _tensorpipe_init_backend_handler
>>>     api._init_rpc_states(agent)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py",
>>>  line 114, in _init_rpc_states
>>>     _set_and_start_rpc_agent(agent)
>>> RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid 
>>> argument
>>> ___________________ ERROR at setup of test_1to3[never-1:2] 
>>> ____________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ____________________ ERROR at setup of test_1to3[never-2:1] 
>>> ____________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ___________________ ERROR at setup of test_1to3[never-1:1:1] 
>>> ___________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ____________________ ERROR at setup of test_1to3[always-3] 
>>> _____________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ___________________ ERROR at setup of test_1to3[always-1:2] 
>>> ____________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ___________________ ERROR at setup of test_1to3[always-2:1] 
>>> ____________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> __________________ ERROR at setup of test_1to3[always-1:1:1] 
>>> ___________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> __________________ ERROR at setup of test_1to3[except_last-3] 
>>> __________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> _________________ ERROR at setup of test_1to3[except_last-1:2] 
>>> _________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> _________________ ERROR at setup of test_1to3[except_last-2:1] 
>>> _________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> ________________ ERROR at setup of test_1to3[except_last-1:1:1] 
>>> ________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> _______________________ ERROR at setup of test_none_skip 
>>> _______________________
>>> Traceback (most recent call last):
>>>   File 
>>> "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py",
>>>  line 44, in setup_rpc
>>>     dist.rpc.init_rpc(
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 195, in init_rpc
>>>     _init_rpc_backend(backend, store, name, rank, world_size, 
>>> rpc_backend_options)
>>>   File 
>>> "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py",
>>>  line 226, in _init_rpc_backend
>>>     raise RuntimeError("RPC is already initialized")
>>> RuntimeError: RPC is already initialized
>>> =========================== short test summary info 
>>> ============================
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - 
>>> Runt...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - 
>>> Ru...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - 
>>> Ru...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] 
>>> - ...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - 
>>> Run...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - 
>>> R...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - 
>>> R...
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
>>> ERROR 
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
>>> ERROR 
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
>>> ERROR 
>>> distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
>>> ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - 
>>> RuntimeE...
>>> ============================== 13 errors in 0.17s 
>>> ==============================
>>> distributed/pipeline/sync/skip/test_gpipe failed!
>>> Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... 
>>> [2021-12-21 09:34:23.699450]
>>> Executing 
>>> ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python',
>>>  '-m', 'pytest', 
>>> 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', '-v'] ... 
>>> [2021-12-21 09:34:23.699498]
>>>
>>>
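
In the log above only the first setup error comes from tensorpipe; the
other twelve are "RPC is already initialized", which looks like a
follow-on from the first failed init_rpc call rather than twelve
separate problems. A rough sketch for tallying this from a saved copy of
the log (the function name is made up for illustration, and messages
that were wrapped by the mailer are only captured up to the line break):

```python
from collections import Counter


def summarize_runtime_errors(log_text: str):
    """Return (first_message, Counter of messages) for RuntimeError lines."""
    messages = []
    for line in log_text.splitlines():
        # Drop email quote markers and indentation before matching.
        stripped = line.lstrip("> ").strip()
        # Only final "RuntimeError: ..." lines match; "raise RuntimeError(...)"
        # source lines in the traceback start with "raise" and are skipped.
        if stripped.startswith("RuntimeError:"):
            messages.append(stripped[len("RuntimeError:"):].strip())
    first = messages[0] if messages else None
    return first, Counter(messages)
```

The first element of the result is the message worth chasing; the
counter shows how many of the remaining errors are just repeats.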

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected]   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
