Hi,

I am running 

  eb PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --robot \
    --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm \
    --tmpdir=/scratch/eb-build

on a GPU node.  The build step succeeds but the tests fail with the error

  RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument

See below for full extract from the log file.

There is a TensorPipe issue

  https://github.com/pytorch/tensorpipe/issues/413

which seems related, and we do indeed have an Omni-Path fabric.

On the other hand, the EB log file says at one point:

  -- MPI libraries: /trinity/shared/easybuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so
  CMake Warning at cmake/Dependencies.cmake:1081 (message):
    OpenMPI found, but it is not built with CUDA support.

Could that be related?  Is a CUDA-enabled OpenMPI needed?  Or do we just
need to skip the test?
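
To double-check the CMake warning independently of the PyTorch build, the
OpenMPI installation can be asked directly.  What follows is only a minimal
sketch: it assumes that `ompi_info` from the loaded OpenMPI module is on the
PATH and that it reports the `mpi_built_with_cuda_support` parameter in the
form described in the Open MPI FAQ.  The helper names are made up for
illustration.

```python
import subprocess
from typing import Optional

# Parameter name as printed by `ompi_info --parsable --all`
# (assumption: format as described in the Open MPI FAQ).
CUDA_PARAM = "mpi_built_with_cuda_support:value:"


def parse_cuda_support(ompi_info_output: str) -> Optional[bool]:
    """Return True/False if the CUDA-support line is present, else None."""
    for line in ompi_info_output.splitlines():
        if CUDA_PARAM in line:
            return line.strip().endswith(":true")
    return None


def openmpi_has_cuda() -> Optional[bool]:
    """Run ompi_info (assumed to be on PATH) and parse its output."""
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True, text=True, check=True,
    )
    return parse_cuda_support(result.stdout)


# Parsing a single sample line, without needing OpenMPI installed:
sample = "mca:mpi:base:param:mpi_built_with_cuda_support:value:false"
print(parse_cuda_support(sample))
```

On the GPU node, calling `openmpi_has_cuda()` with the OpenMPI module loaded
should then confirm or contradict what CMake reported.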

Cheers,

Loris

============================= test session starts ==============================
platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/.hypothesis/examples')
torch: 1.10.0
rootdir: /dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch, configfile: pytest.ini
plugins: hypothesis-6.13.1
collecting ... collected 13 items

distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] ERROR   [  7%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] ERROR [ 15%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] ERROR [ 23%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] ERROR [ 30%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] ERROR  [ 38%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] ERROR [ 46%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] ERROR [ 53%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1] ERROR [ 61%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3] ERROR [ 69%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2] ERROR [ 76%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1] ERROR [ 84%]
distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1] ERROR [ 92%]
distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip ERROR       [100%]

==================================== ERRORS ====================================
_____________________ ERROR at setup of test_1to3[never-3] _____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 229, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 106, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 309, in _tensorpipe_init_backend_handler
    api._init_rpc_states(agent)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 114, in _init_rpc_states
    _set_and_start_rpc_agent(agent)
RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
___________________ ERROR at setup of test_1to3[never-1:2] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
____________________ ERROR at setup of test_1to3[never-2:1] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
___________________ ERROR at setup of test_1to3[never-1:1:1] ___________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
____________________ ERROR at setup of test_1to3[always-3] _____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
___________________ ERROR at setup of test_1to3[always-1:2] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
___________________ ERROR at setup of test_1to3[always-2:1] ____________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
__________________ ERROR at setup of test_1to3[always-1:1:1] ___________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
__________________ ERROR at setup of test_1to3[except_last-3] __________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_________________ ERROR at setup of test_1to3[except_last-1:2] _________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_________________ ERROR at setup of test_1to3[except_last-2:1] _________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
________________ ERROR at setup of test_1to3[except_last-1:1:1] ________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
_______________________ ERROR at setup of test_none_skip _______________________
Traceback (most recent call last):
  File "/dev/shm/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/pipeline/sync/conftest.py", line 44, in setup_rpc
    dist.rpc.init_rpc(
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 195, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch/eb-build/eb-jl3G9p/tmpdboomz/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 226, in _init_rpc_backend
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
=========================== short test summary info ============================
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-3] - Runt...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:2] - Ru...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-2:1] - Ru...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[never-1:1:1] - ...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-3] - Run...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:2] - R...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-2:1] - R...
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[always-1:1:1]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-3]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:2]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-2:1]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_1to3[except_last-1:1:1]
ERROR distributed/pipeline/sync/skip/test_gpipe.py::test_none_skip - RuntimeE...
============================== 13 errors in 0.17s ==============================
distributed/pipeline/sync/skip/test_gpipe failed!
Running distributed/pipeline/sync/skip/test_inspect_skip_layout ... [2021-12-21 09:34:23.699450]
Executing ['/trinity/shared/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python', '-m', 'pytest', 'distributed/pipeline/sync/skip/test_inspect_skip_layout.py', '-v'] ... [2021-12-21 09:34:23.699498]
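
A note on reading the extract above: only the first test fails with the
ibv.h error.  The other twelve report "RPC is already initialized", which is
presumably a knock-on effect of the first failed init_rpc call rather than a
separate problem.  A minimal sketch for pulling the distinct final error
messages out of such a pytest log follows.  It assumes the log text is
available as a string; the function name and the shortened sample text are
illustrative only.

```python
import re
from collections import Counter

# Matches the last line of a Python traceback, e.g. "RuntimeError: ...".
# Indented "raise ..." lines and pytest's "ERROR ..." lines do not match.
ERROR_LINE = re.compile(r"^(\w+(?:Error|Exception)): (.*)$")


def summarize_errors(log_text: str) -> Counter:
    """Count each distinct 'SomeError: message' line in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[f"{match.group(1)}: {match.group(2)}"] += 1
    return counts


# Shortened sample in the shape of the log extract.
sample = '''\
RuntimeError: In operator() at tensorpipe/common/ibv.h:172 "": Invalid argument
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
    raise RuntimeError("RPC is already initialized")
RuntimeError: RPC is already initialized
'''

for message, count in summarize_errors(sample).most_common():
    print(count, message)
```

Run over the full log, this would make it easier to see whether any error
other than the ibv.h one and its follow-ups is hiding in the output.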


-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email [email protected]
