Dear EasyBuilders,

I'm trying to build PyTorch-1.7.1-fosscuda-2020b.eb on a CentOS 7 server with some NVIDIA GPUs, but the build fails in the test step after about 2 hours:

$ eb PyTorch-1.7.1-fosscuda-2020b.eb -r
== Temporary log file in case of crash /tmp/eb-zAAAvr/easybuild-TDNRVQ.log
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== resolving dependencies ...
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
== building and installing PyTorch/1.7.1-fosscuda-2020b...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/PyTorch/1.7.1/fosscuda-2020b): build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou (took 1 hour 59 min 46 sec)
== Results of the build can be found in the log file(s) /tmp/eb-zAAAvr/easybuild-PyTorch-1.7.1-20210601.074610.WfkGf.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou')


The EB log file shows these 4 errors at the end of the file:

======================================================================
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

======================================================================
ERROR: test_DistributedDataParallel_with_grad_is_view (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10

----------------------------------------------------------------------
Ran 134 tests in 286.115s

FAILED (errors=4, skipped=91)
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 728, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_distributed_fork failed!
 (at easybuild/tools/run.py:537 in parse_cmd_output)
== 2021-06-01 09:45:57,406 filetools.py:1810 INFO Removing lock /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock...
== 2021-06-01 09:45:57,407 filetools.py:347 INFO Path /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock successfully removed.
== 2021-06-01 09:45:57,407 filetools.py:1814 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock
== 2021-06-01 09:45:57,407 easyblock.py:3414 WARNING build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou
== 2021-06-01 09:45:57,407 easyblock.py:298 INFO Closing log for application name PyTorch version 1.7.1
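
The failing test names in a log like the one above can be listed with a short script. This is only a minimal sketch, assuming the log contains the usual unittest result headers of the form "ERROR: name (class)" or "FAIL: name (class)"; the function name failed_tests is hypothetical and is not part of EasyBuild or PyTorch.

```python
import re

# Matches unittest result headers such as
# "ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)".
# Other lines that start with "ERROR:" (for example the EasyBuild
# "ERROR: Build of ..." summary) do not have this shape and are ignored.
HEADER = re.compile(r"^(ERROR|FAIL): (\S+) \((\S+)\)\s*$")


def failed_tests(log_text):
    """Return (status, test_name, test_class) tuples in order of appearance.

    Hypothetical helper for illustration only.
    """
    results = []
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            results.append(match.groups())
    return results


if __name__ == "__main__":
    # Small excerpt in the same format as the log quoted above.
    sample = (
        "ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)\n"
        "RuntimeError: Processes 0 1 2 exited with error code 10\n"
    )
    for status, name, cls in failed_tests(sample):
        print(status, name, cls)
```

To run it on a real log, read the file contents into a string and pass that string to failed_tests.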


Question: Does anyone know how to fix these errors?

Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
