Hi Ole,
This error doesn't mean anything in particular to me, but perhaps it
rings a bell for Alexander (in CC).
There are a couple of PyTorch-related fixes that will be included in
the upcoming EasyBuild v4.4.0 release (hopefully out tomorrow), so
keep an eye out for that.
Regards,
Kenneth
On 01/06/2021 09:56, Ole Holm Nielsen wrote:
Dear EasyBuilders,
I'm trying to build PyTorch-1.7.1-fosscuda-2020b.eb on a CentOS 7 server
with some NVIDIA GPUs, and the build fails in the test step after about
2 hours:
$ eb PyTorch-1.7.1-fosscuda-2020b.eb -r
== Temporary log file in case of crash /tmp/eb-zAAAvr/easybuild-TDNRVQ.log
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== found valid index for /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs, so using it...
== resolving dependencies ...
== processing EasyBuild easyconfig /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
== building and installing PyTorch/1.7.1-fosscuda-2020b...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/PyTorch/1.7.1/fosscuda-2020b): build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou (took 1 hour 59 min 46 sec)
== Results of the build can be found in the log file(s) /tmp/eb-zAAAvr/easybuild-PyTorch-1.7.1-20210601.074610.WfkGf.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou')
The EasyBuild log file shows these 4 errors near the end, all from
distributed/test_distributed_fork:
======================================================================
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR: test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR: test_DistributedDataParallel_with_grad_is_view (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 267, in wrapper
    self._join_processes(fn)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 384, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 420, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 0 1 2 exited with error code 10
----------------------------------------------------------------------
Ran 134 tests in 286.115s
FAILED (errors=4, skipped=91)
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 728, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_distributed_fork failed!
(at easybuild/tools/run.py:537 in parse_cmd_output)
== 2021-06-01 09:45:57,406 filetools.py:1810 INFO Removing lock /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock...
== 2021-06-01 09:45:57,407 filetools.py:347 INFO Path /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock successfully removed.
== 2021-06-01 09:45:57,407 filetools.py:1814 INFO Lock removed: /home/modules/software/.locks/_home_modules_software_PyTorch_1.7.1-fosscuda-2020b.lock
== 2021-06-01 09:45:57,407 easyblock.py:3414 WARNING build failed (first 300 chars): cmd "export PYTHONPATH=/tmp/eb-zAAAvr/tmpnh77Vl/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --verbose -x distributed/rpc/test_process_group_agent test_quantization " exited with exit code 1 and ou
== 2021-06-01 09:45:57,407 easyblock.py:298 INFO Closing log for application name PyTorch version 1.7.1
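In case it helps anyone scanning a long log like this one, here is a minimal Python sketch that lists the failing tests. It assumes the standard unittest header format seen in the excerpt above ("ERROR: test_name (module.Class)"). The function name failed_tests and the regular expression are only an illustration and are not part of EasyBuild or PyTorch.

```python
import re

# Matches unittest failure headers such as
#   ERROR: test_name (__main__.TestClass)
# \s+ also spans line breaks, so a header that a mail client has
# hard-wrapped over several lines still matches.
HEADER = re.compile(r"^(ERROR|FAIL):\s+(\w+)\s+\(([\w.]+)\)", re.MULTILINE)


def failed_tests(log_text):
    """Return (kind, test_name, test_class) for each failure header found."""
    return [m.groups() for m in HEADER.finditer(log_text)]


# Sample lines taken from the log excerpt; the second header is deliberately
# split across lines to mimic mail-client wrapping.
SAMPLE = """\
ERROR: test_DistributedDataParallel (__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
RuntimeError: Processes 0 1 2 exited with error code 10
======================================================================
ERROR:
test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient
(__main__.TestDistBackendWithFork)
----------------------------------------------------------------------
"""

if __name__ == "__main__":
    for kind, name, cls in failed_tests(SAMPLE):
        print(f"{kind}: {cls}.{name}")
```

To run it on a real log, read the file into a string and pass that to failed_tests instead of SAMPLE. Lines such as "ERROR: Build of ..." or "FAILED (errors=4, skipped=91)" are not matched, since they lack the "name (class)" shape.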
Question: Does anyone know how to fix these errors?
Thanks,
Ole