On 5/27/21 1:06 PM, Alexander Grund wrote:
Yes: at the very bottom of the log there should be more information about the
failed tests. For each of those (2) tests there should be some more
detailed output.
Search for "At least 2 gpu tests failed" and look below.
This is at the very end of the logfile:
[----------] Global test environment tear-down
[==========] 19 tests from 2 test suites ran. (2972 ms total)
[ PASSED ] 18 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
== 2021-05-27 12:35:39,386 build_log.py:169 ERROR EasyBuild crashed with
an error (at easybuild/base/exceptions.py:124 in __init__): At least 2 gpu
tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(at easybuild/easyblocks/t/tensorflow.py:973 in test_step)
== 2021-05-27 12:35:39,386 filetools.py:1810 INFO Removing lock
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock...
== 2021-05-27 12:35:39,387 filetools.py:347 INFO Path
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock
successfully removed.
== 2021-05-27 12:35:39,388 filetools.py:1814 INFO Lock removed:
/home/modules/software/.locks/_home_modules_software_TensorFlow_2.4.1-fosscuda-2020b.lock
== 2021-05-27 12:35:39,388 easyblock.py:3414 WARNING build failed (first
300 chars): At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
== 2021-05-27 12:35:39,388 easyblock.py:298 INFO Closing log for
application name TensorFlow version 2.4.1
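Since the log is large, the failed test names can be pulled out with a short script rather than by paging through the file. The following is only a minimal sketch, not part of EasyBuild: the function names and the regular expression are assumptions for illustration, based on the GoogleTest "[ FAILED ]" lines shown above.

```python
import re
from typing import Iterable, List

# Matches GoogleTest result lines such as
# "[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority".
# Summary lines like "[ FAILED ] 1 test, listed below:" begin with a digit
# after the bracket and are therefore not matched.
FAILED_TEST_RE = re.compile(r"\[\s*FAILED\s*\]\s+([A-Za-z_][\w/]*\.[\w/]+)")


def failed_gtest_names(lines: Iterable[str]) -> List[str]:
    """Return the unique failed test names, in order of first appearance."""
    seen: List[str] = []
    for line in lines:
        match = FAILED_TEST_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def failed_gtest_names_in_file(path: str) -> List[str]:
    """Scan a log file line by line so it is never loaded into memory at once."""
    with open(path, errors="replace") as handle:
        return failed_gtest_names(handle)
```

Calling failed_gtest_names_in_file() with the path of the EasyBuild log would list each failed test once, which gives the names to search for when looking for the detailed output.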
If you would like to help by analyzing the logfile, I can gzip it and send you a URL.
Thanks,
Ole
FYI: Setting EASYBUILD_TMPDIR to a large directory is not required.
Temporary files are usually small.
On 27.05.21 at 13:02, Ole Holm Nielsen wrote:
On 5/27/21 10:46 AM, Alexander Grund wrote:
> Alexander: should we look for patterns like "No space left on
device" in the Bazel output and highlight them better, perhaps with a
concrete suggestion to use --tmpdir to avoid the usage of /tmp?
We could in general put something into EasyBuild, yes. I started a PR
with enhanced error parsing which could maybe be used for that.
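As an illustration of the kind of check discussed here, a minimal sketch follows. It is not the code from the PR and not an existing EasyBuild function; the name disk_full_hint and the wording of the hint are assumptions made for this example.

```python
from typing import Iterable, Optional

# Error text emitted when a write fails because the filesystem is full.
DISK_FULL_MARKER = "No space left on device"


def disk_full_hint(output_lines: Iterable[str]) -> Optional[str]:
    """Return a hint for the first line reporting a full filesystem, else None."""
    for number, line in enumerate(output_lines, start=1):
        if DISK_FULL_MARKER in line:
            return (
                f"line {number}: '{DISK_FULL_MARKER}' found; "
                "consider pointing --tmpdir at a larger filesystem than /tmp"
            )
    return None
```

The function accepts any iterable of lines, so it could be fed an open log file or captured command output; it stops at the first match, on the assumption that one hint per build is enough.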
I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm (94 GB size)
and tried to build TensorFlow:
$ eb TensorFlow-2.4.1-fosscuda-2020b.eb
--cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules
== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory:
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300
chars): At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(took 55 min 27 sec)
== Results of the build can be found in the log file(s)
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb
failed (err: 'build failed (first 300 chars): At least 2 gpu tests
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')
...
Is there anything else I should look for in the logfile (size: 234 MB)?