On 5/27/21 10:46 AM, Alexander Grund wrote:
>> Alexandre: should we look for patterns like "No space left on device"
>> in the Bazel output and highlight them better, perhaps with a concrete
>> suggestion to use --tmpdir to avoid the usage of /tmp?
> We could in general put something into EasyBuild, yes. I started a PR with
> enhanced error parsing which could maybe be used for that.
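
To illustrate the idea quoted above, here is a minimal sketch of such a pattern check. It is only an illustration under my own assumptions: the names KNOWN_ERROR_HINTS and find_hints are hypothetical, they do not exist in EasyBuild, and this is not the code from the PR mentioned above.

```python
import re

# Hypothetical table of (pattern, hint) pairs; not taken from EasyBuild.
KNOWN_ERROR_HINTS = [
    (
        re.compile(r"No space left on device"),
        "A filesystem filled up during the build; consider pointing "
        "--tmpdir at a location with more free space than /tmp.",
    ),
]


def find_hints(output):
    """Return the hints whose pattern occurs anywhere in the command output."""
    return [hint for pattern, hint in KNOWN_ERROR_HINTS if pattern.search(output)]
```

The hints could then be printed next to the "build failed" message, so the likely cause is visible without opening the log.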
I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm (94 GB size)
and tried to build TensorFlow:
$ eb TensorFlow-2.4.1-fosscuda-2020b.eb --cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules
== installing extension TensorFlow 2.4.1 (28/28)...
== configuring...
== building...
== testing...
== FAILED: Installation ended unsuccessfully (build directory:
/dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 chars):
At least 2 gpu tests failed:
//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(took 55 min 27 sec)
== Results of the build can be found in the log file(s)
/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of
/home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb
failed (err: 'build failed (first 300 chars): At least 2 gpu tests
failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test,
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')
In the logfile I see multiple FAILED tests:
$ grep FAILED /scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
FAILED:
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(Summary)
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
//tensorflow/core/common_runtime/gpu:gpu_device_test
FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
FAILED in 3 out of 3 in 3.5s
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79
ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323
ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128
ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
FAILED:
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(Summary)
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40
ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158
ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77
ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
//tensorflow/core/common_runtime/gpu:gpu_device_test
FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
FAILED in 3 out of 3 in 3.5s)
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
FAILED:
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
(Summary)
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
INFO: Build completed, 2 tests FAILED, 1912 total actions
//tensorflow/core/common_runtime/gpu:gpu_device_test
FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu
FAILED in 3 out of 3 in 3.5s
INFO: Build completed, 2 tests FAILED, 1912 total actions
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[ FAILED ] 1 test, listed below:
[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
1 FAILED TEST
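
Since the grep output above is highly repetitive, a small script can condense it. The following is a minimal sketch, assuming the log contains gtest-style "[ FAILED ] Suite.Test" lines and Bazel target labels as shown above. The function names summarize_failures and summarize_log are my own and not part of EasyBuild or Bazel.

```python
import re
from collections import Counter

# gtest-style failure line, e.g.
# "[ FAILED ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)"
GTEST_FAILED = re.compile(r"\[\s*FAILED\s*\]\s+(\w+\.\w+)")
# Bazel target label, e.g. "//tensorflow/core/common_runtime/gpu:gpu_device_test"
BAZEL_TARGET = re.compile(r"//[\w/.+-]+:[\w.+-]+")


def summarize_failures(lines):
    """Count distinct failing test cases and Bazel targets on FAILED lines."""
    tests, targets = Counter(), Counter()
    for line in lines:
        if "FAILED" not in line:
            continue
        match = GTEST_FAILED.search(line)
        if match:
            tests[match.group(1)] += 1
        for label in BAZEL_TARGET.findall(line):
            targets[label] += 1
    return tests, targets


def summarize_log(path):
    """Stream a (possibly very large) log file line by line and summarize it."""
    with open(path, errors="replace") as handle:
        return summarize_failures(handle)
```

Calling summarize_log() on the log file path reads the file line by line, so the 234 MB size should not be a problem. For the output above I would expect it to report a single distinct failing test case, GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority.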
Is there anything else I should look for in the logfile (size: 234 MB)?
Thanks,
Ole