On 5/27/21 10:46 AM, Alexander Grund wrote:
> Alexandre: should we look for patterns like "No space left on device" in the Bazel output and highlight them better, perhaps with a concrete suggestion to use --tmpdir to avoid the usage of /tmp?

We could in general put something into EasyBuild, yes. I started a PR with enhanced error parsing which could maybe be used for that.
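As a rough illustration of the kind of pattern matching suggested above, here is a minimal sketch. It is not EasyBuild's actual error parsing and not the code from the PR mentioned; the `KNOWN_ERRORS` table and the `find_hints` helper are hypothetical names made up for this example. It only shows how known messages such as "No space left on device" could be paired with a concrete suggestion.

```python
import re

# Hypothetical table of known error patterns and the hint to show for each.
# The single entry mirrors the example from the question above.
KNOWN_ERRORS = [
    (
        re.compile(r"No space left on device"),
        "A filesystem filled up during the build; consider using --tmpdir "
        "to point at a larger location than /tmp.",
    ),
]


def find_hints(output):
    """Return (line_number, line, hint) for every line matching a known pattern."""
    hits = []
    for lineno, line in enumerate(output.splitlines(), start=1):
        for pattern, hint in KNOWN_ERRORS:
            if pattern.search(line):
                hits.append((lineno, line.strip(), hint))
    return hits
```

Calling `find_hints(bazel_output)` on the captured command output would then give a short list of matches that could be printed ahead of the generic "build failed" message.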

I've configured some larger temporary file spaces:
EASYBUILD_TMPDIR=/scratch/modules  (800+ GB available)
EASYBUILD_BUILDPATH=/dev/shm   (94 GB size)

and then tried to build TensorFlow:

$ eb TensorFlow-2.4.1-fosscuda-2020b.eb --cuda-compute-capabilities=8.0,8.6 --tmpdir=/scratch/modules

== installing extension TensorFlow 2.4.1 (28/28)...
==      configuring...
==      building...
==      testing...
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/TensorFlow/2.4.1/fosscuda-2020b): build failed (first 300 chars): At least 2 gpu tests failed: //tensorflow/core/common_runtime/gpu:gpu_device_test, //tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu (took 55 min 27 sec)
== Results of the build can be found in the log file(s) /scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
ERROR: Build of /home/modules/software/EasyBuild/4.3.4/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.4.1-fosscuda-2020b.eb failed (err: 'build failed (first 300 chars): At least 2 gpu tests failed:\n//tensorflow/core/common_runtime/gpu:gpu_device_test, //tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu')

In the logfile I see multiple FAILED tests:

$ grep FAILED /scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu (Summary)
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
//tensorflow/core/common_runtime/gpu:gpu_device_test FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu FAILED in 3 out of 3 in 3.5s
        FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)
        [  FAILED  ] 1 test, listed below:
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
         1 FAILED TEST
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 ms)
        [  FAILED  ] 1 test, listed below:
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
         1 FAILED TEST
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
        [  FAILED  ] 1 test, listed below:
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
         1 FAILED TEST
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu (Summary)
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 ms)
        [  FAILED  ] 1 test, listed below:
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
         1 FAILED TEST
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 ms)
        [  FAILED  ] 1 test, listed below:
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
         1 FAILED TEST
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
        [  FAILED  ] 1 test, listed below:
        [  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
         1 FAILED TEST
//tensorflow/core/common_runtime/gpu:gpu_device_test FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu FAILED in 3 out of 3 in 3.5s)
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_test (Summary)
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (79 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (323 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
FAILED: //tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu (Summary)
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (40 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (158 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
INFO: Build completed, 2 tests FAILED, 1912 total actions
//tensorflow/core/common_runtime/gpu:gpu_device_test FAILED in 3 out of 3 in 4.8s
//tensorflow/core/common_runtime/gpu:gpu_device_unified_memory_test_gpu FAILED in 3 out of 3 in 3.5s
INFO: Build completed, 2 tests FAILED, 1912 total actions
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (77 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority (128 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] GPUDeviceTest.SingleVirtualDeviceWithInvalidPriority
 1 FAILED TEST
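
Because the same failure is repeated many times in the grep output above, a small script can condense it. The following is only a sketch under stated assumptions: the two regular expressions are modelled on the lines shown above (the "FAILED: ... (Summary)" target lines and the "[  FAILED  ] Suite.Case (N ms)" lines), and the function names `summarize_failures` and `summarize_log` are hypothetical. It reads the log line by line, so the 234 MB file does not have to be loaded into memory.

```python
import re
from collections import Counter

# Patterns modelled on the log lines shown above; leading whitespace is optional
# because the same lines also appear indented inside the error message.
TARGET_RE = re.compile(r"^\s*FAILED: (//\S+) \(Summary\)")
CASE_RE = re.compile(r"^\s*\[  FAILED  \] (\w+\.\w+) \(\d+ ms\)")


def summarize_failures(lines):
    """Count how often each failed test target and each failed test case is reported."""
    targets = Counter()
    cases = Counter()
    for line in lines:
        match = TARGET_RE.match(line)
        if match:
            targets[match.group(1)] += 1
            continue
        match = CASE_RE.match(line)
        if match:
            cases[match.group(1)] += 1
    return targets, cases


def summarize_log(path):
    """Stream a log file from disk and summarize the failures it reports."""
    with open(path, errors="replace") as handle:
        return summarize_failures(handle)
```

For example, `targets, cases = summarize_log("/scratch/modules/eb-3l5Ptk/easybuild-TensorFlow-2.4.1-20210527.114011.EmOkP.log")` followed by printing `targets` and `cases` would list each distinct failing target and test case once, with a count of how often it was reported.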


Is there anything else I should look for in the logfile (size: 234 MB)?

Thanks,
Ole

