areusch commented on issue #7995: URL: https://github.com/apache/tvm/issues/7995#issuecomment-845368725
Apologies for the light updates here. We determined that the "Frontend : GPU" tests get into a state where, after a while, either the GPU hardware is inaccessible or TVM's existence check is wrong. Since we didn't change the CUDA version used here (we just updated to 18.04), the theory is that there is some interoperability problem between CUDA running in the containers (at 10.0) and the CUDA driver loaded on the docker host side (either 10.2 or 11.0, depending on which CI node you hit).

@tkonolige and I have spent the last couple of days running on a test TVM CI cluster using the same AMI (which has only CUDA 11.0). With CUDA 10.0 (ci-gpu) and 11.0 (host), we ran into another similar-looking bug during the GPU unit tests:

```
[ RUN      ] BuildModule.Heterogeneous
[22:11:58] /workspace/src/target/opt/build_cuda_on.cc:89: Warning: cannot detect compute capability from your device, fall back to compute_30.
unknown file: Failure
C++ exception with description "[22:11:58] /workspace/src/runtime/cuda/cuda_device_api.cc:117:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: no CUDA-capable device is detected
```

We then upgraded ci-gpu to use CUDA 11.0, and the run seemed to pass all the way to the end of the GPU integration tests, modulo a tolerance issue:

```
tests/python/contrib/test_cublas.py::test_batch_matmul FAILED
// ...
>       np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol, verbose=True)
E       AssertionError:
E       Not equal to tolerance rtol=1e-05, atol=1e-07
E
E       Mismatched elements: 2875175 / 3866624 (74.4%)
E       Max absolute difference: 0.00541687
E       Max relative difference: 0.00015383
E        x: array([[[29.647408, 31.88966 , 33.90233 , ..., 34.673954, 32.908764,
E                 31.219051],
E                [30.993076, 30.78019 , 33.67124 , ..., 36.1395 , 29.176218,...
E        y: array([[[29.646427, 31.889557, 33.900528, ..., 34.673126, 32.90791 ,
E                 31.21726 ],
E                [30.991737, 30.780437, 33.67001 , ..., 36.139397, 29.174744,...
```

We'll try to push this CUDA 11.0 ci-gpu container through the test CI cluster to see how far we can get. Feel free to comment if there are concerns about updating to CUDA 11.0.
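To make the tolerance failure concrete, here is a minimal sketch of the per-element check that `np.testing.assert_allclose` applies, `|actual - desired| <= atol + rtol * |desired|`, run on the first three values printed in the log. The array names and the loosened `rtol=1e-3` are assumptions for illustration only, not the tolerance the test uses or a proposed fix.

```python
import numpy as np

# First three values of x (actual) and y (desired) copied from the log above.
actual = np.array([29.647408, 31.88966, 33.90233])
desired = np.array([29.646427, 31.889557, 33.900528])

# Tolerances reported in the failure message.
rtol, atol = 1e-5, 1e-7

# Same criterion assert_allclose uses: an element mismatches when its
# absolute error exceeds atol + rtol * |desired|.
abs_err = np.abs(actual - desired)
allowed = atol + rtol * np.abs(desired)
mismatched = abs_err > allowed

print("absolute error:", abs_err)
print("allowed error: ", allowed)
print("mismatched:    ", mismatched)

# Hypothetical looser tolerance, chosen only for illustration; this call
# raises AssertionError if any element is still out of tolerance.
np.testing.assert_allclose(actual, desired, rtol=1e-3, atol=atol)
```

Because the criterion is per element, any `rtol` above the reported maximum relative difference (about 1.54e-4) would accept the run in the log. Whether loosening the tolerance is appropriate for the cuBLAS path is a separate question.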

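For the "no CUDA-capable device is detected" failure, one way to compare what the container and the host report is to query the CUDA runtime directly. The following is a rough sketch, assuming `libcudart` can be located through `ctypes.util.find_library`; the helper names `decode_cuda_version` and `probe_cuda` are hypothetical and are not part of TVM or the CI scripts. It relies only on the CUDA runtime API calls `cudaRuntimeGetVersion`, `cudaDriverGetVersion`, and `cudaGetDeviceCount`.

```python
import ctypes
import ctypes.util


def decode_cuda_version(value):
    """Split CUDA's integer version (1000 * major + 10 * minor) into a tuple."""
    return value // 1000, (value % 1000) // 10


def probe_cuda():
    """Return {name: (status, value)} for three runtime queries, or None
    when libcudart cannot be found or loaded (assumed lookup strategy)."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return None

    results = {}
    queries = (
        ("runtime_version", "cudaRuntimeGetVersion"),
        ("driver_version", "cudaDriverGetVersion"),
        ("device_count", "cudaGetDeviceCount"),
    )
    for key, symbol in queries:
        out = ctypes.c_int(0)
        # Each call returns a cudaError_t; 0 means cudaSuccess.
        status = getattr(lib, symbol)(ctypes.byref(out))
        results[key] = (status, out.value)
    return results


if __name__ == "__main__":
    info = probe_cuda()
    if info is None:
        print("libcudart not found; nothing to report")
    else:
        for key, (status, value) in info.items():
            shown = decode_cuda_version(value) if key.endswith("version") else value
            print(f"{key}: status={status} value={shown}")
```

Running this inside the ci-gpu container on different CI nodes would show whether the runtime and driver versions line up with the 10.0 versus 10.2/11.0 combinations described above, and whether the device count query itself returns an error status.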