[GitHub] [tvm] areusch commented on issue #7995: Update all ci- containers to reflect main

GitBox Thu, 20 May 2021 11:37:30 -0700


areusch commented on issue #7995:
URL: https://github.com/apache/tvm/issues/7995#issuecomment-845368725



   Apologies for the light updates here. We determined that the "Frontend : 
GPU" tests get into a state where either the GPU hardware becomes inaccessible 
after a while or TVM's existence check is wrong. Since we didn't change the CUDA 
version used here (we just updated to 18.04), the theory is that there is some 
interoperability problem between the CUDA running in the containers (10.0) and 
the CUDA driver loaded on the docker host side (either 10.2 or 11.0, depending 
on which CI node you hit).
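
For anyone trying to reproduce this, here is a minimal sketch of how the runtime/driver pairing could be inspected from inside a container. It assumes a loadable `libcudart.so` (the real file name may carry a version suffix), and the helper names `decode_cuda_version` and `query_cuda_versions` are hypothetical, not part of TVM. It only reports versions; it does not explain why a nominally compatible pairing would misbehave.

```python
import ctypes


def decode_cuda_version(v):
    """CUDA encodes versions as 1000*major + 10*minor, e.g. 10020 -> (10, 2)."""
    return v // 1000, (v % 1000) // 10


def query_cuda_versions(libname="libcudart.so"):
    """Return ((runtime major, minor), (driver major, minor)), or None on failure.

    `libname` is an assumption; adjust it to the library shipped in the image.
    """
    try:
        cudart = ctypes.CDLL(libname)
    except OSError:
        return None  # CUDA runtime library not found in this environment

    runtime = ctypes.c_int()
    driver = ctypes.c_int()
    # Both calls return cudaSuccess (0) on success.
    if cudart.cudaRuntimeGetVersion(ctypes.byref(runtime)) != 0:
        return None
    if cudart.cudaDriverGetVersion(ctypes.byref(driver)) != 0:
        return None
    return decode_cuda_version(runtime.value), decode_cuda_version(driver.value)


if __name__ == "__main__":
    print(query_cuda_versions())
```

Running this in ci-gpu on each CI node would show which container/host combination a failing job actually landed on.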
   
   @tkonolige and I have spent the last couple of days running on a test TVM 
CI cluster using the same AMI (which has only CUDA 11.0). With CUDA 10.0 
(ci-gpu) and 11.0 (host), we ran into another similar-looking bug during the 
GPU unit tests:
   ```
   [ RUN      ] BuildModule.Heterogeneous
   [22:11:58] /workspace/src/target/opt/build_cuda_on.cc:89: Warning: cannot detect compute capability from your device, fall back to compute_30.
   unknown file: Failure
   C++ exception with description "[22:11:58] /workspace/src/runtime/cuda/cuda_device_api.cc:117: 
   ---------------------------------------------------------------
   An error occurred during the execution of TVM.
   For more information, please see: https://tvm.apache.org/docs/errors.html
   ---------------------------------------------------------------
     Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: no CUDA-capable device is detected
   ```
   
   We then upgraded ci-gpu to use CUDA 11.0, and this test seemed to pass all 
the way to the end of the GPU integration tests, modulo a tolerance issue:
   ```
   tests/python/contrib/test_cublas.py::test_batch_matmul FAILED
   // ...
   >       np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol, verbose=True)
   E       AssertionError: 
   E       Not equal to tolerance rtol=1e-05, atol=1e-07
   E       
   E       Mismatched elements: 2875175 / 3866624 (74.4%)
   E       Max absolute difference: 0.00541687
   E       Max relative difference: 0.00015383
   E        x: array([[[29.647408, 31.88966 , 33.90233 , ..., 34.673954, 32.908764,
   E                31.219051],
   E               [30.993076, 30.78019 , 33.67124 , ..., 36.1395  , 29.176218,...
   E        y: array([[[29.646427, 31.889557, 33.900528, ..., 34.673126, 32.90791 ,
   E                31.21726 ],
   E               [30.991737, 30.780437, 33.67001 , ..., 36.139397, 29.174744,...
   ```
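
To make the size of the mismatch concrete, here is a small sketch that replays the check on the first three elements of `x` and `y` from the output above. `np.testing.assert_allclose` accepts a pair when `|actual - desired| <= atol + rtol * |desired|`. The helper `within_tolerance` and the looser `rtol=1e-3` are illustrative assumptions only, not a proposed value for the test.

```python
import numpy as np


def within_tolerance(actual, desired, rtol, atol=1e-7):
    """Return True if assert_allclose accepts the pair, False otherwise."""
    try:
        np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol)
        return True
    except AssertionError:
        return False


# First three elements of x and y, copied from the failure output.
x = np.array([29.647408, 31.88966, 33.90233], dtype="float32")
y = np.array([29.646427, 31.889557, 33.900528], dtype="float32")

# Tolerance used by the failing test.
print(within_tolerance(x, y, rtol=1e-5))
# Hypothetical looser tolerance, for comparison only.
print(within_tolerance(x, y, rtol=1e-3))
```

With the reported maximum relative difference of about 1.5e-4, the values sit above the test's `rtol=1e-05` but well below a tolerance such as 1e-3, which is consistent with a precision difference rather than a wrong result.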
   
   We'll try to push this CUDA 11.0 ci-gpu container through the test CI 
cluster to see how far we can get. Feel free to comment if there are concerns 
about updating to CUDA 11.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
