https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

--- Comment #19 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
I want to note that I find it a serious implementation problem if a program
encounters errors like this:

========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid
device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetGlobal_v2.

and nevertheless continues to run but prints out wrong numbers.

I have now seen that people ran into similar errors with PyTorch and TensorFlow
when the GPU was too new for the implementation, or too old...

https://discuss.pytorch.org/t/pytorch-cuda-returns-error-500-named-symbol-not-found/218501

But in those cases the program itself reports CUDA_ERROR_NOT_FOUND (error 500)
and execution stops.


In the case above, I just see erroneous numbers from a matrix multiplication,
and execution continues without any error.


Imagine this were not a matrix multiplication, but an application a doctor uses
to inspect medical images for breast cancer.

The doctor installs a shiny new GPU in his computer and expects the software to
run faster...

You cannot expect users to run compute-sanitizer for every GPU/software/CUDA
combination before running their programs.

If something like CUDA_ERROR_NOT_FOUND (error 500) occurs, the program should
just stop.

Or it should not even run when the generated code is not 100% compatible with
the hardware. This check should be included in the OpenMP runtime: it needs to
verify whether the compiled binary supports the GPU and, if not, refuse to run
the application.
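For completeness: the OpenMP specification already offers a related, if partial, safeguard. Setting OMP_TARGET_OFFLOAD=MANDATORY requires the runtime (including libgomp) to terminate the program when a target region cannot be offloaded to a device, instead of silently falling back to the host. It does not, however, catch the silent-wrong-results case described above, where offloading "succeeds" but individual driver calls fail. (./my_app below is a placeholder name.)

```shell
# OpenMP 5.0+ environment variable, supported by GCC's libgomp:
# with MANDATORY, the runtime must abort if offloading fails,
# rather than silently falling back to host execution.
export OMP_TARGET_OFFLOAD=MANDATORY
./my_app   # placeholder for the offloading application
```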
