https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280

--- Comment #19 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
I want to note that I find it a serious implementation problem if a program
encounters errors like this:

========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid
device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetGlobal_v2.

and nevertheless continues to run but prints out wrong numbers.

I have now seen that people ran into similar errors with PyTorch and TensorFlow
when the GPU was too new for the implementation, or too old...

https://discuss.pytorch.org/t/pytorch-cuda-returns-error-500-named-symbol-not-found/218501

But in those cases the program itself reports CUDA_ERROR_NOT_FOUND (error 500)
and execution stops.


In the case above, I just see erroneous numbers from a matrix multiplication,
and execution continues without any error.


Imagine this were not a matrix multiplication, but an application a doctor uses
to inspect medical images for breast cancer.

The doctor installs a shiny new GPU in his computer and expects the software to
run faster...

You cannot expect users to run compute-sanitizer for every GPU/software/CUDA
combination before running their programs.

If something like CUDA_ERROR_NOT_FOUND (error 500) occurs, the program should
just stop.

Or it should not even run when the generated code is not 100% compatible with
the hardware. This check should be included in the OpenMP runtime: it needs to
verify whether the compiled binary supports the GPU and, if not, refuse to run
the application.
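For completeness: the OpenMP specification already offers a related, if partial, safeguard. Setting OMP_TARGET_OFFLOAD=MANDATORY requires the runtime (including libgomp) to terminate the program when a target region cannot be offloaded to a device, instead of silently falling back to the host. It does not, however, catch the silent-wrong-results case described above, where offloading "succeeds" but individual driver calls fail. (./my_app below is a placeholder name.)

```shell
# OpenMP 5.0+ environment variable, supported by GCC's libgomp:
# with MANDATORY, the runtime must abort if offloading fails,
# rather than silently falling back to host execution.
export OMP_TARGET_OFFLOAD=MANDATORY
./my_app   # placeholder for the offloading application
```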
