https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122783
Bug ID: 122783
Summary: gcc 15,16 refuses to compile correctly with nvptx
target blackwell cards after drivers with cuda-13 were
installed (result of an abi change in cuda-13)
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---
First, I have installed
x11-drivers/nvidia-drivers-580.95.05:0/580::gentoo
dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo
sys-kernel/gentoo-kernel-6.17.8:6.17.8::gentoo
nvidia-smi says:
NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
NVIDIA GeForce RTX 5060 Ti
nvptx-arch:
sm_120
Also, I have compiled an offload gcc.
I compile the following code with
gcc -fopenmp -foffload=nvptx-none -save-temps -foffload-options=-march=sm_89
-fno-stack-protector -o gpu_compiler_test ./main.cpp
I use sm_89, since higher sm values do not currently seem to be supported by gcc 16.
---------------------
#include "omp.h"
int main(int argc, char** argv){
    int i=0;
    #pragma omp target map(tofrom:i)
    {
        i=i+1;
    }
    return 0;
}
-----------------
This compiles, and also seems to run, without giving an error!
(Severe errors would only appear with code that explicitly accesses
arrays allocated with omp_target_alloc.)
The errors only appear when checking for CUDA problems. Running
compute-sanitizer --tool memcheck ./gpu_compiler_test
shows this:
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid
device context" on CUDA API call to cuCtxGetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: GOMP_OFFLOAD_init_device in plugin-nvptx.c:1371
[0x5677] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_init_device in target.c:3000 [0x3fc05] in
libgomp.so.1
========= Host Frame: resolve_device in target.c:190 [0x3fdf4] in
libgomp.so.1
========= Host Frame: GOMP_target_ext in target.c:3279 [0x4067e] in
libgomp.so.1
========= Host Frame: main [0x1230] in gpu_compiler_test
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetGlobal_v2.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1728
[0x7688] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
========= Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
========= Host Frame: resolve_device in target.c:190 [0x3fdf4] in
libgomp.so.1
========= Host Frame: GOMP_target_ext in target.c:3279 [0x4067e] in
libgomp.so.1
========= Host Frame: main [0x1230] in gpu_compiler_test
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetGlobal_v2.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1800
[0x76d7] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
========= Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
========= Host Frame: resolve_device in target.c:190 [0x3fdf4] in
libgomp.so.1
========= Host Frame: GOMP_target_ext in target.c:3279 [0x4067e] in
libgomp.so.1
========= Host Frame: main [0x1230] in gpu_compiler_test
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetFunction.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1464
[0x3cdb] in libgomp-plugin-nvptx.so.1
========= Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1802
[0x7739] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
========= Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
========= Host Frame: resolve_device in target.c:190 [0x3fdf4] in
libgomp.so.1
========= Host Frame: GOMP_target_ext in target.c:3279 [0x4067e] in
libgomp.so.1
========= Host Frame: main [0x1230] in gpu_compiler_test
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetFunction.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1482
[0x3ef6] in libgomp-plugin-nvptx.so.1
========= Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1802
[0x7739] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_load_image_to_device in target.c:2608
[0x34100] in libgomp.so.1
========= Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in
libgomp.so.1
========= Host Frame: resolve_device in target.c:190 [0x3fdf4] in
libgomp.so.1
========= Host Frame: GOMP_target_ext in target.c:3279 [0x4067e] in
libgomp.so.1
========= Host Frame: main [0x1230] in gpu_compiler_test
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetFunction.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1464
[0x3cdb] in libgomp-plugin-nvptx.so.1
========= Host Frame: GOMP_OFFLOAD_fini_device in plugin-nvptx.c:1407
[0x6042] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_target_fini in target.c:6104 [0x32ace] in
libgomp.so.1
========= Host Frame: [0x471d0] in libc.so.6
========= Host Frame: exit [0x472af] in libc.so.6
========= Host Frame: [0x27611] in libc.so.6
========= Host Frame: __libc_start_main [0x276b9] in libc.so.6
========= Host Frame: _start [0x10c4] in gpu_compiler_test
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not
found" on CUDA API call to cuModuleGetFunction.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1482
[0x3ef6] in libgomp-plugin-nvptx.so.1
========= Host Frame: GOMP_OFFLOAD_fini_device in plugin-nvptx.c:1407
[0x6042] in libgomp-plugin-nvptx.so.1
========= Host Frame: gomp_target_fini in target.c:6104 [0x32ace] in
libgomp.so.1
========= Host Frame: [0x471d0] in libc.so.6
========= Host Frame: exit [0x472af] in libc.so.6
========= Host Frame: [0x27611] in libc.so.6
========= Host Frame: __libc_start_main [0x276b9] in libc.so.6
========= Host Frame: _start [0x10c4] in gpu_compiler_test
=========
========= ERROR SUMMARY: 7 errors
Remarkably, it does not show an error on the console. If there were a loop in
the target region, it would even print some computed numbers, because of the
silent host fallback.
In a matrix multiplication, the results would differ depending on whether one
uses a single target teams distribute parallel for collapse(2) construct, or a
target teams distribute for the outer loop with a parallel for for the inner
loop.
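To make that remark concrete, here is a hedged sketch (my own minimal reconstruction, not my original benchmark) of the two loop forms whose results diverged on the broken setup; on a correct toolchain both must produce identical matrices.

```c
#include <omp.h>

#define N 32

/* Variant 1: one combined construct over both loops. */
void matmul_collapsed(const double *a, const double *b, double *c)
{
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:N*N], b[0:N*N]) map(from: c[0:N*N])
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
}

/* Variant 2: teams distribute over the outer loop, parallel for inside. */
void matmul_split(const double *a, const double *b, double *c)
{
    #pragma omp target teams distribute \
        map(to: a[0:N*N], b[0:N*N]) map(from: c[0:N*N])
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
    }
}
```

Both variants perform the inner k-summation in the same order, so their outputs must match element for element; a divergence between them indicates the scheduling/offload bug, not floating-point noise.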
And libgomp would crash if memory allocated by omp_target_alloc is accessed.
But the CUDA errors only show up on an explicit check, which may be dangerous
for unwitting users.
This may be connected to an ABI change in CUDA 13, which gets installed with
the new driver. Clang solved this by adding new entries for every device in
CUDA 13:
https://github.com/llvm/llvm-project/commit/dffd7f3d9a3294d21205251b986e76ec841cc750
After that fix for CUDA 13, Clang of course compiles the example above, and it
runs on the target GPU without any (CUDA) error and with visible kernels.
I am now closing the other two bugs related to this problem, where I did not
recognize that the failure was not in a particular construct but that
offloading did not work at all.