https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122281
--- Comment #20 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 62844
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62844&action=edit
compute-sanitizer-log.txt

Hi there,

for the sparsetests.cpp file, I want to note that there is a dmesg message on my GPU. When I run it, I get:

[ 4841.758732] perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79200
[21014.299014] NVRM: GPU at PCI:0000:2d:00: GPU-959d8d20-bc99-9be9-c49f-adc2913c511c
[21014.299020] NVRM: GPU Board Serial Number: 0
[21014.299022] NVRM: Xid (PCI:0000:2d:00): 31, pid=25568, name=sparsetests, channel 0x00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_5 faulted @ 0x7fff_7be14000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

When compiling with clang, I get no such message and the application runs fine.

When I run mathdemonstrations.cpp, I get a similar error with gcc:

[21456.631660] NVRM: Xid (PCI:0000:2d:00): 31, pid=26607, name=mathdemonstrati, channel 0x00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_1 faulted @ 0x7ffd_91678000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

With clang, I also do not get such a message, and the application runs fine.

Running compute-sanitizer --tool memcheck ./sparsetests shows no errors with the clang-generated binary.

My CUDA setup is:

dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo
x11-drivers/nvidia-drivers-580.95.05:0/580::gentoo
sys-kernel/gentoo-kernel-6.17.8:6.17.8::gentoo

If I run compute-sanitizer --tool memcheck on ./sparsetests with the gcc-generated binary, I get errors like this:

now an example with sparse matrx multiplication and the mdspan class
of course we offload the data first to device

========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: GOMP_OFFLOAD_init_device in plugin-nvptx.c:1371 [0x5677] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_init_device in target.c:3000 [0x3fc05] in libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in libgomp.so.1
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int) [0xb173] in sparsetests
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&, int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long, std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in sparsetests
=========         Host Frame: main [0x465c] in sparsetests
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuModuleGetGlobal_v2.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1800 [0x76d7] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_load_image_to_device in target.c:2608 [0x34100] in libgomp.so.1
=========         Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in libgomp.so.1
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int) [0xb173] in sparsetests
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&, int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long, std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in sparsetests
=========         Host Frame: main [0x465c] in sparsetests
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuModuleGetFunction.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1464 [0x3cdb] in libgomp-plugin-nvptx.so.1
=========         Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1802 [0x7739] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_load_image_to_device in target.c:2608 [0x34100] in libgomp.so.1
=========         Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in libgomp.so.1
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int) [0xb173] in sparsetests
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&, int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long, std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in sparsetests
=========         Host Frame: main [0x465c] in sparsetests
=========
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuModuleGetFunction.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: nvptx_do_global_cdtors in plugin-nvptx.c:1482 [0x3ef6] in libgomp-plugin-nvptx.so.1
=========         Host Frame: GOMP_OFFLOAD_load_image in plugin-nvptx.c:1802 [0x7739] in libgomp-plugin-nvptx.so.1
=========         Host Frame: gomp_load_image_to_device in target.c:2608 [0x34100] in libgomp.so.1
=========         Host Frame: gomp_init_device in target.c:3011 [0x3fc77] in libgomp.so.1
=========         Host Frame: resolve_device in target.c:190 [0x3ffd4] in libgomp.so.1
=========         Host Frame: omp_target_alloc in target.c:4667 [0x44fc2] in libgomp.so.1
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::alloc_device_ptr(unsigned long, int) [0xb173] in sparsetests
=========         Host Frame: DataBlock_GPU_Memory_Functions<double>::copy_data_to_device_set_devptr(DataBlock<double>&, int) [0x96a5] in sparsetests
=========         Host Frame: mdspan<double, std::vector<unsigned long, std::allocator<unsigned long> > >::device_data_upload(bool, int) [0x8045] in sparsetests
=========         Host Frame: main [0x465c] in sparsetests
=========

and

========= Invalid __global__ read of size 8 bytes
=========     at [clone BlockedDataView<double>::build_blocks_rank2(unsigned long, unsigned long, bool)] _omp_fn$1+0x1530
=========     by thread (0,1,0) in block (0,0,0)
=========     Access at 0x7ffeceecd760 is out of bounds
=========     and is 243.440.342.369 bytes after the nearest allocation at 0x7fc620c00000 of size 512 bytes
=========     Device Frame: gomp_nvptx_main+0x1030 in team.c:135
=========     Device Frame: [clone BlockedDataView<double>::build_blocks_rank2(unsigned long, unsigned long, bool)] _omp_fn$0+0x240
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cuLaunchKernel [0x39d6c4] in libcuda.so.1
=========         Host Frame: GOMP_OFFLOAD_run in plugin-nvptx.c:2965 [0xa2d9] in libgomp-plugin-nvptx.so.1
=========         Host Frame: GOMP_target_ext in target.c:3544 [0x4132d] in libgomp.so.1
=========         Host Frame:
BlockedDataView<double>::build_blocks_rank2(unsigned long, unsigned long, bool) [0xa3be] in sparsetests
=========         Host Frame: BlockedDataView<double>::build_blocks(unsigned long const*, bool) [0x8f0a] in sparsetests
=========         Host Frame: BlockedDataView<double>::BlockedDataView(DataBlock<double> const&, unsigned long const*, bool) [0x7447] in sparsetests
=========         Host Frame: main [0x4705] in sparsetests

Since clang compiles this fine and I can see CUDA kernels for the GPU, these appear to be gcc problems.

The out-of-bounds error is interesting. Why does it appear? If the out-of-bounds access were in my code, I should also get an out-of-bounds error with clang. I checked this against the source code: I can reserve the arrays pooled_offsets_flat and pooled_offsets_starts large enough to hold 2000 elements and still get this out-of-bounds error, even though the loop that indexes them only runs from 0 to 128. None of this seems to make sense, especially since compiling with clang does not show any such errors.
