https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122281
Benjamin Schulz <schulz.benjamin at googlemail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #62558|0 0 |1 1
is obsolete| |
--- Comment #13 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 62741
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62741&action=edit
Testprograms.tar.gz
Hi,
Unfortunately, on my system the problem did not go away!
I emerged the most recent gcc 16.0.9999 from Gentoo (which contains the current
git trunk). The following are the installed versions:
eselect gcc list
[1] nvptx-none-16 *
[2] x86_64-pc-linux-gnu-14
[3] x86_64-pc-linux-gnu-15
[4] x86_64-pc-linux-gnu-16 *
localhost /home/benni # gcc --version
gcc (Gentoo 16.0.9999 p, commit 4941111171c76cc7641513924a9313a02fc5f621)
16.0.0 20251109 (experimental) 9703ab271a157d944957e9d979b64337371b11c8
I ensured that the cross compiler is the same version as the host compiler
(i.e. from the same day in git).
This is my hardware and GPU driver:
nvidia-smi
Sun Nov 9 16:25:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:2D:00.0  On |                  N/A |
|  0%   50C    P8             12W /  180W |     642MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
I installed an older CUDA 12 toolkit, since I found that the gcc offload
compiler does not compile at all with CUDA 13.
emerge -pv nvidia-cuda-toolkit
These are the packages that would be merged, in order:
Calculating dependencies... done!
Dependency resolution took 3.95 s (backtrack: 0/20).
[ebuild R ] dev-util/nvidia-cuda-toolkit-12.9.1-r1:0/12.9.1::gentoo
USE="debugger examples nsight profiler rdma sanitizer -clang"
PYTHON_TARGETS="python3_13 -python3_11 -python3_12" 0 KiB
Unfortunately, if I run sparsetests, I get this output:
[[0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0],
[2, 2, 2, 2, 0, 0, 0, 0],
[3, 3, 3, 3, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]]
sparsity 0.8125
naive matrix multiplication
[[0, 0, 0, 0, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[24, 24, 24, 24, 0, 0, 0, 0],
[36, 36, 36, 36, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[6, 6, 6, 6, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[18, 18, 18, 18, 0, 0, 0, 0]]
We now do a sparse multiplication
[[0, 0, 0, 0, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[24, 24, 24, 24, 0, 0, 0, 0],
[36, 36, 36, 36, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[6, 6, 6, 6, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[18, 18, 18, 18, 0, 0, 0, 0]]
now an example with sparse matrx multiplication and the mdspan class
of course we offload the data first to device
sparsity 0.8125
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Process returned 1 (0x1) execution time : 0.293 s
Press ENTER to continue.
On my system, the function sparsity executes correctly on the device. What does
not work is this, in line 97 of sparsetests.cpp:
BlockedDataView<double> Ablocks1(Aspan, block_shape,true);
BlockedDataView<double> Bblocks2(Bspan, block_shape2,true);
This constructor calls the method which I cited in my first post.
It fails even though the memory was allocated correctly on the device (I
reserved the correct amount with omp_target_alloc).
I don't know why this fails with gcc.
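For reference, the allocation pattern described above can be sketched in isolation as follows. This is only a minimal sketch and not the code from the attached test programs: the function name fill_on_device is hypothetical, and the fallback branch for builds without OpenMP is an assumption added so that the snippet compiles either way. It allocates a buffer with omp_target_alloc, writes to it inside a target region through is_device_ptr, copies it back with omp_target_memcpy and frees it again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: fills n doubles with 2*i on the default device
// and returns a host copy. Returns an empty vector if allocation fails.
std::vector<double> fill_on_device(std::size_t n)
{
    std::vector<double> host(n, -1.0);
#ifdef _OPENMP
    const int dev = omp_get_default_device();
    const int host_dev = omp_get_initial_device();

    // Raw device allocation, not tracked by any map clause.
    double* d = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));
    if (d == nullptr)
        return {};

    // is_device_ptr tells the compiler that d already is a device address.
    #pragma omp target teams distribute parallel for is_device_ptr(d) device(dev)
    for (std::size_t i = 0; i < n; ++i)
        d[i] = 2.0 * static_cast<double>(i);

    // Explicit device-to-host copy; returns 0 on success.
    const int rc = omp_target_memcpy(host.data(), d, n * sizeof(double),
                                     0, 0, host_dev, dev);
    omp_target_free(d, dev);
    if (rc != 0)
        return {};
#else
    // Assumption: plain host fallback when compiled without OpenMP.
    for (std::size_t i = 0; i < n; ++i)
        host[i] = 2.0 * static_cast<double>(i);
#endif
    return host;
}
```

If a reduced case like this already fails with gcc on the same machine, the problem is probably independent of the BlockedDataView constructor; if it works, the constructor's own device code is the more likely place to look.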
When I compile it with clang, I get this:
[[0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0],
[2, 2, 2, 2, 0, 0, 0, 0],
[3, 3, 3, 3, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]]
sparsity 0.8125
naive matrix multiplication
[[0, 0, 0, 0, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[24, 24, 24, 24, 0, 0, 0, 0],
[36, 36, 36, 36, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[6, 6, 6, 6, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[18, 18, 18, 18, 0, 0, 0, 0]]
We now do a sparse multiplication
[[0, 0, 0, 0, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[24, 24, 24, 24, 0, 0, 0, 0],
[36, 36, 36, 36, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[6, 6, 6, 6, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[18, 18, 18, 18, 0, 0, 0, 0]]
now an example with sparse matrx multiplication and the mdspan class
of course we offload the data first to device
sparsity 0.8125
[[0, 0, 0, 0, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[24, 24, 24, 24, 0, 0, 0, 0],
[36, 36, 36, 36, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[6, 6, 6, 6, 0, 0, 0, 0],
[12, 12, 12, 12, 0, 0, 0, 0],
[18, 18, 18, 18, 0, 0, 0, 0]]
And no memory errors.
If I run the program mathdemonstrations, I get this one at the end when
compiled with gcc:
we can verify the cholesky decomposition by multiplication
Here we create the transpose with mdspan
[[210, -92, 68, -33, -34, -4, 118, -6],
[-92, 318, -100, 130, -153, -64, 160, 33],
[68, -100, 204, -96, 41, -69, -16, -26],
[-33, 130, -96, 338, -152, -51, 12, 22],
[-34, -153, 41, -152, 346, 11, -30, -25],
[-4, -64, -69, -51, 11, 175, -79, 5],
[118, 160, -16, 12, -30, -79, 320, 7],
[-6, 33, -26, 22, -25, 5, 7, 239]]
With the advanced algorithms on GPU
[[210, -92, 68, -33, -34, -4, 118, -6],
[-92, 318, -100, 130, -153, -64, 160, 33],
[68, -100, 204, -96, 41, -69, -16, -26],
[-33, 130, -96, 338, -152, -51, 12, 22],
[-34, -153, 41, -152, 346, 11, -30, -25],
[-4, -64, -69, -51, 11, 175, -79, 5],
[118, 160, -16, 12, -30, -79, 320, 7],
[-6, 33, -26, 22, -25, 5, 7, 239]]
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Process returned 1 (0x1) execution time : 0.901 s
With clang, after the Cholesky decomposition finishes with 3 algorithms, the
program does 3 LU decompositions with different algorithms on host and GPU,
then 3 QR decompositions with different algorithms on host and GPU, and then
exits. I don't know why I get these crashes with gcc.
It also does not seem to be that SIMT problem on my machine. I can remove the
simd pragma in the function sparsity of DataBlock.h. It slipped in there a bit
by accident anyway; I usually don't use if clauses with simd or simt. But if I
remove it, I still get these problems...
Whatever this is....
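To make the point about the simd pragma concrete, here is a hypothetical stand-in for such a sparsity function without any simd directive. It is not the code from DataBlock.h; the signature, the flat std::vector layout and the exact directive are assumptions. It counts the zero entries with a reduction on the device and returns their fraction.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sparsity function (fraction of zero entries).
// No simd directive is used; the count is a plain reduction.
double sparsity(const std::vector<double>& v)
{
    const std::size_t n = v.size();
    if (n == 0)
        return 0.0;

    const double* p = v.data();
    std::size_t zeros = 0;

    #pragma omp target teams distribute parallel for \
        reduction(+ : zeros) map(to : p[0:n]) map(tofrom : zeros)
    for (std::size_t i = 0; i < n; ++i)
        zeros += (p[i] == 0.0) ? 1 : 0;

    return static_cast<double>(zeros) / static_cast<double>(n);
}
```

For the 8x8 matrix printed at the top of the sparsetests output (12 nonzero entries out of 64), this gives 52/64 = 0.8125, the value shown in the log.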
By the way, I live in Munich. If you want, I could lend you my card for a
weekend so that we can find out what this is...
I actually find this more scary:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
This is a matrix multiplication with target teams parallel for collapse(2) on
the first two loops. On the host, collapse(2) is totally OK here, and with clang
it is also OK on the device. But with gcc, collapse(2) has the effect that nondeterminism
emerges and strange numbers are put out sometimes...
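The kind of loop nest meant here can be sketched as follows. This is a minimal sketch, not the reproducer attached to bug 122280: the function name, the row-major std::vector layout and the use of the combined construct target teams distribute parallel for are assumptions. Each (i, j) pair writes only its own element of C and accumulates into a local variable, so the result should not depend on how the collapsed iteration space is scheduled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: C = A * B for row-major n x n matrices,
// with the two outer loops collapsed and offloaded.
std::vector<double> matmul_collapse2(const std::vector<double>& A,
                                     const std::vector<double>& B,
                                     std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<double> C(n * n, 0.0);
    const double* a = A.data();
    const double* b = B.data();
    double* c = C.data();

    #pragma omp target teams distribute parallel for collapse(2) \
        map(to : a[0:n * n], b[0:n * n]) map(from : c[0:n * n])
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;  // local to this (i, j) iteration
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;  // each iteration writes a distinct element
        }
    return C;
}
```

Since no two iterations share a write target, any run-to-run difference in the output of such a loop would point at the compiler or runtime rather than at a data race in the source.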
My guess is that gcc has a problem with memory reservation under CUDA 13,
which the new driver supports.
The scary thing in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280 is that the code does not
even produce an error. Just nonsense values.
Clang always gets this right, and the GPU is rather new...
This speaks somewhat against a problem with my hardware...
Of course, it could be that my Blackwell GPU has suddenly become defective and
clang still produces working code with it...
--- Comment #14 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 62742
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62742&action=edit
Testprograms.tar.gz