https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122280
--- Comment #7 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
The correct output was produced with these CMake parameters for clang:
#Clang compiler flags
#SET (ENV{LIBRARY_PATH} "/usr/lib64/nvptx64-nvidia-cuda/:$ENV{LIBRARY_PATH}")
#SET (CMAKE_CXX_COMPILER /usr/lib/llvm/21/bin/clang++ CACHE STRING "C++ compiler" FORCE)
#SET (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall")
It gives a few warnings because of the simd clauses, but I don't care much as long as it delivers correct numbers and no compilation errors. I can see in Nvidia Nsight that it emits code on gpu for the target functions, now that the device mapper in clang has been fixed...
But well, let's go to the gcc output.
That's my gcc version:
eselect gcc list
[1] nvptx-none-16 *
[2] x86_64-pc-linux-gnu-14
[3] x86_64-pc-linux-gnu-15
[4] x86_64-pc-linux-gnu-16 *
localhost /home/benni # gcc --version
gcc (Gentoo 16.0.0_p20251026 p19) 16.0.0 20251026 (experimental)
and these are its parameters:
#GCC compiler flags.
SET (CMAKE_CXX_COMPILER "g++" CACHE STRING "C++ compiler" FORCE)
SET (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp -foffload=nvptx-none -fno-stack-protector -foffload-options=nvptx-none=-march=sm_89 -Wall")
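For reference, a minimal offload smoke test along these lines (just a reduced sketch I am assuming here, not code from my library) builds with exactly these flags and already exercises the nvptx toolchain:

// minimal_offload_check.cpp -- hypothetical reduced test, not from the library.
// Build (assumed): g++ -fopenmp -foffload=nvptx-none -fno-stack-protector
//                  -foffload-options=nvptx-none=-march=sm_89 minimal_offload_check.cpp
#include <cstdio>
#include <omp.h>

int main()
{
    const int n = 1 << 20;
    double *a = new double[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0;

    double sum = 0.0;
    // Offload a simple reduction; if the nvptx toolchain is broken,
    // libgomp errors would typically already show up here.
    #pragma omp target teams distribute parallel for reduction(+:sum) map(to: a[0:n])
    for (int i = 0; i < n; ++i)
        sum += a[i];

    std::printf("devices=%d sum=%f (expected %d)\n",
                omp_get_num_devices(), sum, n);
    delete[] a;
    return 0;
}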
I want to note that nvptx-arch reports
sm_120
but gcc complains when I want to compile for that architecture... hm....
This is the output.
As you can see, at first there is this nonsensical output for the matrix multiplication on gpu. As I noticed, it goes away if I split the collapsed loop nest
#pragma omp target teams distribute collapse(2)
for
for
in gpu_mathfunctions.h, in the function matrix_multiply_dot_g, into two separately annotated loops:
#pragma omp target teams distribute
for
#pragma omp parallel for
for
Note, however, that there is a shared clause for the tensors A, B, C, so it really should not matter.
Note that I can also use the collapse(2) clause in the matrix multiplication on the host, and there it has worked for ages...
The collapse(2) clause in the matrix multiplication would be a useful optimization. Say my matrix is 1 GB large and contains only 3 columns: splitting the loops so that the outer one becomes a teams distribute and the inner one a parallel for would then just waste parallelism. Also, clang compiles it correctly....
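To make the two variants concrete, here is a reduced sketch of the pattern on plain pointers (my own illustration with assumed map clauses, not the actual matrix_multiply_dot_g from gpu_mathfunctions.h, which works on the library's tensor types and carries a shared clause for A, B and C):

#include <cstddef>

// Variant 1: collapsed loop nest -- this is what gives wrong numbers with gcc here.
void matmul_collapsed(const double *A, const double *B, double *C,
                      std::size_t n, std::size_t k, std::size_t m)
{
    #pragma omp target teams distribute collapse(2) \
            map(to: A[0:n*k], B[0:k*m]) map(from: C[0:n*m])
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l)
                sum += A[i*k + l] * B[l*m + j];
            C[i*m + j] = sum;
        }
}

// Variant 2: teams distribute over rows plus parallel for over columns --
// the workaround that produces correct numbers with gcc.
void matmul_split(const double *A, const double *B, double *C,
                  std::size_t n, std::size_t k, std::size_t m)
{
    #pragma omp target teams distribute \
            map(to: A[0:n*k], B[0:k*m]) map(from: C[0:n*m])
    for (std::size_t i = 0; i < n; ++i) {
        #pragma omp parallel for
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l)
                sum += A[i*k + l] * B[l*m + j];
            C[i*m + j] = sum;
        }
    }
}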
And then, well, the code stops at the Cholesky decomposition. Where do these cuCtxSynchronize errors in the LU decomposition come from, then? They make no sense.
Here I want to note that I could swear this code ran in July on gcc 15, with my old gpu driver, my old kernel and the old cuda 12... (because at that time the mapper of clang did not work, so I developed the LU decomposition on gcc). Whatever went wrong there, I do not know.
This demonstrates basic mathematical abilities of the library on gpu, cpu and with the message passing interface.
We can also use a simplified interface for writing expressions, although evaluations of more than one operator are not yet supported.
define A
[[1, 2, 3],
[4, 5, 6]]
define B
[[6, 5, 4],
[3, 2, 1]]
addition of A and B
[[7, 7, 7],
[7, 7, 7]]
multiplication of A and transpose of B
[[28, 10],
[73, 28]]
Subtraction A - B. One can also assign the type later, as in this example, but E=A-B would also work here.
But here we set a policy to do this on gpu
[[-5, -3, -1],
[1, 3, 5]]
two vectors
[1, 2, 3]
[6, 5, 4]
a scalar product between two vectors
28
28
We define two matrices
The same code base can have the strides and extents on the heap (vector) or on the stack (array).
The library also works with column-major data, but in this example we define row-major data.
Ordinary matrix multiplication, forced on gpu with a policy object
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
[2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11],
[11, 9, 7, 5, 3, 1, 12, 10, 8, 6, 4, 2],
[3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10],
[10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3],
[4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9],
[9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10],
[5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12],
[12, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5],
[6, 1, 8, 3, 10, 5, 12, 7, 2, 9, 4, 11],
[11, 2, 9, 4, 12, 7, 3, 10, 5, 1, 8, 6]]
[[12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
[3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10],
[10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3],
[5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12],
[12, 9, 6, 3, 10, 7, 4, 1, 8, 5, 2, 11],
[2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11],
[11, 8, 5, 2, 9, 6, 3, 12, 7, 4, 1, 10],
[3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10],
[10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3],
[4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9],
[9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10]]
The header In_Kernel_mathfunctions executes math functions either on the host or can run them in parallel. The abbreviation v means just with simd, s means without parallel loops.
By default update_host is set to true. If one has several calculations on gpu, this may not be desired and can be switched to false.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
The header In_Kernel_mathfunctions executes math functions either on the host or can run them in parallel. The abbreviation w means with parallel for.
By default update_host is set to true. If one has several calculations on gpu, this may not be desired and can be switched to false.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
CPU_ONLY lets it multiply on CPU. GPU_ONLY executes on gpu. AUTO lets the library decide based on whether the data is already on gpu, the algorithm, and the data size.
Supplying nullptr instead of a pointer to Math_Functions_Policy lets the library use a global default that can be configured.
By default update_host is set to true. If one has several calculations on gpu, this may not be desired and can be switched to false.
[[529, 529, 422, 422, 648, 648, 438, 438, 401, 401, 689, 689],
[550, 550, 488, 488, 548, 548, 459, 459, 530, 431, 637, 637],
[564, 564, 415, 415, 607, 607, 382, 382, 399, 399, 689, 689],
[515, 515, 503, 998, 589, 589, 515, 515, 433, 433, 457, 637],
[508, 508, 395, 395, 631, 631, 397, 456, 449, 449, 663, 663],
[571, 571, 515, 515, 565, 565, 441, 441, 383, 383, 663, 663],
[500, 1030, 476, 476, 413, 413, 517, 517, 382, 382, 754, 754],
[537, 537, 475, 475, 539, 539, 401, 401, 573, 573, 641, 641],
[473, 473, 915, 466, 648, 648, 438, 438, 596, 596, 697, 697],
[566, 566, 448, 448, 551, 551, 387, 387, 586, 586, 617, 617],
[548, 548, 484, 484, 640, 640, 442, 442, 598, 598, 677, 677],
[613, 613, 510, 510, 570, 570, 474, 474, 537, 537, 359, 359]]
We can also use the Strassen algorithm or its Winograd variant for the multiplication.
It may offload to gpu. With the Message Passing Interface enabled, it can do so in parallel; otherwise it offloads sequentially. The algorithm can also work entirely on device with device pointers to the data.
In auto mode, the following default thresholds are set in mathfunctions.h and can be changed for convenience:
max_problem_size_for_gpu; the size of the gpu memory, data larger than this is not offloaded
default_cubic_treshold = 256; the default number of elements at which matrices are auto offloaded in multiplication
default_square_treshold = 1000; the default number of elements at which matrices are auto offloaded for addition
default_linear_treshold = 1000000; the default number of elements at which vectors are auto offloaded for addition
We now set it to gpu and set the size at which to stop recursion to 2; by default this is 64.
[[541, 529, 457, 422, 516, 648, 414, 438, 640, 401, 389, 689],
[525, 550, 479, 488, 511, 548, 470, 459, 530, 431, 456, 637],
[575, 564, 433, 415, 486, 607, 477, 382, 669, 399, 388, 689],
[491, 515, 503, 495, 541, 589, 407, 515, 501, 433, 457, 637],
[557, 508, 435, 395, 560, 631, 397, 456, 633, 449, 400, 663],
[509, 571, 501, 515, 467, 565, 487, 441, 537, 383, 445, 663],
[500, 530, 476, 531, 413, 551, 499, 517, 519, 382, 412, 754],
[587, 537, 451, 475, 539, 609, 439, 401, 573, 441, 391, 641],
[485, 473, 449, 466, 516, 648, 414, 438, 596, 457, 445, 697],
[561, 566, 523, 448, 551, 616, 418, 387, 586, 403, 408, 617],
[549, 548, 427, 484, 509, 640, 442, 405, 598, 403, 402, 677],
[572, 613, 510, 507, 457, 570, 474, 491, 537, 318, 359, 676]]
We create a 4x4 matrix that owns its own data buffer in a memory-mapped file, then fill the buffer and print it.
Usually, the owned data buffer is more interesting for storing the results of the computation and for intermediate evaluations.
[[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15]]
Now we create a 4x4 matrix with data in a separate vector
[[2, 2, 2, 2],
[2, 2, 2, 2],
[2, 2, 2, 2],
[2, 2, 2, 2]]
Now we make a shallow copy of the first matrix onto the second
[[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15]]
We test the shallow copy by setting the first element of the first matrix to 42
and then print the first and second matrix
[[42, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15]]
[[42, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15]]
Now we test more advanced algorithms.
Now a Cholesky decomposition on CPU.
The result is put in an mdspan_data, which allocates its own resources, with the dataset
[[210, -92, 68, -33, -34, -4, 118, -6],
[-92, 318, -100, 130, -153, -64, 160, 33],
[68, -100, 204, -96, 41, -69, -16, -26],
[-33, 130, -96, 338, -152, -51, 12, 22],
[-34, -153, 41, -152, 346, 11, -30, -25],
[-4, -64, -69, -51, 11, 175, -79, 5],
[118, 160, -16, 12, -30, -79, 320, 7],
[-6, 33, -26, 22, -25, 5, 7, 239]]
[[14.4914, 0, 0, 0, 0, 0, 0, 0],
[-6.3486, 16.6642, 0, 0, 0, 0, 0, 0],
[4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0],
[-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0],
[-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0],
[-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0],
[8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0],
[-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587,
15.1958]]
We can verify the Cholesky decomposition by multiplication.
We can create a transpose with the base class DataBlock, but also with mdspan.
[[210, -92, 68, -33, -34, -4, 118, -6],
[-92, 318, -100, 130, -153, -64, 160, 33],
[68, -100, 204, -96, 41, -69, -16, -26],
[-33, 130, -96, 338, -152, -51, 12, 22],
[-34, -153, 41, -152, 346, 11, -30, -25],
[-4, -64, -69, -51, 11, 175, -79, 5],
[118, 160, -16, 12, -30, -79, 320, 7],
[-6, 33, -26, 22, -25, 5, 7, 239]]
Now the Cholesky decomposition is done entirely on GPU
[[14.4914, 0, 0, 0, 0, 0, 0, 0],
[-6.3486, 16.6642, 0, 0, 0, 0, 0, 0],
[4.69245, -4.2132, 12.8152, 0, 0, 0, 0, 0],
[-2.27722, 6.9336, -4.37774, 16.2965, 0, 0, 0, 0],
[-2.34622, -10.0752, 0.74604, -5.16795, 14.5506, 0, 0, 0],
[-0.276026, -3.94573, -6.58037, -3.257, -2.84005, 9.86812, 0, 0],
[8.14277, 12.7036, -0.0535879, -3.54515, 6.79111, -1.94966, 5.46098, 0],
[-0.414039, 1.82256, -1.27804, 0.173372, -0.395814, 0.314913, -1.63587,
15.1958]]
We can verify the Cholesky decomposition by multiplication.
Here we create the transpose with mdspan.
[[210, -92, 68, -33, -34, -4, 118, -6],
[-92, 318, -100, 130, -153, -64, 160, 33],
[68, -100, 204, -96, 41, -69, -16, -26],
[-33, 130, -96, 338, -152, -51, 12, 22],
[-34, -153, 41, -152, 346, 11, -30, -25],
[-4, -64, -69, -51, 11, 175, -79, 5],
[118, 160, -16, 12, -30, -79, 320, 7],
[-6, 33, -26, 22, -25, 5, 7, 239]]
With the advanced algorithms on GPU
[[210, -92, 68, -33, -34, -4, 118, -6],
[-92, 318, -100, 130, -153, -64, 160, 33],
[68, -100, 204, -96, 41, -69, -16, -26],
[-33, 130, -96, 338, -152, -51, 12, 22],
[-34, -153, 41, -152, 346, 11, -30, -25],
[-4, -64, -69, -51, 11, 175, -79, 5],
[118, 160, -16, 12, -30, -79, 320, 7],
[-6, 33, -26, 22, -25, 5, 7, 239]]
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuModuleGetFunction (__do_global_dtors__entry) error: an illegal
memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Process returned 1 (0x1) execution time : 6.770 s
Press ENTER to continue.