https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123272

--- Comment #4 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Created attachment 63164
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63164&action=edit
mein.cpp.omplower

Hi there, as for the matrix multiplication problem: since I am inexperienced with
the gcc sources, I asked ChatGPT.

After convincing it that my code obeys the OpenMP standard, I gave it a link to the
gcc source...

It suggested that this is an omp-lowering bug and told me to compile with
-fdump-tree-all and look at the *.omplower and *.ompexp dump files.

The result is indeed very worrying. 

First, I do not really understand why it maps A, B and C again in the target teams
distribute construct of matrix_multiply_dot_g1 and matrix_multiply_dot_g2, since the
pointers to the structs were already mapped with target enter data earlier. But
perhaps it just makes them available to the threads.
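
For reference, the earlier mapping follows roughly this pattern (a hypothetical,
reduced helper; only the dpdata member name is taken from the code quoted below,
everything else is illustrative):

#include <cstddef>

// Hypothetical helper sketching the mapping pattern in question: the struct
// and the storage its dpdata pointer refers to are mapped once up front, so
// a later target teams distribute construct should not need to map A, B and
// C again.
template <typename Block>
void map_block_to_device(Block& blk, std::size_t nelems, int dev)
{
    #pragma omp target enter data map(to: blk, blk.dpdata[0:nelems]) device(dev)
}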

What I have seen in the CUDA sanitizer, however, is that gcc generally makes many
more host/device copies than clang, which slows everything down. As shown in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121178#c1, clang issues a minimal
number of transfers and even uses asynchronous offloading strategies there, making
all applications feel much faster.
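
To make clear what I mean by asynchronous offloading, here is a generic OpenMP
sketch (not clang's actual strategy, and the function is hypothetical): transfers
and kernels are made non-blocking with nowait and ordered with depend clauses, so
the host does not stall on every copy.

#include <cstddef>

// Generic sketch of asynchronous offloading; assumes a and b were already
// mapped with target enter data.
void update_and_compute(double* a, double* b, std::size_t n, int dev)
{
    // Start the host->device update without blocking the host thread.
    #pragma omp target update to(a[0:n]) device(dev) nowait depend(out: a[0:n])

    // The kernel only waits for the data it actually needs.
    #pragma omp target teams distribute parallel for device(dev) nowait \
        depend(in: a[0:n]) depend(out: b[0:n])
    for (std::size_t i = 0; i < n; ++i)
        b[i] = 2.0 * a[i];

    // Synchronize once at the end instead of after every transfer.
    #pragma omp taskwait
}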


But mappings and copies are not the reason for the wrong results that are the topic
of this bug.


In the code for the matrix multiplications, only the two outer loops are
parallelized, while the inner loop runs single-threaded:

  #pragma omp target teams distribute parallel for collapse(2) device(dev)
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j)
            {
                T sum = T(0);

                for (size_t k = 0; k < inner_dim; ++k)
                {
                    sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
                }
                C.dpdata[i*Cstr0+j*Cstr1] = sum;
            }

For the non-templated matrix_multiply_dot_g2 in the attached omplower file, the
variable sum is accordingly declared on line 235, inside the parallel region and in
front of the thread-private loop, and initialized, which gives the correct result.


Astr1 = .omp_data_i->Astr1;
Astr0 = .omp_data_i->Astr0;
{
  long unsigned int rows.30;
  long unsigned int cols.31;
  double sum;
  size_t i;
  size_t j;

  try
    {
      rows.30 = rows;
      cols.31 = cols;
      #pragma omp for _looptemp_(D.146209) _looptemp_(D.146211) _looptemp_(D.146213) collapse(2) nowait private(j) private(i)
      for (i = 0; i < rows.30; i = i + 1)
        for (j = 0; j < cols.31; j = j + 1)
      {
        sum = 0.0;
        {
          size_t k;

Yet although the templated version, matrix_multiply_dot_g1, is exactly the same C++
code, gcc on its own declares the variable sum much earlier there, on line 2444,
before the collapsed parallel loops.

A distribute loop only follows on line 2459, then sum is copied into the outlined
data structure on line 2478, the parallel construct starts on line 2479, and only
after that comes the inner loop, in which sum should have been declared:


size_t i;
size_t j;
double sum;

{
  long unsigned int rows.126;
  long unsigned int cols.127;
  unsigned long .iter.128;
  unsigned long .count.129;
  size_t i.97;
  size_t j.98;

  try
    {
      rows.126 = rows;
      cols.127 = cols;
      #pragma omp distribute collapse(2) private(j.98) private(i.97)
      for (i.97 = 0; i.97 < rows.126; i.97 = i.97 + 1)
        for (j.98 = 0; j.98 < cols.127; j.98 = j.98 + 1)
      {
        {
          .omp_data_o.125.D.146784 = D.146779;
          .omp_data_o.125.D.146786 = D.146778;
          .omp_data_o.125.D.146788 = D.146777;
          .omp_data_o.125.inner_dim = inner_dim;
          .omp_data_o.125.cols = cols;
          .omp_data_o.125.rows = rows;
          .omp_data_o.125.Cstr1 = Cstr1;
          .omp_data_o.125.Cstr0 = Cstr0;
          .omp_data_o.125.Bstr1 = Bstr1;
          .omp_data_o.125.Bstr0 = Bstr0;
          .omp_data_o.125.Astr1 = Astr1;
          .omp_data_o.125.Astr0 = Astr0;
          .omp_data_o.125.A = A;
          .omp_data_o.125.B = B;
          .omp_data_o.125.C = C;
          .omp_data_o.125.sum = sum;
          #pragma omp parallel _looptemp_(D.146779) _looptemp_(D.146778) _looptemp_(D.146777) shared(sum) firstprivate(inner_dim) firstprivate(cols) firstprivate(rows) firstprivate(Cstr1) firstprivate(Cstr0) firstprivate(Bstr1) firstprivate(Bstr0) firstprivate(Astr1) firstprivate(Astr0) shared(C) shared(B) shared(A) [child fn: _Z22matrix_multiply_dot_g1IdEvRK10DataBlock1IT_ES4_RS2_i._omp_fn.1 (.omp_data_o.125)]
            {
              try
                {
                  .omp_data_i = (struct .omp_data_s.109 & restrict) &.omp_data_o.125;
                  D.146785 = .omp_data_i->D.146784;
                  D.146787 = .omp_data_i->D.146786;
                  D.146789 = .omp_data_i->D.146788;
                  inner_dim = .omp_data_i->inner_dim;
                  cols = .omp_data_i->cols;
                  rows = .omp_data_i->rows;
                  Cstr1 = .omp_data_i->Cstr1;
                  Cstr0 = .omp_data_i->Cstr0;
                  Bstr1 = .omp_data_i->Bstr1;
                  Bstr0 = .omp_data_i->Bstr0;
                  Astr1 = .omp_data_i->Astr1;
                  Astr0 = .omp_data_i->Astr0;
                  {
                    long unsigned int rows.123;
                    long unsigned int cols.124;
                    size_t i;
                    size_t j;

                    try
                      {
                        rows.123 = rows;
                        cols.124 = cols;
                        #pragma omp for _looptemp_(D.146785) _looptemp_(D.146787) _looptemp_(D.146789) collapse(2) nowait private(j) private(i)
                        for (i = 0; i < rows.123; i = i + 1)
                          for (j = 0; j < cols.124; j = j + 1)
                        .omp_data_i->sum = 0.0;
                        {
                          size_t k;

                          k = 0;
                          goto <D.146818>;
                          <D.146819>:
                          D.146858 = .omp_data_i->A;
                          D.146710 = D.146858->dpdata;


I then asked ChatGPT where the problem could be.

It answered the following:


In your templated case, during OpenMP lowering, GCC:

- walks the variables referenced in a target / parallel region,
- sees sum,
- incorrectly treats it as capturable,
- hoists it into .omp_data_o,
- marks it shared(sum).

This happens even though:

- sum is declared inside the structured block,
- sum has automatic storage,
- sum does not escape,
- sum is not referenced outside the loop body.

This is illegal per both:

- C++ scoping rules,
- OpenMP data-sharing rules.

The bug is classification, not codegen.


Where to patch (exact file & function)

You already found the right function: the primary culprit is
omp_maybe_apply_defaultmap () in gcc/omp-general.cc.

This function decides which variables are mapped, which are firstprivate and which
are shared. It is called before some scalarization when templates are involved at
-O0.

Rule that GCC must enforce:

A variable declared inside the structured block of an OpenMP region
must NEVER be captured into .omp_data_*

This rule is unconditional.

So the fix is:

If a variable's DECL_CONTEXT is inside the OpenMP region's body,
do not apply defaultmap logic to it.

Minimal, correct patch (illustrative):

This is not a full diff, but this is the exact logic you want.

In gcc/omp-general.cc

Locate omp_maybe_apply_defaultmap().

Add a guard like this before any defaultmap classification:

static bool
omp_declared_in_region_body (tree decl, tree region)
{
  if (!DECL_P (decl))
    return false;

  tree ctx = DECL_CONTEXT (decl);
  if (!ctx)
    return false;

  /* If the variable was declared inside the OpenMP structured block,
     it must be private and must not be captured. */
  if (TREE_CODE (ctx) == BLOCK
      && BLOCK_SUPERCONTEXT (ctx) == region)
    return true;

  return false;
}


Then, inside omp_maybe_apply_defaultmap():

/* NEW: Skip variables declared inside the region body */
if (omp_declared_in_region_body (decl, region))
  {
    /* Explicitly mark as private and skip mapping */
    omp_mark_decl_private (decl);
    continue;
  }


That is what an AI says; it should be treated like the opinion of a human who has
had only a short look at gcc's code.

I just want to report this...

I may test the patch after I have had a better look at what gcc is doing there.


What I see in the omplower dump is that the templated code is lowered very
differently from the non-templated code, even though the original C++ code is
identical. That should not be the case.

So many more fixes may be necessary in gcc before OpenMP GPU offloading works with
templated structs and classes.


I have written a library at
https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main

which uses OpenMP offloading and similar templated classes and structs, and of
course also things like inherited templated classes (in the datablock container),
as well as many OpenMP constructs with and without offloading. It comes with a few
test applications.

At the moment, the library works with clang without errors in the CUDA sanitizer,
but the test applications that offload to the GPU crash when compiled with gcc.

To check whether the generated OpenMP code is correct, compiling with
-fdump-tree-all and then inspecting the omplower and ompexp dumps for wrong code may
be worthwhile, since the library is basically a collection of offload test cases
with templates.

You may also try to enrich the OpenMP constructs, e.g. with the tile construct, the
loop construct, unroll and partial unroll, atomic pragmas, reductions, collapse
clauses, device pointers and so on, and then check whether the templated code is
still lowered correctly.
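
For example, a variant of the kernel from above enriched with a reduction on the
inner loop could look like this (illustrative only; raw pointers instead of my
DataBlock1 struct, and tile or unroll variants would be built analogously):

#include <cstddef>

// Illustrative test variant: same collapsed offload loop as above, but with
// an explicit simd reduction on the inner dot-product loop.
template <typename T>
void matmul_reduction(const T* A, const T* B, T* C, std::size_t rows,
                      std::size_t cols, std::size_t inner_dim, int dev)
{
    #pragma omp target teams distribute parallel for collapse(2) device(dev) \
        map(to: A[0:rows*inner_dim], B[0:inner_dim*cols]) map(from: C[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);
            #pragma omp simd reduction(+: sum)
            for (std::size_t k = 0; k < inner_dim; ++k)
                sum += A[i * inner_dim + k] * B[k * cols + j];
            C[i * cols + j] = sum;
        }
}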

At the moment, gcc fails to generate valid OpenMP offload code for a very simple
matrix multiplication once templated classes are involved, so failures are to be
expected even more for the more involved and difficult test cases from my library.
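
For completeness, here is a reduced, self-contained sketch of the kind of templated
kernel and mapping that triggers the problem. The struct and member names are
simplified from my library; the sizes and the final check are only illustrative, so
this is not the exact test case from the repository.

#include <cstddef>
#include <cstdio>
#include <vector>

// Reduced sketch of the failing pattern: a templated struct with a data
// pointer and strides, used in a collapsed target teams distribute parallel
// for kernel whose per-iteration variable "sum" must remain private.
template <typename T>
struct DataBlock1 {
    T*          dpdata;
    std::size_t str0, str1;
};

template <typename T>
void matrix_multiply_dot(const DataBlock1<T>& A, const DataBlock1<T>& B,
                         DataBlock1<T>& C, std::size_t rows,
                         std::size_t cols, std::size_t inner_dim, int dev)
{
    #pragma omp target teams distribute parallel for collapse(2) device(dev)
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);   // declared inside the structured block: must stay private
            for (std::size_t k = 0; k < inner_dim; ++k)
                sum += A.dpdata[i * A.str0 + k * A.str1]
                     * B.dpdata[k * B.str0 + j * B.str1];
            C.dpdata[i * C.str0 + j * C.str1] = sum;
        }
}

int main()
{
    const std::size_t n = 64;
    const int dev = 0;
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);
    DataBlock1<double> A{a.data(), n, 1}, B{b.data(), n, 1}, C{c.data(), n, 1};

    // Map the structs and their data once, up front.
    #pragma omp target enter data map(to: A, A.dpdata[0:n*n]) device(dev)
    #pragma omp target enter data map(to: B, B.dpdata[0:n*n]) device(dev)
    #pragma omp target enter data map(to: C, C.dpdata[0:n*n]) device(dev)

    matrix_multiply_dot(A, B, C, n, n, n, dev);

    #pragma omp target exit data map(from: C.dpdata[0:n*n]) device(dev)
    // (unmapping of A, B and the structs is omitted for brevity)

    // With all-ones inputs every entry of C should equal n.
    std::printf("c[0] = %g (expected %g)\n", c[0], double(n));
}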
