https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120737

            Bug ID: 120737
           Summary: #pragma omp atomic fails on nvptx
           Product: gcc
           Version: 15.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
  Target Milestone: ---

Created attachment 61673
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61673&action=edit
compile with -g -O3 -fopenmp -foffload=nvptx-none -fno-stack-protector
-std=c++23 -lm -lc; the result will be 0 instead of 2470. Removing the target
statement in the last loop yields the correct result.

Hi there, I noticed that

#pragma omp target teams distribute does not, according to the OpenMP 5.x
standard, support a reduction clause:

https://www.openmp.org/spec-html/5.0/openmpse15.html#x57-910002.7
https://www.openmp.org/spec-html/5.0/openmpsu73.html

GCC nevertheless seems to allow it as an extension, and it seems to work
correctly

if we fill an array:

size_t elements=20;
std::vector<double> v1(elements),v2(elements);
#pragma omp parallel for simd
for(size_t i=1;i<elements;i++)
{
    v1[i]=(double)i;
    v2[i]=(double)i;
}

double* v1d=v1.data(),*v2d=v2.data();

#pragma omp target enter data map (to:v1d[0:elements])  
#pragma omp target enter data map (to:v2d[0:elements]) 

double tmp=0;
#pragma omp target enter data map (to:tmp)

// this reduction is not really allowed by the spec but works in gcc.

#pragma omp target teams distribute reduction(+:tmp)
for(size_t i=1;i<20;i++)
{
    #pragma omp atomic
    tmp+=v1d[i]*v2d[i];
}
std::cout<<tmp<<"\n";



At least it yields the same result as a loop with a reduction on the host,

#pragma omp parallel for reduction(+:tmp)

The correct result should be 2470, and that is also what we get with the
technically disallowed reduction on the device.
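
For reference, a minimal host version of that check (a sketch reusing the
arrays from above; the variable name ref is mine):

// host reference: sums i*i for i = 1..19, which is 2470
double ref=0;
#pragma omp parallel for reduction(+:ref)
for(size_t i=1;i<elements;i++)
    ref+=v1d[i]*v2d[i];
std::cout<<ref<<"\n"; // prints 2470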


In order to prepare my code for compilers that are stricter than gcc, I wanted
to turn this into standard-compliant code.

According to the OpenMP standard, the teams construct supports a shared clause
for variables:

https://www.openmp.org/spec-html/5.0/openmpse15.html#x57-910002.7

#pragma omp target teams distribute shared(tmp) is therefore standard-compliant
if tmp has been mapped.
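
To make concrete what I mean by "mapped", here is a sketch with the map clause
written directly on the combined construct (map(tofrom:...) so the updated
value also comes back to the host):

double tmp=0;
// scalars are otherwise implicitly firstprivate on the target construct,
// so map tmp explicitly in both directions
#pragma omp target teams distribute shared(tmp) map(tofrom:tmp)
for(size_t i=1;i<elements;i++)
{
    #pragma omp atomic
    tmp+=v1d[i]*v2d[i];
}
std::cout<<tmp<<"\n";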

And now, one has #pragma omp atomic so that only one thread can update the
variable at a time. Since teams distribute just distributes the loop across a
league of teams, omp atomic should work in the same way there.

The following should then mimic the reduction in an OpenMP-standard-compliant
way. It just fails:


#include <omp.h>
#include <vector>
#include <iostream>

int main()
{
    size_t elements=20;
    std::vector<double> v1(elements),v2(elements);

    #pragma omp parallel for simd
    for(size_t i=1;i<elements;i++)
    {
        v1[i]=(double)i;
        v2[i]=(double)i;
    }

    double* v1d=v1.data(),*v2d=v2.data();

    #pragma omp target enter data map (to:v1d[0:elements])
    #pragma omp target enter data map (to:v2d[0:elements])

    double tmp=0;

    // #pragma omp parallel for shared(tmp)
    #pragma omp target teams distribute shared(tmp)
    for(size_t i=1;i<20;i++)
    {
        #pragma omp atomic
        tmp+=v1d[i]*v2d[i];
    }
    std::cout<<tmp<<"\n";
}

This yields 0 with gcc 15.1 instead of 2470.
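
For reference, the compile and run commands from the attachment, with g++ (the
file name testcase.cpp is just a placeholder):

g++ -g -O3 -fopenmp -foffload=nvptx-none -fno-stack-protector -std=c++23 \
    testcase.cpp -lm -lc
./a.out    # prints 0 here; expected 2470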


Now I asked myself whether this is caused by the target teams distribute or by
the omp atomic statement...

So we replace the target teams distribute by a target parallel for statement.
This here:

#pragma omp target enter data map (to:tmp)

#pragma omp target parallel for shared(tmp)
for(size_t i=1;i<20;i++)
{
    #pragma omp atomic
    tmp+=v1d[i]*v2d[i];
}

#pragma omp target exit data map (from:tmp)
std::cout<<tmp<<"\n";
}

also yields zero instead of 2470.

The code is, however, correct: if we remove the target exit data statement and
replace the target parallel for by a host loop with just parallel for, then the
code yields 2470.
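
That host variant, for completeness (a sketch under the same setup as above,
using the commented-out pragma from the full program):

double tmp=0;
#pragma omp parallel for shared(tmp)
for(size_t i=1;i<20;i++)
{
    #pragma omp atomic
    tmp+=v1d[i]*v2d[i];
}
std::cout<<tmp<<"\n"; // prints 2470 on the host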

It also yields 0 on the device if we replace the #pragma omp atomic by
#pragma omp critical. This here:


double tmp=0;
#pragma omp target enter data map (to:v1d[0:elements])
#pragma omp target enter data map (to:v2d[0:elements])

#pragma omp target parallel for shared(tmp)
for(size_t i=1;i<20;i++)
{
    #pragma omp critical
    tmp+=v1d[i]*v2d[i];
}

std::cout<<tmp<<"\n";

This also yields 0 instead of 2470.
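
For comparison, what I believe is the fully standard-conforming way to write
this reduction on the device is the combined construct below (a sketch; the
reduction and the map both sit on the combined directive, and map(tofrom:tmp)
brings the result back to the host):

double tmp=0;
// standard-conforming: reduction on the combined construct,
// explicit tofrom map so the reduced value returns to the host
#pragma omp target teams distribute parallel for reduction(+:tmp) map(tofrom:tmp)
for(size_t i=1;i<elements;i++)
    tmp+=v1d[i]*v2d[i];
std::cout<<tmp<<"\n"; // expected 2470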
