https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120737
Bug ID: 120737
Summary: #pragma omp atomic fails on nvptx
Product: gcc
Version: 15.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: schulz.benjamin at googlemail dot com
Target Milestone: ---

Created attachment 61673
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61673&action=edit
compile with -g -O3 -fopenmp -foffload=nvptx-none -fno-stack-protector -std=c++23 -lm -lc
The result will be 0 instead of 2470. Removing the target statement in the last loop yields the correct result.

Hi there,

I noticed that, per the OpenMP 5.x standard, #pragma omp target teams distribute does not support a reduction clause:

https://www.openmp.org/spec-html/5.0/openmpse15.html#x57-910002.7
https://www.openmp.org/spec-html/5.0/openmpsu73.html

GCC nevertheless seems to allow it as an extension, and it appears to work correctly if we fill an array:

    size_t elements = 20;
    std::vector<double> v1(elements), v2(elements);

    #pragma omp parallel for simd
    for (size_t i = 1; i < elements; i++)
    {
        v1[i] = (double)i;
        v2[i] = (double)i;
    }

    double *v1d = v1.data(), *v2d = v2.data();

    #pragma omp target enter data map(to: v1d[0:elements])
    #pragma omp target enter data map(to: v2d[0:elements])

    double tmp = 0;
    #pragma omp target enter data map(to: tmp)

    // this reduction is not really allowed by the spec but works in gcc.
    #pragma omp target teams distribute reduction(+:tmp)
    for (size_t i = 1; i < 20; i++)
    {
        #pragma omp atomic
        tmp += v1d[i] * v2d[i];
    }

    std::cout << tmp << "\n";

At least it yields the same result as a loop on the host with #pragma omp parallel for reduction(+:tmp). The correct result is 2470, and that is also what we get with the technically disallowed reduction.

In order to prepare my code for compilers stricter than gcc, I wanted to turn this into standards-compliant code. According to the OpenMP standard, the teams construct supports a shared clause for variables.
https://www.openmp.org/spec-html/5.0/openmpse15.html#x57-910002.7

#pragma omp target teams distribute shared(tmp) is standards-compliant if tmp was mapped. And then one has #pragma omp atomic, so that only one thread can access the variable at a time. Since teams distribute is just a league of threads, omp atomic should work the same way there. The following should then mimic the reduction in an OpenMP-standard-compliant way. It just fails:

    #include <omp.h>
    #include <vector>
    #include <iostream>

    int main()
    {
        size_t elements = 20;
        std::vector<double> v1(elements), v2(elements);

        #pragma omp parallel for simd
        for (size_t i = 1; i < elements; i++)
        {
            v1[i] = (double)i;
            v2[i] = (double)i;
        }

        double *v1d = v1.data(), *v2d = v2.data();

        #pragma omp target enter data map(to: v1d[0:elements])
        #pragma omp target enter data map(to: v2d[0:elements])

        double tmp = 0;
        // #pragma omp parallel for shared(tmp)
        #pragma omp target teams distribute shared(tmp)
        for (size_t i = 1; i < 20; i++)
        {
            #pragma omp atomic
            tmp += v1d[i] * v2d[i];
        }

        std::cout << tmp << "\n";
    }

This yields 0 with gcc 15.1 instead of 2470.

Now I asked myself whether this is because of the target teams distribute or the omp atomic statement. So we replace the target teams distribute by a target parallel for statement:

    #pragma omp target enter data map(to: tmp)
    #pragma omp target parallel for shared(tmp)
    for (size_t i = 1; i < 20; i++)
    {
        #pragma omp atomic
        tmp += v1d[i] * v2d[i];
    }
    #pragma omp target exit data map(from: tmp)

    std::cout << tmp << "\n";

This also yields 0 instead of 2470. The code is, however, correct: if we remove the target exit data statement and replace the target parallel for by a host loop with just parallel for, the code yields 2470. It also yields 0 on the device if we replace #pragma omp atomic by critical.
This here:

    double tmp = 0;

    #pragma omp target enter data map(to: v1d[0:elements])
    #pragma omp target enter data map(to: v2d[0:elements])

    #pragma omp target parallel for shared(tmp)
    for (size_t i = 1; i < 20; i++)
    {
        #pragma omp critical
        tmp += v1d[i] * v2d[i];
    }

    std::cout << tmp << "\n";

will also yield 0 instead of 2470.