https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125638

            Bug ID: 125638
           Summary: openmp performance problem #pragma omp simd is
                    necessary to get fast vectorized loops on gpu, whereas
                    on cpu, automatic optimization is much faster. than
                    the simd statement
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schulz.benjamin at googlemail dot com
  Target Milestone: ---

Created attachment 64648
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64648&action=edit
main.cpp

statements like 

#pragma omp begin declare target
#pragma omp end declare target 

allow it to compile code that is available for both cpu and gpu targets.

Now gcc has a powerful optimizer, at least for cpu, that can decide how to
optimize and vectorize code at best. 

Unfortunately, it is not that intelligent for gpu, at least not for the nvptx
backend. 

And so it can be that code without 

#pragma omp simd 

is faster on cpu 

while on gpu it needs the 

#pragma omp simd

statement for reasonable speed. 



This is inconvenient for a function within a 
#pragma omp begin declare target 
#pragma omp end declare target

region that is available for both gpu and cpu.



For example, take a simple class like:

class DataBlock2
{
public:
    size_t* dpextents;
    size_t* dpstrides;
    double* dpdata;
    size_t dpdatalength;
    DataBlock2(size_t *ext,size_t *str, double* dat, size_t datlength):
        dpextents(ext),dpstrides(str),dpdata(dat),dpdatalength(datlength) {}
};

and write a function for matrix multiplication:


void matrix_multiply_dot_g1(  const DataBlock2& A, const  DataBlock2& B,
DataBlock2& C)
{
    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    auto t11 = high_resolution_clock::now();
    #pragma omp parallel for collapse(2)
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
        {
            double sum = 0.0;
            const size_t ia=i*Astr0;
            const size_t jb=j*Bstr1;
            for (size_t k = 0; k < inner_dim; ++k)
            {
                const size_t ai=ia+k*Astr1;
                const size_t bi=k*Bstr0+jb;
                sum += A.dpdata[ai] *B.dpdata[bi];
            }
            const size_t ci=i*Cstr0+j*Cstr1;
            C.dpdata[ci]= sum;
        }

    auto t12 = high_resolution_clock::now();
    auto ms_int1 = duration_cast<milliseconds>(t12 - t11);
    std::cout << "cpu parallel without simd duration "<< ms_int1.count() <<
"ms\n";

}

on my ryzen 12 core 9700 processor, it can multiply 1000x1000 matrices in 82
miliseconds with -o3 and gcc-16.

If I add a 

#pragma omp simd reduction (+:sum) 

statement before the inner loop over k, there would be no warning. The inner
loop over k has no side effects, since A and B are const.

Unfortunately, code with the 

#pragma omp simd reduction(+:sum) 

statement would execute in slow 132 miliseconds if compiled with -O3 and
gcc-16.1 on cpu. The internal optimizer of gcc seems to know better what to do
with this loop.



For the nvptx backend, this is not the case and the gcc15/16 compiler needs
guidance.

Here is the same function but in a version that offloads to a target device:


void matrix_multiply_dot_g2(  const DataBlock2& A, const  DataBlock2& B,
DataBlock2& C)
{
    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

   #pragma omp target enter data map(to:A,
A.dpdata[0:A.dpdatalength],A.dpextents[0:2],A.dpstrides[0:2],\
B.dpdata[0:B.dpdatalength],B.dpextents[0:2],B.dpstrides[0:2], \
C,C.dpdata[0:C.dpdatalength],C.dpextents[0:2],C.dpstrides[0:2])
    auto t11 = high_resolution_clock::now();
    #pragma omp target teams distribute parallel for collapse(2)
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
        {
            double sum = 0.0;
            const size_t ia=i*Astr0;
            const size_t jb=j*Bstr1;
            for (size_t k = 0; k < inner_dim; ++k)
            {
                const size_t ai=ia+k*Astr1;
                const size_t bi=k*Bstr0+jb;
                sum += A.dpdata[ai] *B.dpdata[bi];
            }
            const size_t ci=i*Cstr0+j*Cstr1;
            C.dpdata[ci]= sum;
        }

    auto t12 = high_resolution_clock::now();
    #pragma omp target update from(C.dpdata[0:C.dpdatalength])
    #pragma omp target exit data map(delete:
A.dpdata[0:A.dpdatalength],A.dpextents[0:2],A.dpstrides[0:2],A,\
B.dpdata[0:B.dpdatalength],B.dpextents[0:2],B.dpstrides[0:2],B,\
C.dpdata[0:C.dpdatalength],C.dpextents[0:2],C.dpstrides[0:2],C)
    auto ms_int1 = duration_cast<milliseconds>(t12 - t11);
    std::cout << "gpu parallel without simd duration "<< ms_int1.count() <<
"ms\n";

}


Without simd, as it is written above, it takes 177 miliseconds on gpu, slower
than my cpu.


with an inserted 

#pragma omp simd reduction(+:sum) 

it is be considerable faster. It takes 28 miliseconds, faster than my cpu, as
it should be.



Attached is the source of a little benchmark program which evaluates this with
typical results like:


gpu parallel with simd duration 28ms
gpu parallel without simd duration 177ms

cpu parallel with simd duration 132ms
cpu parallel without simd duration 82ms


Compile it with 

-march=native -fopenmp -foffload=nvptx-none -fno-stack-protector  -O3 -Wall

and gcc-16.



Assume now you wanted to enclose the function matrix_multiply_dot_g1 inside a

#pragma omp begin declare target
#pragma omp end declare target 

region. 

Assume further you offload an array of hundred matrices A[i],B[i],C[i] on gpu,
and in another function, you multiply them in parallel together as follows:


#pragma omp target teams distribute
for(size_t i=0;i<matrix_count;i++)
   matrix_multiply_dot_g1(A[i],B[i],C[i]);


This code would then run slow, because in the target region, the simd pragma
would be necessary for gcc to use fast vector optimizations,


On the other hand, matrix_multiply_dot_g1 only fast on cpu with the simd
statement missing..


Currently, a workaround is to make an if (simd:) clause into the inner loop as 
follows:


#pragma omp simd reduction(+:sum) if(simd:!omp_is_initial_device()) 

then the simd is only added to the inner loop if we are on gpu and not
otherwise.

The if clause yields results like:

gpu parallel with simd duration 28ms

cpu parallel with simd duration 65ms

cpu parallel without simd duration 61ms


The small slowdown on cpu with the if clause is because it has to evaluate a
condition in every thread, but at least it has improved from 130ms on the cpu,
whereas for the gpu version, the if clause does not change the good simd
result...


Of course, it is a bit awkward, to add a 

#pragma omp simd 

clause only if one is on gpu


if gcc has good automatic vector optimizers for simd loops on cpu,
then it should choose appropriate automatic optimizers for simt loops on gpu
too.

One should not need to remind gcc manually that it should vectorize loops on
gpu targets as it is now.

Reply via email to