https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #7 from Thorsten Kurth <thorstenkurth at me dot com> ---
Hello Jakub,

thanks for your comment, but I think the parallel for is not racy. Every thread
works on its own block of i-indices, so that is fine. The dotprod kernel is
actually a kernel from the OpenMP standard documentation, and I am sure that it
is not racy.
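
For reference, the kernel is essentially the following (a sketch from memory,
not copied verbatim from the Examples document):

float dotprod(const float *B, const float *C, int N)
{
    float sum = 0.0f;
    /* the reduction makes the accumulation race-free; each thread
       works on its own block of i-indices */
    #pragma omp target teams distribute parallel for \
            map(to: B[0:N], C[0:N]) reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += B[i] * C[i];
    return sum;
}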

As for the example with the regions that you mentioned, I do not see a problem
with that either. By default everything is shared, so the variable is updated
by all the threads/teams with the same value.
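
To make sure we are talking about the same pattern, this is roughly what I
mean (a made-up sketch, not the actual code):

int set_flag(int N)
{
    int flag = 0;
    /* flag is mapped and shared inside the region; every thread/team
       that writes it stores the identical value, so the result does
       not depend on how many teams actually run */
    #pragma omp target teams distribute parallel for map(tofrom: flag)
    for (int i = 0; i < N; i++)
        flag = 1;
    return flag;
}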

The issue is that num_teams=1 is only guaranteed on the CPU; on a GPU it
depends on the OS, driver, architecture, and so on.
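
A quick way to see what the runtime actually does is to query it from inside
the region (sketch; on the host one typically gets 1 team, on a GPU the result
is implementation dependent):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nteams = 0;
    /* only team 0 records the team count; the scalar is mapped back */
    #pragma omp target teams map(tofrom: nteams)
    {
        if (omp_get_team_num() == 0)
            nteams = omp_get_num_teams();
    }
    printf("num_teams = %d\n", nteams);
    return 0;
}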

Concerning splitting distribute and parallel: I tried both combinations and
found that they behave the same. In the end I kept them split so that I could
comment out the distribute part and see whether that makes a performance
difference (and it does).
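
Concretely, the two variants look like this (sketch, names are placeholders):

void add_combined(double *a, const double *b, const double *c, int n)
{
    /* fully combined construct */
    #pragma omp target teams distribute parallel for \
            map(to: b[0:n], c[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

void add_split(double *a, const double *b, const double *c, int n)
{
    /* split: target teams outside, distribute parallel for inside;
       the distribute level can now be dropped independently (leaving
       just "#pragma omp parallel for", in which case every team runs
       the whole loop redundantly) */
    #pragma omp target teams map(to: b[0:n], c[0:n]) map(from: a[0:n])
    #pragma omp distribute parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}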

I believe that the overhead instructions are responsible for the bad
performance, because that is the only thing which distinguishes the target
annotated code from the plain OpenMP code. I used VTune to look at thread
utilization and it looks similar, and the L1 and L2 hit rates are very close
(100% vs. 99% and 92% vs. 89%) for the plain OpenMP and the target annotated
code. BUT the performance of the target annotated code can be up to 10x worse.
So I think there might be register spilling due to copying a large number of
variables. If you like, I can point you to the GitHub repo (BoxLib) whose code
clearly exhibits this issue. This small test case only shows minor overhead of
OpenMP 4.5 vs., say, OpenMP 3, but it clearly generates some additional
overhead.
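
To illustrate what I mean by copying a large number of variables, here is a
hypothetical BoxLib-style kernel (not the actual code): under the OpenMP 4.5
rules, every scalar referenced in the target region is implicitly firstprivate,
so the device kernel receives a long list of bounds, strides and coefficients
as arguments, which the device compiler has to keep in registers or spill.

void smooth(double *u, const double *rhs,
            int lo0, int lo1, int lo2, int hi0, int hi1, int hi2,
            int s0, int s1, int len,
            double alpha, double beta, double dx, double dy, double dz)
{
    /* all the scalars above are captured firstprivate by the target
       construct and become kernel arguments */
    #pragma omp target teams distribute parallel for collapse(3) \
            map(tofrom: u[0:len]) map(to: rhs[0:len])
    for (int k = lo2; k <= hi2; k++)
        for (int j = lo1; j <= hi1; j++)
            for (int i = lo0; i <= hi0; i++) {
                int idx = (k * s1 + j) * s0 + i;
                u[idx] = alpha * u[idx] + beta * (dx * dy * dz) * rhs[idx];
            }
}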
