https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859
--- Comment #7 from Thorsten Kurth <thorstenkurth at me dot com> --- Hello Jakub, thanks for your comment, but I think the parallel for is not racy. Every thread works on its own block of i-indices, so that is fine. The dotprod kernel is actually taken from the OpenMP standard's examples document, and I am sure it is not racy. I also do not see a problem with the regions example you mentioned: by default everything is shared, so the variable is updated by all threads/teams with the same value. The issue is that num_teams == 1 is only guaranteed on the CPU; on the GPU it depends on the OS, driver, architecture, and so on.

Concerning splitting distribute and parallel: I tried both combinations and found that they behave the same. In the end I split them so that I could comment out the distribute part to see whether that makes a performance difference (and it does).

I believe the overhead instructions are responsible for the bad performance, because they are the only thing distinguishing the target-annotated code from the plain OpenMP code. I used VTune to look at thread utilization, and the two versions look similar; the L1 and L2 hit rates are very close (100% vs. 99% and 92% vs. 89% for the plain OpenMP and the target-annotated code, respectively). BUT the performance of the target-annotated code can be up to 10x worse. So I think there might be register spilling caused by copying a large number of variables. If you like, I can point you to the GitHub repository (BoxLib) that clearly exhibits this issue. This small test case shows only minor overhead of OpenMP 4.5 vs., say, OpenMP 3, but it clearly generates some additional overhead.