https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121416
--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Andrew Stubbs from comment #2)
> Let's look at why the atomic instructions that exist aren't working for us,
> before we try to use the big dumb hammer fix (and does that solution
> *really* work, if we don't understand the cache architecture properly?)

I think the question is whether the OpenACC code is correct or not - or to find a non-OpenACC code which also shows the issue and should be correct. I find the OpenACC-generated code too convoluted to really see whether it is valid or not.

* * *

OpenACC: In any case, for the gang code, I get (cf. comment 0) with OpenACC:

sum:
  104.833984 + i 109.667969
  16.898438 + i 17.796875
prod:
  698317287061244416.000000 + i -950434920224383616.000000
  29.758044 + i 6.070528

That is: Both 'sum' and 'prod' are wrong.

* * *

While with OpenMP, the following Fortran code produces the correct result:

  !$omp target teams distribute parallel do map(to: ary) map(tofrom: tsum, tprod)
  do ix = 1, N
    !$omp atomic update
    tsum = tsum + ary(ix)
    !$omp atomic update
    tprod = tprod * ary(ix)
  end do

but this code uses - again -
  GOMP_atomic_start
  GOMP_atomic_end
although I wonder whether __atomic_compare_exchange_16 shouldn't have handled this atomically? (This is available via libatomic for Nvptx and also on the host. I think for GCN, it is not, but I might be wrong; GCN has no libatomic, but I am not 100% sure that it doesn't handle it intrinsically.)

* * *

For the sum, one can also split it into two atomics:

  #pragma omp target teams \
          distribute parallel for \
          map(to: ary) map(tofrom: tsum, tprod)
  for (int ix = 0; ix < N; ix++)
    {
      #pragma omp atomic update
      __real__ tsum = __real__ tsum + __real__ ary[ix];
      #pragma omp atomic update
      __imag__ tsum = __imag__ tsum + __imag__ ary[ix];
    }

and doing so produces in OpenMP

sum:
  104.833984 + i 109.667969
  104.833984 + i 109.667969

and uses
  flat_atomic_cmpswap_X2

This works as RE and IM are completely independent.
It also shows that there is no generic issue with atomics.

For multiplication, real and imaginary parts get mixed. Recall that for complex variables A and B, 'A * B' is:

  (Re A * Re B - Im A * Im B) + i*(Re A * Im B + Im A * Re B)

Thus, I can do an 'atomic update' for each part, but as soon as one succeeds and the other fails, I am doomed!

I have no idea how the generated OpenACC code handles this - nor why 'sum' is also wrong with OpenACC.

* * *

Note that I used Fortran above because for C/C++, a complex 'omp atomic update' is rejected with:

  error: invalid expression type for ‘#pragma omp atomic’

while Clang accepts it; I also do not see anything wrong with using complex numbers, and in gfortran it works (using the atomic_start/atomic_end workaround).
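
* * *

To illustrate the multiplication point above (both parts must be replaced in one step), here is a minimal host-side sketch in plain C11. It is not what GCC generates for OpenMP or OpenACC. It assumes single-precision complex, so that the whole value fits into one 64-bit compare-and-swap; for double complex the same loop would need a 16-byte compare-exchange, which is the __atomic_compare_exchange_16 question from above. The helper names pack, unpack and atomic_cmul are made up for this example.

```c
#include <complex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Assumption of this sketch: float complex is exactly 8 bytes. */
_Static_assert(sizeof(float complex) == sizeof(uint64_t),
               "float complex must fit into one 64-bit word");

/* Hypothetical helper: reinterpret a float complex as 64 raw bits. */
static uint64_t pack(float complex z)
{
  uint64_t bits;
  memcpy(&bits, &z, sizeof bits);
  return bits;
}

/* Hypothetical helper: reinterpret 64 raw bits as a float complex. */
static float complex unpack(uint64_t bits)
{
  float complex z;
  memcpy(&z, &bits, sizeof z);
  return z;
}

/* Atomically perform '*target = *target * factor'.
   Real and imaginary parts are read and replaced together, so no
   other thread can observe or overwrite only one half of the update. */
static void atomic_cmul(_Atomic uint64_t *target, float complex factor)
{
  uint64_t expected = atomic_load(target);
  uint64_t desired;
  do
    {
      /* Recompute from the freshly observed value on every retry;
         on failure the compare-exchange stores the current value
         into 'expected'. */
      desired = pack(unpack(expected) * factor);
    }
  while (!atomic_compare_exchange_weak(target, &expected, desired));
}
```

With two separate per-component atomics, a thread could combine a fresh real part with a stale imaginary part; the single compare-exchange above either replaces both or retries with both re-read.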