reduction-cplx-dbl.c produces wrong gang-reduction result

burnus at gcc dot gnu.org via Gcc-bugs Thu, 07 Aug 2025 03:55:50 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121416


--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Andrew Stubbs from comment #2)
> Let's look at why the atomic instructions that exist aren't working for us,
> before we try to use the big dumb hammer fix (and does that solution
> *really* work, if we don't understand the cache architecture properly?)

I think the question is whether the OpenACC code is correct or not - or to find
a non-OpenACC code which also shows the issue and should be correct.

I find the OpenACC-generated code too convoluted to really see whether it is
valid or not.

* * *

OpenACC: In any case, for the gang code, I get (cf. comment 0) with OpenACC:

sum:
104.833984 + i 109.667969
16.898438 + i 17.796875

prod:
698317287061244416.000000 + i -950434920224383616.000000
29.758044 + i 6.070528

That is: Both 'sum' and 'prod' are wrong.

* * *

While with OpenMP, the following Fortran code produces the correct result:

!$omp target teams distribute parallel do map(to: ary) map(tofrom: tsum, tprod)
    do ix = 1, N
!$omp atomic update
        tsum = tsum + ary(ix)
!$omp atomic update
        tprod = tprod * ary(ix)
    end do

but this code uses - again -

  GOMP_atomic_start
  GOMP_atomic_end

although I wonder whether __atomic_compare_exchange_16 shouldn't have
handled this atomically? (This is available via libatomic for Nvptx and also on
the host. I think for GCN, it is not, but I might be wrong; GCN has no
libatomic, but I am not 100% sure that it doesn't handle it intrinsically.)
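
To illustrate what a whole-value compare-exchange would give us, here is a
minimal sketch in plain C11. It is not what GCC generates. As an assumption to
keep it portable, it uses single-precision complex so that the value fits into
64 bits; the double-precision case from this PR would need the 16-byte
compare-exchange discussed above. The helper names (pack, unpack,
atomic_cmul_f) are made up for this example.

```c
#include <complex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Assumption of this sketch: a float complex occupies exactly 8 bytes. */
_Static_assert(sizeof(float complex) == sizeof(uint64_t),
               "float complex must fit into 64 bits");

/* Hypothetical helper: reinterpret a float complex as its 64-bit pattern. */
static uint64_t pack(float complex z)
{
    uint64_t bits;
    memcpy(&bits, &z, sizeof bits);
    return bits;
}

/* Hypothetical helper: the inverse of pack(). */
static float complex unpack(uint64_t bits)
{
    float complex z;
    memcpy(&z, &bits, sizeof z);
    return z;
}

/* Multiply the complex value stored in *target by factor.  Real and
   imaginary parts are replaced together by one compare-exchange, so no
   other thread can observe or modify a half-updated value.  */
static void atomic_cmul_f(_Atomic uint64_t *target, float complex factor)
{
    uint64_t old_bits = atomic_load(target);
    uint64_t new_bits;
    do {
        /* On failure, old_bits has been refreshed with the current
           contents, so the product is recomputed from both fresh parts.  */
        new_bits = pack(unpack(old_bits) * factor);
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));
}
```

For example, starting from pack(1.0f + 2.0f * I) and calling atomic_cmul_f
with 3.0f + 4.0f * I leaves -5 + i 10 in the target.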

* * *

For the sum, one can also put it into two atomics:

#pragma omp target teams \
            distribute parallel for \
            map(to: ary) map(tofrom: tsum, tprod)
    for (int ix = 0; ix < N; ix++)
      {
        #pragma omp atomic update
          __real__ tsum = __real__ tsum + __real__ ary[ix];

        #pragma omp atomic update
          __imag__ tsum = __imag__ tsum + __imag__ ary[ix];
      }

and doing so produces in OpenMP

sum:
104.833984 + i 109.667969
104.833984 + i 109.667969

and uses
    flat_atomic_cmpswap_X2

This works as RE and IM are completely independent. It also shows that there is
no generic issue with atomics.

For multiplication, real and imaginary parts get mixed. Recall that for complex
variables A and B, 'A * B' is:

  (Re A * Re B - Im A * Im B)  +  i*(Re A * Im B + Im A * Re B)

Thus, I can do an 'atomic update' for them, but as soon as one succeeds and the
other fails, I am doomed!
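
A small host-only C sketch of that failure mode, with made-up function names:
cmul_by_parts spells out the product formula, while cmul_torn models the case
where the update of the real part has already succeeded and the update of the
imaginary part is retried, so it reads the new real part instead of the old
one. It only illustrates the arithmetic; it makes no claim about what the
generated OpenACC code does.

```c
#include <complex.h>

/* Complex product written out per component; both parts are computed
   from the old value of a.  */
static double complex cmul_by_parts(double complex a, double complex b)
{
    double re = __real__ a * __real__ b - __imag__ a * __imag__ b;
    double im = __real__ a * __imag__ b + __imag__ a * __real__ b;
    return re + im * I;
}

/* Hypothetical torn update: the real part was stored first, and the
   imaginary part is then computed from that already-updated real part.  */
static double complex cmul_torn(double complex a, double complex b)
{
    double re = __real__ a * __real__ b - __imag__ a * __imag__ b;
    /* 're' is used here where the old real part of a belongs.  */
    double im = re * __imag__ b + __imag__ a * __real__ b;
    return re + im * I;
}
```

With A = 1 + i 2 and B = 3 + i 4, cmul_by_parts yields -5 + i 10, whereas
cmul_torn yields -5 + i -14: the real part is right, the imaginary part is not.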

I have no idea how the generated OpenACC code handles this - nor why 'sum' is
also wrong with OpenACC.

* * *

Note that I used Fortran above because for C/C++, a complex 'omp atomic update'
is rejected with:

   error: invalid expression type for ‘#pragma omp atomic’

while Clang accepts it; I also do not see anything wrong with using complex
numbers, and in gfortran it works (using the GOMP_atomic_start/GOMP_atomic_end
workaround).

[Bug target/121416] [gcn][MI300][CDNA3] libgomp.oacc-c-c++-common/reduction-cplx-dbl.c produces wrong gang-reduction result
