https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95397

--- Comment #10 from Kirill Chilikin <chilikin.k at gmail dot com> ---
I am very sorry; please ignore my comments above, except for the new testcase.
The explanation there is completely wrong. Of course, the threads are executed
in parallel, and exiting after one iteration is correct for short loops. The
actual explanation is below:

The first loop is

!$ACC LOOP VECTOR
DO I = 1, 32
  B(I) = I
ENDDO

Its disassembled code, annotated with the assignments that happen (everything
was checked in the debugger), is:

=> 0x00007fffd7279510 <+2576>:  BSSY B0,0x7fffd7279610
   0x00007fffd7279520 <+2592>:  HFMA2.MMA R9,-RZ,RZ,0,1.966953277587890625e-06
   0x00007fffd7279530 <+2608>:  ISETP.GE.AND P0,PT,R0,0x20,PT
   0x00007fffd7279540 <+2624>:  @P0 BRA 0x7fffd7279600
   # R5 = %tid.x + 1
   0x00007fffd7279550 <+2640>:  IADD3 R5,R0.reuse,0x1,RZ
   # R0 = %tid.x + 32
   0x00007fffd7279560 <+2656>:  IADD3 R0,R0,c[0x0][0x0],RZ
   # R7 = (float)(%tid.x + 1)
   0x00007fffd7279570 <+2672>:  I2F R7,R5
   # R3 = %tid.x
   0x00007fffd7279580 <+2688>:  IADD3 R3,P0,R5.reuse,-0x1,RZ
   # R28 = 0xdffffd20
   # R29 = 0x7fff
   0x00007fffd7279590 <+2704>:  ST.E [R28.64+0x4],R5
   # R4 = 0
   0x00007fffd72795a0 <+2720>:  LEA.HI.X.SX32 R4,R5,0xffffffff,0x1,P0
   # R2 = 0xdffffd20 + %tid.x * 4
   0x00007fffd72795b0 <+2736>:  LEA R2,P0,R3,R28,0x2
   # R3 = 0x7fff
   0x00007fffd72795c0 <+2752>:  LEA.HI.X R3,R3,R29,R4,0x2,P0
   # P0 = 0x1
   0x00007fffd72795d0 <+2768>:  ISETP.GE.AND P0,PT,R0,0x20,PT
   # R7 is stored here. But see the memory examination below...
   0x00007fffd72795e0 <+2784>:  ST.E [R2.64+0xc],R7
   0x00007fffd72795f0 <+2800>:  @!P0 BRA 0x7fffd7279550

The memory after the R7 store looks like (for thread N, the store address is
0x7fffdffffd20 + 4*N + 0xc, i.e. exactly the addresses examined below):

(cuda-gdb) cuda thread 0
(cuda-gdb) p *(float*)(0x7fffdffffd20+0xc)
$12 = 1
(cuda-gdb) p *(float*)(0x7fffdffffd24+0xc)
$13 = 0
(cuda-gdb) p *(float*)(0x7fffdffffd28+0xc)
$14 = 0
(cuda-gdb) cuda thread 1
(cuda-gdb) p *(float*)(0x7fffdffffd20+0xc)
$15 = 0
(cuda-gdb) p *(float*)(0x7fffdffffd24+0xc)
$16 = 2
(cuda-gdb) p *(float*)(0x7fffdffffd28+0xc)
$17 = 0
(cuda-gdb) cuda thread 2
(cuda-gdb) p *(float*)(0x7fffdffffd20+0xc)
$18 = 0
(cuda-gdb) p *(float*)(0x7fffdffffd24+0xc)
$19 = 0
(cuda-gdb) p *(float*)(0x7fffdffffd28+0xc)
$20 = 3

The values are correctly set, but each thread sees only its own element; the
elements written by the other threads read as zero.
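
For comparison, the per-thread view above is what one would expect if each
thread had its own private, zero-initialized copy of B instead of one copy
shared across the vector lanes. A minimal CUDA sketch of that hypothetical
situation (an illustration only, not the code GCC actually generates; all
names are made up):

#include <cstdio>

__global__ void private_b(float *a)
{
    /* Hypothetical: B ends up in thread-private (local) memory,
       so every thread has its own zero-initialized copy. */
    float b[32] = {0.0f};

    int i = threadIdx.x;  /* one iteration per vector lane, i = 0..31 */
    b[i] = i + 1;         /* first loop: B(I) = I */

    /* Each thread sees only its own store: b[i] == i + 1, while
       b[j] == 0 for every j != i, matching the dump above. */
    a[i] = b[i];
}

int main()
{
    float *a = nullptr;
    cudaMallocManaged(&a, 32 * sizeof(float));
    private_b<<<1, 32>>>(a);
    cudaDeviceSynchronize();
    std::printf("a[0]=%g a[1]=%g a[31]=%g\n", a[0], a[1], a[31]);
    cudaFree(a);
    return 0;
}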

The second loop is

!$ACC LOOP VECTOR
DO I = 1, 32
  A(I) = B(I)
ENDDO

Its disassembled code with explanations:

=> 0x00007fffd72797c0 <+3264>:  BSSY B0,0x7fffd7279900
   0x00007fffd72797d0 <+3280>:  HFMA2.MMA R11,-RZ,RZ,0,1.966953277587890625e-06
   0x00007fffd72797e0 <+3296>:  ISETP.GE.AND P0,PT,R0,0x20,PT
   0x00007fffd72797f0 <+3312>:  @P0 BRA 0x7fffd72798f0
   # R28 = 0xdffffd20
   # R29 = 0x7fff
   # R2 = 0xb3600000
   0x00007fffd7279800 <+3328>:  LD.E.64 R2,[R28.64+0x90]
   # R9 = %tid.x + 1
   0x00007fffd7279810 <+3344>:  IADD3 R9,R0,0x1,RZ
   # R7 = %tid.x
   0x00007fffd7279820 <+3360>:  IADD3 R7,P0,R9.reuse,-0x1,RZ
   0x00007fffd7279830 <+3376>:  ST.E [R28.64+0x8],R9
   # R8 = 0
   0x00007fffd7279840 <+3392>:  LEA.HI.X.SX32 R8,R9,0xffffffff,0x1,P0
   # R4 = 0xdffffd20 + %tid.x * 4
   0x00007fffd7279850 <+3408>:  LEA R4,P0,R7,R28,0x2
   # R5 = 0x7fff
   0x00007fffd7279860 <+3424>:  LEA.HI.X R5,R7,R29,R8,0x2,P0
   # Now it should read back the value stored from R7 into R5.
   # Observed: R5 = 1.0 for thread 0, and 0 for the other threads.
   # From examining memory, the values being loaded are already corrupted
   # before this instruction. See the memory examination below.
   0x00007fffd7279870 <+3440>:  LD.E R5,[R4.64+0xc]
   0x00007fffd7279880 <+3456>:  LD.E.64 R2,[R2.64]
   0x00007fffd7279890 <+3472>:  IADD3 R0,R0,c[0x0][0x0],RZ
   0x00007fffd72798a0 <+3488>:  LEA R6,P0,R7,R2,0x2
   0x00007fffd72798b0 <+3504>:  LEA.HI.X R7,R7,R3,R8,0x2,P0
   0x00007fffd72798c0 <+3520>:  ST.E [R6.64],R5
   0x00007fffd72798d0 <+3536>:  ISETP.GE.AND P0,PT,R0,0x20,PT
   0x00007fffd72798e0 <+3552>:  @!P0 BRA 0x7fffd7279800

The memory before the R5 load looks like:

(cuda-gdb) p *(float*)(0x7fffdffffd20+0xc)
$22 = 1
(cuda-gdb) p *(float*)(0x7fffdffffd24+0xc)
$23 = 0
(cuda-gdb) p *(float*)(0x7fffdffffd28+0xc)
$24 = 0

This result is the same regardless of the current thread: all threads see the
same values, namely those stored by thread 0.

I do not have sufficient knowledge to debug this further, but it looks like
some memory synchronization should be added. It is also very likely that the
result with the current code depends on the particular GPU architecture, which
would explain why it could not be reproduced elsewhere.
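
For reference, hand-written CUDA for this producer/consumer pattern would keep
B in memory that is visible to all vector lanes and put a barrier between the
two loops. A minimal sketch of the kind of synchronization that seems to be
missing (assuming __shared__ storage and __syncthreads() as the barrier; the
nvptx back end would need the equivalent at the PTX level):

#include <cstdio>

__global__ void two_loops(float *a)
{
    /* One copy of B, visible to all 32 threads of the block. */
    __shared__ float b[32];

    int i = threadIdx.x;  /* one iteration per vector lane */
    b[i] = i + 1;         /* first loop:  B(I) = I */

    /* Without this barrier, another lane's store to b[] is not
       guaranteed to be visible yet, and the outcome can vary
       between GPU architectures. */
    __syncthreads();

    a[i] = b[i];          /* second loop: A(I) = B(I) */
}

int main()
{
    float *a = nullptr;
    cudaMallocManaged(&a, 32 * sizeof(float));
    two_loops<<<1, 32>>>(a);
    cudaDeviceSynchronize();
    for (int i = 0; i < 32; ++i)
        std::printf("%g ", a[i]);  /* expected: 1 2 ... 32 */
    std::printf("\n");
    cudaFree(a);
    return 0;
}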
