https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95397
--- Comment #11 from Kirill Chilikin <chilikin.k at gmail dot com> ---
Some further debugging...
After the vector-loop fork, there is a loop that looks like this:
mov.u64 %r85, %frame;
mov.u32 %r86, 35;
$L11:
add.u32 %r86, %r86, -1;
ld.u64 %r84, [%r85];
mov.b64 {%r88,%r89}, %r84;
shfl.sync.idx.b32 %r88, %r88, 0, 31, 0xffffffff;
shfl.sync.idx.b32 %r89, %r89, 0, 31, 0xffffffff;
mov.b64 %r84, {%r88,%r89};
st.u64 [%r85], %r84;
setp.ne.u32 %r87, %r86, 0;
add.u64 %r85, %r85, 8;
@%r87 bra.uni $L11;
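To make one iteration of this loop easier to follow, here is a minimal host-side sketch in Python. It is only a model, not what the backend emits: the helper names (split_b64, join_b64, shfl_idx, propagate_word) are hypothetical, and the warp is represented as a plain list with one 64-bit word per lane. It assumes that the shfl.sync.idx.b32 operands 0, 31, 0xffffffff select source lane 0 across the full warp.

```python
MASK32 = 0xFFFFFFFF
WARP_SIZE = 32


def split_b64(value):
    """Model of `mov.b64 {lo,hi}, value`: unpack into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join_b64(lo, hi):
    """Model of `mov.b64 value, {lo,hi}`: pack two 32-bit halves back together."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def shfl_idx(lane_values, src_lane=0):
    """Model of `shfl.sync.idx.b32` with a full member mask:
    every lane receives the value held by src_lane."""
    return [lane_values[src_lane]] * len(lane_values)


def propagate_word(words):
    """One loop iteration. words[t] is the 64-bit word lane t loaded from
    its own copy of the frame; the result is what each lane stores back."""
    halves = [split_b64(w) for w in words]
    lo = shfl_idx([h[0] for h in halves])
    hi = shfl_idx([h[1] for h in halves])
    return [join_b64(l, h) for l, h in zip(lo, hi)]
```

In this model, whatever lanes 1..31 loaded is discarded: after propagate_word, every lane holds lane 0's word.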
At the beginning, %frame = 0x7fffdffffca0.
The stored vector B starts at 0x7fffdffffcb0. However, since the memory
is not synchronized across threads, each thread sees different contents:
(cuda-gdb) p ((double*)(0x7fffdffffcb0))[0]@10
$26 = {1, 0, 0, 0, 0, 0, 0, 0, 0, 0}
from thread 0,
(cuda-gdb) p ((double*)(0x7fffdffffcb0))[0]@10
$27 = {0, 2, 0, 0, 0, 0, 0, 0, 0, 0}
from thread 1, etc. All threads execute this loop in parallel, but it walks
through the frame sequentially, one 64-bit word per iteration. On the third
iteration, it reaches the start of the stored vector
(%frame + 16 = 0x7fffdffffcb0). When ld.u64 is executed, the thread-dependent
values listed above are loaded. The shfl.sync.idx.b32 instructions with these
specific arguments copy the value from the first lane to all lanes. For vector
element 1, this works just fine, but for all other elements, zeros are loaded,
because the first lane holds zeros there. Then, when st.u64 is executed, these
values are stored back into memory, overwriting the elements held by the other
threads and resulting in memory corruption.
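The effect on the whole frame can be reproduced with a small host-side model in Python. This is a sketch under stated assumptions, not the actual device behaviour: each lane gets its own private copy of the 35-word frame, lane t holds B(t+1) = t+1 as the cuda-gdb output suggests, and the vector length of 10 is only the number of elements printed above. The names make_frames and propagate are hypothetical.

```python
WARP_SIZE = 32
FRAME_WORDS = 35                      # trip count of the loop (mov.u32 %r86, 35)
FRAME_BASE = 0x7fffdffffca0           # value of %frame reported above
VEC_BASE = 0x7fffdffffcb0             # start of the stored vector B
VEC_OFFSET = (VEC_BASE - FRAME_BASE) // 8   # word index 2, i.e. the third iteration
VEC_LEN = 10                          # assumption: elements shown by cuda-gdb


def make_frames():
    """Per-lane private frames: lane t only holds its own element of B."""
    frames = [[0.0] * FRAME_WORDS for _ in range(WARP_SIZE)]
    for lane in range(VEC_LEN):
        frames[lane][VEC_OFFSET + lane] = float(lane + 1)
    return frames


def propagate(frames):
    """Model of the loop: for each word, broadcast lane 0's value to all
    lanes and store it back into every lane's frame."""
    for word in range(FRAME_WORDS):
        value = frames[0][word]       # ld.u64 + shfl from lane 0
        for lane in range(WARP_SIZE):
            frames[lane][word] = value    # st.u64 in every lane
    return frames


frames = make_frames()
before = frames[1][VEC_OFFSET:VEC_OFFSET + VEC_LEN]   # lane 1 holds B(2) = 2
after = propagate(frames)[1][VEC_OFFSET:VEC_OFFSET + VEC_LEN]
print(before)
print(after)
```

In this model, lane 1's element (and every other lane's) is replaced by lane 0's zeros, and only the first element survives, which matches the corruption described above.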
This loop is generated in nvptx_propagate() from gcc/config/nvptx/nvptx.cc.
For the original example, the value assignment
if (i .eq. 5) then
c = a + b
end if
happens in thread 4. Similarly, the values in the other threads are 0,
and they stay that way until the device code exits.
This does not seem to be a Fortran issue.