https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104893
Bug ID: 104893
Summary: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -msoft-stack
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vries at gcc dot gnu.org
Target Milestone: ---
We use -msoft-stack for OpenMP programs. The documentation describes it as:
...
'-msoft-stack'
Generate code that does not use '.local' memory directly for stack
storage. Instead, a per-warp stack pointer is maintained
explicitly. This enables variable-length stack allocation (with
variable-length arrays or 'alloca'), and when global memory is used
for underlying storage, makes it possible to access automatic
variables from other threads, or with atomic instructions.
...
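For illustration, a minimal sketch of the kind of OpenMP code this is meant
to support. The function name sum_squares is made up for this example and is
not taken from GCC or its testsuite. The target region allocates a
variable-length array, and the inner parallel loop then reads that automatic
variable from other threads. Compiled without -fopenmp, the pragmas are
ignored and the function simply runs serially on the host.

```c
#include <assert.h>

/* Hypothetical example, not taken from GCC or its testsuite.  */
int
sum_squares (int n)
{
  int total = 0;

  assert (n > 0);

#pragma omp target map(tofrom: total)
  {
    /* Variable-length array: the size is only known at run time.  */
    int buf[n];

    for (int i = 0; i < n; i++)
      buf[i] = i * i;

    /* 'buf' is an automatic variable of the thread that entered the
       target region, but every thread of the team reads from it.  */
#pragma omp parallel for reduction(+: total)
    for (int i = 0; i < n; i++)
      total += buf[i];
  }

  return total;
}
```

Both the run-time sized allocation and the cross-thread access to 'buf' are
the cases the quoted text names as reasons for using the soft stack.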
Starting with sm_70, we have Independent Thread Scheduling: "the GPU maintains
execution state per thread, including a program counter and call stack".
For .local memory, the CUDA driver takes care of the per-thread call stack.
For the 'soft stack' that's not the case: the stack pointer is still maintained
per warp. So the threads of a warp, which are no longer guaranteed to execute
in lockstep, can read and write a stack address that is meant to be
thread-private, but which in reality is shared between all threads in the warp.
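
To make the hazard concrete, here is a small host-side model. It is plain C,
not nvptx code, and all names in it are hypothetical. It assumes that the
lanes of a warp all execute the same spill/reload sequence through one shared
soft-stack slot. Each schedule entry lets one lane execute one instruction.
When the lanes advance together, every reload sees the value that was just
spilled. When one lane runs ahead, it overwrites the slot before the lagging
lane has reloaded the old value.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 2, STEPS = 4 };

struct lane { int pc; int r1; int r2; };
struct warp { int slot; struct lane lanes[LANES]; };

/* Execute one instruction of lane ID.  The modelled program is:
   0: spill 1;  1: r1 = reload;  2: spill 2;  3: r2 = reload.  */
static void
step (struct warp *w, int id)
{
  struct lane *l = &w->lanes[id];

  assert (l->pc < STEPS);
  switch (l->pc++)
    {
    case 0: w->slot = 1; break;
    case 1: l->r1 = w->slot; break;
    case 2: w->slot = 2; break;
    case 3: l->r2 = w->slot; break;
    }
}

/* Run SCHEDULE and return the number of lanes that reloaded a value
   other than the one they had spilled just before.  */
int
count_corrupted_lanes (const int *schedule, size_t len)
{
  struct warp w = { 0 };
  int corrupted = 0;

  for (size_t i = 0; i < len; i++)
    {
      assert (schedule[i] >= 0 && schedule[i] < LANES);
      step (&w, schedule[i]);
    }

  for (int id = 0; id < LANES; id++)
    if (w.lanes[id].r1 != 1 || w.lanes[id].r2 != 2)
      corrupted++;

  return corrupted;
}

/* Lanes advance together, one instruction at a time.  */
const int lockstep_schedule[] = { 0, 1, 0, 1, 0, 1, 0, 1 };

/* Lane 0 runs ahead and spills 2 before lane 1 has reloaded 1.  */
const int independent_schedule[] = { 0, 1, 0, 0, 1, 1, 0, 1 };
```

With lockstep_schedule no lane is corrupted. With independent_schedule lane 1
reloads 2 where it expected 1. This only models the interleaving; it says
nothing about how often the hardware actually picks such a schedule.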