[llvm-branch-commits] [llvm] [AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (PR #161638)

via llvm-branch-commits Thu, 02 Oct 2025 01:34:44 -0700

llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

<details>
<summary>Changes</summary>

They were previously optimized to not emit any waitcnt, which is technically 
correct because there is no reordering of operations at workgroup scope in CU 
mode for GFX10+.

This breaks transitivity however, for example if we have the following sequence 
of events in one thread:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it cannot 
release them at the wider scope. Our release fences aren't strong enough to 
"wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for releasing 
other wave's stores because only GFX12 can do that with `global_wb`. GFX10-11 
do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with both 
acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization has to 
be paid either at the acq, or the rel. Strengthening the releases would just 
make the memory model more complex but wouldn't help performance.

So the choice here is to streamline the code sequences by making  CU and WGP 
mode emit identical code for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the release 
fence in CU mode is strong enough, it is no longer needed.

Supersedes #<!-- -->160501
Solves SC1-6454

---

Patch is 428.16 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/161638.diff


16 Files Affected:

- (modified) llvm/docs/AMDGPUUsage.rst (+3-58) 
- (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+14-37) 
- (modified) 
llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+24-6) 
- (modified) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (-1) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+8-12) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll 
(+48) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+48-9) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+8-3) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll 
(+601-185) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll 
(+8-3) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll 
(+540-92) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+240-90) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll 
(+240-90) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll 
(+240-90) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+8-3) 
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll 
(+240-90) 


``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 74b7604fda56d..cba86b3d5447e 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -13229,9 +13229,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      store atomic release      - workgroup    - global   1. s_waitcnt 
lgkmcnt(0) &
                                               - generic     vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split 
into
@@ -13277,8 +13274,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          2. 
buffer/global/flat_store
      store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) 
& vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split 
into
                                                              separate s_waitcnt
@@ -13366,9 +13361,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      atomicrmw    release      - workgroup    - global   1. s_waitcnt 
lgkmcnt(0) &
                                               - generic     vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit 
lgkmcnt(0).
                                                            - Could be split 
into
                                                              separate s_waitcnt
@@ -13413,8 +13405,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          2. 
buffer/global/flat_atomic
      atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) 
& vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split 
into
                                                              separate s_waitcnt
@@ -13498,9 +13488,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      fence        release      - workgroup    *none*     1. s_waitcnt 
lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
@@ -13627,9 +13614,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt 
lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen after
@@ -13681,8 +13665,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          2. 
buffer/global_atomic
                                                          3. s_waitcnt 
vm/vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit.
                                                            - Use vmcnt(0) if 
atomic with
                                                              return and 
vscnt(0) if
                                                              atomic with 
no-return.
@@ -13707,8 +13689,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
 
      atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) 
& vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split 
into
                                                              separate s_waitcnt
@@ -13768,9 +13748,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt 
lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit 
lgkmcnt(0).
                                                            - Could be split 
into
                                                              separate s_waitcnt
@@ -13816,9 +13793,9 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          3. s_waitcnt 
lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
+                                                           - If atomic with 
return, omit
+                                                             vscnt(0), if 
atomic with
+                                                             no-return, omit 
vmcnt(0).
                                                            - If OpenCL, omit 
lgkmcnt(0).
                                                            - Must happen before
                                                              the following
@@ -13991,9 +13968,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      fence        acq_rel      - workgroup    *none*     1. s_waitcnt 
lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
@@ -14223,9 +14197,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt 
lgkmcnt(0) &
                                               - generic     vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit 
vmcnt(0) and
-                                                             vscnt(0).
                                                            - Could be split 
into
                                                              separate s_waitcnt
                                                              vmcnt(0), 
s_waitcnt
@@ -14334,8 +14305,6 @@ table 
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
 
                                                          1. s_waitcnt vmcnt(0) 
& vscnt(0)
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit.
                                                            - Could be split 
into
                                                              separate s_waitcnt
                                                              vmcnt(0) and 
s_waitcnt
@@ -15337,8 +15306,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
``s_wait_dscnt 0x0``.
                                                            - The waits can be
@@ -15384,8 +15351,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit.
                                                            - The waits can be
@@ -15479,8 +15444,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
``s_wait_dscnt 0x0``.
                                                            - If OpenCL and CU 
wavefront
@@ -15530,8 +15493,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
all.
                                                            - The waits can be
@@ -15623,8 +15584,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
``s_wait_dscnt 0x0``.
                                                            - If OpenCL and
@@ -15754,8 +15713,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
``s_wait_dscnt 0x0``.
                                                            - Must happen after
@@ -15812,8 +15769,6 @@ the instruction in the code sequence that references 
the table.
                                                             | **Atomic without 
return:**
                                                             | 
``s_wait_storecnt 0x0``
 
-                                                           - If CU wavefront 
execution
-                                                             mode, omit.
                                                            - Must happen before
                                                              the following
                                                              ``global_inv``.
@@ -15838,8 +15793,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit.
                                                            - The waits can be
@@ -15901,8 +15854,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
``s_wait_loadcnt 0x0``.
                                                            - The waits can be
@@ -16154,8 +16105,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL and
                                                              address space is
@@ -16384,8 +16333,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit
                                                              ``s_wait_dscnt 
0x0``
@@ -16492,8 +16439,6 @@ the instruction in the code sequence that references 
the table.
                                                             | 
``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 
0x0``
                                                             | ``s_wait_dscnt 
0x0``
-                                                            | **CU wavefront 
execution mode:**
-                                                            | ``s_wait_dscnt 
0x0``
 
                                                            - If OpenCL, omit 
all.
                                                            - The waits can be
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp 
b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 362ef140a28f8..29c326893539b 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -390,12 +390,6 @@ class SICacheControl {
                              bool IsCrossAddrSpaceOrdering,
                              Position Pos) const = 0;
 
-  /// Inserts any necessary instructions before the barrier start instruction
-  /// \p MI in order to support pairing of barriers and fences.
-  virtual bool insertBarrierStart(MachineBasicBlock::iterator &MI) const {
-    return false;
-  };
-
   /// Virtual destructor to allow derivations to be deleted.
   virtual ~SICacheControl() = default;
 };
@@ -576,12 +570,8 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
                   bool IsCrossAddrSpaceOrdering, Position Pos,
                   AtomicOrdering Order, bool AtomicsOnly) const override;
 
-  bool insertAcquire(MachineBasicBlock::iterator &MI,
-                     SIAtomicScope Scope,
-                     SIAtomicAddrSpace AddrSpace,
-                     Position Pos) const override;
-
-  bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+  bool insertAcquire(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+                     SIAtomicAddrSpace AddrSpace, Position Pos) const override;
 };
 
 class SIGfx11CacheControl : public SIGfx10CacheControl {
@@ -2046,8 +2036,11 @@ bool 
SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       // the WGP. Therefore need to wait for operations to complete to ensure
       // they are visible to waves in the other CU as the L0 is per CU.
       // Otherwise in CU mode and all waves of a work-group are on the same CU
-      // which shares the same L0.
-      if (!ST.isCuModeEnabled()) {
+      // which shares the same L0. Note that we still need to wait when
+      // performing a release in this mode to respect the transitivity of
+      // happens-before, e.g. other waves of the workgroup must be able to
+      // release the memory from another wave at a wider scope.
+      if (!ST.isCuModeEnabled() || isReleaseOrStronger(Order)) {
         if ((Op & SIMemOp::LOAD) != SIMemOp::NONE)
           VMCnt |= true;
         if ((Op & SIMemOp::STORE) != SIMemOp::NONE)
@@ -2202,22 +2195,6 @@ bool 
SIGfx10CacheControl::insertAcquire(MachineBasicBlock::i...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/161638
_______________________________________________
llvm-branch-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits

[llvm-branch-commits] [llvm] [AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (PR #161638)

Reply via email to