llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-backend-amdgpu
Author: Pierre van Houtryve (Pierre-vh)
<details>
<summary>Changes</summary>
They were previously optimized to not emit any waitcnt, which is technically
correct because there is no reordering of operations at workgroup scope in CU
mode for GFX10+.
This breaks transitivity however, for example if we have the following sequence
of events in one thread:
- some stores
- store atomic release syncscope("workgroup")
- barrier
then another thread follows with
- barrier
- load atomic acquire
- store atomic release syncscope("agent")
It does not work because, while the other thread sees the stores, it cannot
release them at the wider scope. Our release fences aren't strong enough to
"wait" on stores from other waves.
We also cannot strengthen our release fences any further to allow for releasing
other wave's stores because only GFX12 can do that with `global_wb`. GFX10-11
do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with both
acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization has to
be paid either at the acq, or the rel. Strengthening the releases would just
make the memory model more complex but wouldn't help performance.
So the choice here is to streamline the code sequences by making CU and WGP
mode emit identical code for release (or stronger) atomic ordering.
This also removes the `vm_vsrc(0)` wait before barriers. Now that the release
fence in CU mode is strong enough, it is no longer needed.
Supersedes #<!-- -->160501
Solves SC1-6454
---
Patch is 428.16 KiB, truncated to 20.00 KiB below, full version:
https://github.com/llvm/llvm-project/pull/161638.diff
16 Files Affected:
- (modified) llvm/docs/AMDGPUUsage.rst (+3-58)
- (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+14-37)
- (modified)
llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+24-6)
- (modified) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+8-12)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll
(+48)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+48-9)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+8-3)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll
(+601-185)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll
(+8-3)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll
(+540-92)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+240-90)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll
(+240-90)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll
(+240-90)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+8-3)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll
(+240-90)
``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 74b7604fda56d..cba86b3d5447e 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -13229,9 +13229,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
store atomic release - workgroup - global 1. s_waitcnt
lgkmcnt(0) &
- generic vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Could be split
into
@@ -13277,8 +13274,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
2.
buffer/global/flat_store
store atomic release - workgroup - local 1. s_waitcnt vmcnt(0)
& vscnt(0)
- - If CU wavefront
execution
- mode, omit.
- If OpenCL, omit.
- Could be split
into
separate s_waitcnt
@@ -13366,9 +13361,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw release - workgroup - global 1. s_waitcnt
lgkmcnt(0) &
- generic vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Could be split
into
separate s_waitcnt
@@ -13413,8 +13405,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
2.
buffer/global/flat_atomic
atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0)
& vscnt(0)
- - If CU wavefront
execution
- mode, omit.
- If OpenCL, omit.
- Could be split
into
separate s_waitcnt
@@ -13498,9 +13488,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
fence release - workgroup *none* 1. s_waitcnt
lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- If OpenCL and
address space is
not generic, omit
@@ -13627,9 +13614,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw acq_rel - workgroup - global 1. s_waitcnt
lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Must happen after
@@ -13681,8 +13665,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
2.
buffer/global_atomic
3. s_waitcnt
vm/vscnt(0)
- - If CU wavefront
execution
- mode, omit.
- Use vmcnt(0) if
atomic with
return and
vscnt(0) if
atomic with
no-return.
@@ -13707,8 +13689,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0)
& vscnt(0)
- - If CU wavefront
execution
- mode, omit.
- If OpenCL, omit.
- Could be split
into
separate s_waitcnt
@@ -13768,9 +13748,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt
lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Could be split
into
separate s_waitcnt
@@ -13816,9 +13793,9 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
3. s_waitcnt
lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
+ - If atomic with
return, omit
+ vscnt(0), if
atomic with
+ no-return, omit
vmcnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Must happen before
the following
@@ -13991,9 +13968,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
fence acq_rel - workgroup *none* 1. s_waitcnt
lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- If OpenCL and
address space is
not generic, omit
@@ -14223,9 +14197,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
load atomic seq_cst - workgroup - global 1. s_waitcnt
lgkmcnt(0) &
- generic vmcnt(0) & vscnt(0)
- - If CU wavefront
execution
- mode, omit
vmcnt(0) and
- vscnt(0).
- Could be split
into
separate s_waitcnt
vmcnt(0),
s_waitcnt
@@ -14334,8 +14305,6 @@ table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
1. s_waitcnt vmcnt(0)
& vscnt(0)
- - If CU wavefront
execution
- mode, omit.
- Could be split
into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
@@ -15337,8 +15306,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
``s_wait_dscnt 0x0``.
- The waits can be
@@ -15384,8 +15351,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit.
- The waits can be
@@ -15479,8 +15444,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
``s_wait_dscnt 0x0``.
- If OpenCL and CU
wavefront
@@ -15530,8 +15493,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
all.
- The waits can be
@@ -15623,8 +15584,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
``s_wait_dscnt 0x0``.
- If OpenCL and
@@ -15754,8 +15713,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
``s_wait_dscnt 0x0``.
- Must happen after
@@ -15812,8 +15769,6 @@ the instruction in the code sequence that references
the table.
| **Atomic without
return:**
|
``s_wait_storecnt 0x0``
- - If CU wavefront
execution
- mode, omit.
- Must happen before
the following
``global_inv``.
@@ -15838,8 +15793,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit.
- The waits can be
@@ -15901,8 +15854,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
``s_wait_loadcnt 0x0``.
- The waits can be
@@ -16154,8 +16105,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL and
address space is
@@ -16384,8 +16333,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
``s_wait_dscnt
0x0``
@@ -16492,8 +16439,6 @@ the instruction in the code sequence that references
the table.
|
``s_wait_storecnt 0x0``
| ``s_wait_loadcnt
0x0``
| ``s_wait_dscnt
0x0``
- | **CU wavefront
execution mode:**
- | ``s_wait_dscnt
0x0``
- If OpenCL, omit
all.
- The waits can be
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 362ef140a28f8..29c326893539b 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -390,12 +390,6 @@ class SICacheControl {
bool IsCrossAddrSpaceOrdering,
Position Pos) const = 0;
- /// Inserts any necessary instructions before the barrier start instruction
- /// \p MI in order to support pairing of barriers and fences.
- virtual bool insertBarrierStart(MachineBasicBlock::iterator &MI) const {
- return false;
- };
-
/// Virtual destructor to allow derivations to be deleted.
virtual ~SICacheControl() = default;
};
@@ -576,12 +570,8 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
bool IsCrossAddrSpaceOrdering, Position Pos,
AtomicOrdering Order, bool AtomicsOnly) const override;
- bool insertAcquire(MachineBasicBlock::iterator &MI,
- SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- Position Pos) const override;
-
- bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+ bool insertAcquire(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+ SIAtomicAddrSpace AddrSpace, Position Pos) const override;
};
class SIGfx11CacheControl : public SIGfx10CacheControl {
@@ -2046,8 +2036,11 @@ bool
SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
// the WGP. Therefore need to wait for operations to complete to ensure
// they are visible to waves in the other CU as the L0 is per CU.
// Otherwise in CU mode and all waves of a work-group are on the same CU
- // which shares the same L0.
- if (!ST.isCuModeEnabled()) {
+ // which shares the same L0. Note that we still need to wait when
+ // performing a release in this mode to respect the transitivity of
+ // happens-before, e.g. other waves of the workgroup must be able to
+ // release the memory from another wave at a wider scope.
+ if (!ST.isCuModeEnabled() || isReleaseOrStronger(Order)) {
if ((Op & SIMemOp::LOAD) != SIMemOp::NONE)
VMCnt |= true;
if ((Op & SIMemOp::STORE) != SIMemOp::NONE)
@@ -2202,22 +2195,6 @@ bool
SIGfx10CacheControl::insertAcquire(MachineBasicBlock::i...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/161638
_______________________________________________
llvm-branch-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits