[clang] [llvm] [AMDGPU] Track tensor load/store DMAs with asyncmark (PR #200775)

Alexander Weinrauch via cfe-commits Mon, 22 Jun 2026 01:45:11 -0700

AlexAUT wrote:

> Petition to remove
> 
> ```
> def __builtin_amdgcn_s_wait_asynccnt : AMDGPUBuiltin<"void(_Constant unsigned short)", [], "gfx1250-insts">;
> def __builtin_amdgcn_s_wait_tensorcnt : AMDGPUBuiltin<"void(_Constant unsigned short)", [], "gfx1250-insts">;
> ```
> 
> now? Any opposed? @krzysz00 @arsenm @ssahasra @shiltian Any folks not using 
> asyncmark already really should be.


Maybe I am wrong, but with asyncmarks/async_wait it seems we cannot specify 
which memory instruction type (i.e. which HW counter) to wait on? For example, 
in Triton we generate sequences like this for persistent kernels:
```asm
llvm.amdgcn.global.store.async.to.lds.b32
async_mark
llvm.amdgcn.tensor.load.to.lds
async_mark

async_wait -> just on the tensor_load?
```

Since the global async store can finish out of order with respect to the 
tensor_load, we want to sync the global async store at a much later stage. Is 
this possible with the current API? If not, Triton cannot use async_marks in 
their current form, and we will need a bitset on async_wait to select which 
memory types (tensor memory ops and/or global async memory ops) we want to 
sync.
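
To make the bitset idea concrete, here is a rough toy model in Python. Everything in it is hypothetical (`MemKind`, `Tracker`, and the `kinds` argument are illustration only, not an existing LLVM or Triton API), and it assumes each HW counter retires its own operations in order while the two counters are independent of each other. It only shows how a wait on a mark could be lowered to per-counter limits for the selected memory types.

```python
from enum import Flag, auto


class MemKind(Flag):
    """Hypothetical bitset of memory op types, one bit per HW counter."""
    ASYNC = auto()   # global async ops (asynccnt in the quoted builtins)
    TENSOR = auto()  # tensor load/store DMAs (tensorcnt)


class Tracker:
    """Toy model: counts issued ops per kind and snapshots them at each mark."""

    def __init__(self):
        self.issued = {MemKind.ASYNC: 0, MemKind.TENSOR: 0}
        self.marks = []  # per-mark snapshot of the issued counts

    def issue(self, kind):
        self.issued[kind] += 1

    def mark(self):
        self.marks.append(dict(self.issued))

    def wait(self, marks_outstanding, kinds):
        """Return {kind: max ops still allowed in flight} for the selected kinds.

        Only counters named in `kinds` get a limit; the others are left alone.
        """
        idx = len(self.marks) - 1 - marks_outstanding
        if idx < 0:
            return {}  # fewer marks exist than may stay outstanding: no wait
        snap = self.marks[idx]
        return {k: self.issued[k] - snap[k] for k in MemKind if k in kinds}


# The sequence from above: async store, mark, tensor load, mark.
t = Tracker()
t.issue(MemKind.ASYNC)
t.mark()
t.issue(MemKind.TENSOR)
t.mark()

# Wait for everything up to the latest mark, but only for tensor ops.
tensor_only = t.wait(0, MemKind.TENSOR)
# Much later: also drain the global async store.
both = t.wait(0, MemKind.ASYNC | MemKind.TENSOR)
```

In this model `tensor_only` contains a limit for the tensor counter only, so the global async store is free to stay in flight until the later wait that names both kinds.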

https://github.com/llvm/llvm-project/pull/200775