[clang] [llvm] [AMDGPU] Track tensor load/store DMAs with asyncmark (PR #200775)

Ryan Mitchell via cfe-commits Tue, 23 Jun 2026 06:10:08 -0700

RyanRio wrote:

> > Petition to remove
> > ```
> > def __builtin_amdgcn_s_wait_asynccnt : AMDGPUBuiltin<"void(_Constant unsigned short)", [], "gfx1250-insts">;
> > def __builtin_amdgcn_s_wait_tensorcnt : AMDGPUBuiltin<"void(_Constant unsigned short)", [], "gfx1250-insts">;
> > ```
> >
> > now? Any opposed? @krzysz00 @arsenm @ssahasra @shiltian Any folks not using 
> > asyncmark already really should be.
> 
> Maybe I am wrong but with asyncmarks/async_wait we are unable to specify the 
> memory instruction type (HW counter)? e.g. in Triton for persistent kernels 
> we generate sequences like:
> 
> ```assembly
> llvm.amdgcn.global.store.async.to.lds.b32
> async_mark
> llvm.amdgcn.tensor.load.to.lds
> async_mark
> 
> async_wait -> just on the tensor_load?
> ```
> 
> Since the global_async_store can finish out of order with respect to the
> tensor_load, we want to sync the global async store at a much later stage.
> Is this possible with the current API? If not, then Triton cannot use
> async_marks in their current form, and we will require a bitset on the
> async_wait to define which memory types, tensor memory ops and/or global
> (async) memory ops, we want to sync.


Technically, if you use two marks, you can wait on just one of them and get 
the desired behavior, but the general consensus from @ssahasra, @nhaehnle, et 
al. is to provide an imm argument to make this more straightforward.
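
As a rough illustration of what a kind-selecting argument could mean, here is a toy Python model. It is only a sketch under assumptions: `AsyncMarkModel`, the `ASYNC`/`TENSOR` bits, and the `kinds` mask are hypothetical names, marks are assumed to retire oldest first, and none of this is the LLVM API or the final encoding of the imm argument.

```python
ASYNC = 0b01   # hypothetical bit: global async-to-LDS ops
TENSOR = 0b10  # hypothetical bit: tensor DMA ops


class AsyncMarkModel:
    """Toy model only; not the LLVM or hardware implementation."""

    def __init__(self):
        self.open_ops = []  # (name, kind) pairs issued since the last mark
        self.marks = []     # one list of ops per mark, oldest first

    def issue(self, name, kind):
        self.open_ops.append((name, kind))

    def mark(self):
        # Close the current group of ops under a new mark.
        self.marks.append(self.open_ops)
        self.open_ops = []

    def wait(self, n, kinds=ASYNC | TENSOR):
        """Names of ops to drain so that only the newest n marks may stay
        pending, limited to the op kinds selected by the hypothetical mask."""
        cutoff = max(len(self.marks) - n, 0)
        return [name for group in self.marks[:cutoff]
                for name, kind in group if kind & kinds]


m = AsyncMarkModel()
m.issue("global.store.async.to.lds.b32", ASYNC)
m.mark()
m.issue("tensor.load.to.lds", TENSOR)
m.mark()

print(m.wait(0))          # no mask: both ops are drained
print(m.wait(0, TENSOR))  # masked: only the tensor load is drained
```

In this model the mask lets the wait skip the global async store, which can then be synced later with a separate wait that selects `ASYNC`.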

https://github.com/llvm/llvm-project/pull/200775