[clang] [llvm] [mlir] [NVPTX] Unify and extend barrier{.cta} intrinsic support (PR #140615)

Durgadoss R via cfe-commits Tue, 20 May 2025 00:36:58 -0700

================
@@ -199,21 +199,58 @@ map in the following way to CUDA builtins:
 Barriers
 --------
 
-'``llvm.nvvm.barrier0``'
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+'``llvm.nvvm.barrier.cta.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
 """""""
 
 .. code-block:: llvm
 
-  declare void @llvm.nvvm.barrier0()
+  declare void @llvm.nvvm.barrier.cta.sync(i32 %id, i32 %n)
+  declare void @llvm.nvvm.barrier.cta.sync.all(i32 %id)
+  declare void @llvm.nvvm.barrier.cta.arrive(i32 %id, i32 %n)
+
+  declare void @llvm.nvvm.barrier.cta.sync.aligned(i32 %id, i32 %n)
+  declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 %id)
+  declare void @llvm.nvvm.barrier.cta.arrive.aligned(i32 %id, i32 %n)
 
 Overview:
 """""""""
 
-The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
-instruction, equivalent to the ``__syncthreads()`` call in CUDA.
+The '``@llvm.nvvm.barrier.cta.*``' family of intrinsics perform barrier
+synchronization and communication within a CTA. They can be used by the threads
+within the CTA for synchronization and communication.
+
+Semantics:
+""""""""""
+
+Operand %id specifies a logical barrier resource and must fall within the range
+0 through 15. When present, operand %n specifies the number of threads
+participating in the barrier. When specifying a thread count, the value must be
+a multiple of the warp size. With the '``@llvm.nvvm.barrier.cta.sync.*``'
+variants, the '``.all``' suffix indicates that all threads in the CTA should
+participate in the barrier and the %n operand is not present.
+
+All forms of the '``@llvm.nvvm.barrier.cta.*``' intrinsic cause the executing
+thread to wait for all non-exited threads from its warp and then marks the
+warp's arrival at the barrier. In addition to signaling its arrival at the 
+barrier, the '``@llvm.nvvm.barrier.cta.sync.*``' intrinsics cause the executing
+thread to wait for non-exited threads of all other warps participating in the
+barrier to arrive. On the other hand, the '``@llvm.nvvm.barrier.cta.arrive.*``'
+intrinsic does not cause the executing thread to wait for threads of other
+participating warps.
+
+When a barrier completes, the waiting threads are restarted without delay,
+and the barrier is reinitialized so that it can be immediately reused.
+
+The '``@llvm.nvvm.barrier.cta.*``' intrinsic has an optional '``.aligned``'
+modifier to indicate textual alignment of the barrier. When specified, it
+indicates that all threads in the CTA will execute the same
+'``@llvm.nvvm.barrier.cta.*``' instruction. In conditionally executed code, an
+aligned '``@llvm.nvvm.barrier.cta.*``' instruction should only be used if it is
+known that all threads in the CTA evaluate the condition identically, otherwise
+behavior is undefined.
----------------
durga4github wrote:


Shall we add a link to the PTX ISA here?

https://github.com/llvm/llvm-project/pull/140615
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] [llvm] [mlir] [NVPTX] Unify and extend barrier{.cta} intrinsic support (PR #140615)

Reply via email to