================
@@ -61,3 +61,111 @@ Predefined Macros
      - Defined if FP64 instruction is available (deprecated).
 
 Please note that the specific architecture and feature names will vary depending on the GPU. Also, some macros are deprecated and may be removed in future releases.
+
+AMDGPU Builtins
+===============
+
+Clang provides a set of builtins to access low-level, AMDGPU-specific hardware features directly from C, C++, OpenCL C, and HIP. These builtins often map directly to a single machine instruction.
+
+.. _builtin-amdgcn-ds-bpermute:
+
+``__builtin_amdgcn_ds_bpermute``
+--------------------------------
+
+Performs a backward (pull) permutation of values within a wavefront. This builtin compiles to the
+``ds_bpermute_b32`` instruction and implements a "read from lane" semantic using a **byte-based**
+address.
+
+**Syntax**
+
+.. code-block:: c++
+
+  T __builtin_amdgcn_ds_bpermute(int index, T src);
+
+**Summary**
+
+All active lanes in the current wavefront conceptually place their ``src`` payloads into an
+internal cross-lane buffer. Each lane then reads a 32-bit value from that buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return type.
+
+Availability
+------------
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+----------
+
+- ``index`` (``int``): Byte offset used to select the source lane. The hardware consumes only bits
+  ``[7:2]``. To read the 32-bit value from lane *i*, pass ``i * 4`` as the index.
+  An index that selects a lane outside the current wave size, or a lane that is inactive at the call
+  site, yields an unspecified value (commonly zero on current hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. This value is converted to a
+  32-bit payload, permuted, and then converted back to type ``T`` as described below.
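
The lane-selection rule described under ``index`` can be modeled on the host. The following is a minimal sketch, not the builtin itself: ``source_lane``, ``model_bpermute``, and ``rotate_demo`` are hypothetical names, and the model assumes a 64-lane wavefront and returns zero for reads from inactive lanes, although the text above only says that value is unspecified.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model; none of these names exist in Clang.
constexpr std::size_t kWaveSize = 64; // assumption: wave64

// Only bits [7:2] of the byte offset select the source lane.
std::size_t source_lane(std::uint32_t index) { return (index >> 2) & 0x3F; }

// Each active lane reads the payload contributed by its selected source lane.
std::array<std::uint32_t, kWaveSize>
model_bpermute(const std::array<std::uint32_t, kWaveSize> &index,
               const std::array<std::uint32_t, kWaveSize> &src,
               const std::array<bool, kWaveSize> &active) {
  std::array<std::uint32_t, kWaveSize> result{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    if (!active[lane])
      continue; // an inactive lane executes nothing; its slot stays 0 here
    std::size_t source = source_lane(index[lane]);
    assert(source < kWaveSize);
    // Assumption: model the unspecified inactive-lane read as zero.
    result[lane] = active[source] ? src[source] : 0u;
  }
  return result;
}

// Every lane pulls the payload of its right-hand neighbour (lane + 1),
// so the index for lane i is ((i + 1) % 64) * 4.
std::array<std::uint32_t, kWaveSize> rotate_demo() {
  std::array<std::uint32_t, kWaveSize> index{}, src{};
  std::array<bool, kWaveSize> active{};
  active.fill(true);
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    src[lane] = 100u + static_cast<std::uint32_t>(lane);
    index[lane] = static_cast<std::uint32_t>(((lane + 1) % kWaveSize) * 4);
  }
  return model_bpermute(index, src, active);
}
```

In this model, an index of ``4`` and an index of ``260`` both select lane 1, because bit 8 and above are ignored along with the two low bits.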
+
+Type ``T`` and Conversions
+--------------------------
+
+The instruction uses a 32-bit payload. The builtin accepts ``T`` only if it can map to/from 32 bits.
+
+Accepted ``T``:
+
+- Scalar integer and floating-point types
+- Vectors with total size ≤ 32 bits
+- Pointers with representation size ≤ 32 bits
+- C++ classes with a user-defined conversion to a supported 32-bit type
+Conversion rules:
+
+- Built-in types ≤ 32 bits (e.g., ``char``, ``short``, ``int``, ``uint32_t``, ``float``): bitcast to/from i32.
+- ``double``: convert to ``float`` before permutation; convert back to ``double`` after (may warn).
+- Long integers (> 32 bits, e.g., ``long long``, ``__int128``): convert to i32 before permutation; convert back after (may warn).
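
The practical effect of these conversion rules can be sketched on the host. The helpers below are hypothetical and do not call the builtin; they assume ordinary C++ conversion semantics (narrowing ``double`` to ``float``, keeping the low 32 bits of a wide integer and sign-extending on the way back), which the text above does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers; they only model the 32-bit payload round trip.

// 32-bit types: reinterpret the bits, no value change.
std::uint32_t payload_from_float(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof bits);
  return bits;
}

float float_from_payload(std::uint32_t bits) {
  float value;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}

// double: narrowed to float before the exchange, widened afterwards.
double round_trip_double(double value) {
  float narrowed = static_cast<float>(value);
  float restored = float_from_payload(payload_from_float(narrowed));
  return static_cast<double>(restored);
}

// long long: only the low 32 bits survive the exchange.
// Assumption: the result is sign-extended, as in an int -> long long cast.
long long round_trip_long_long(long long value) {
  std::uint32_t low = static_cast<std::uint32_t>(value);
  std::int32_t as_int;
  std::memcpy(&as_int, &low, sizeof as_int);
  return static_cast<long long>(as_int);
}
```

Under these assumptions a ``float`` survives the round trip exactly, a ``double`` such as ``0.1`` comes back with ``float`` precision, and a ``long long`` with bits set above bit 31 comes back truncated.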
----------------
yxsamliu wrote:

will do

https://github.com/llvm/llvm-project/pull/153501