================
@@ -61,3 +61,111 @@ Predefined Macros
- Defined if FP64 instruction is available (deprecated).
Please note that the specific architecture and feature names will vary
depending on the GPU. Also, some macros are deprecated and may be removed in
future releases.
+
+AMDGPU Builtins
+===============
+
+Clang provides a set of builtins to access low-level, AMDGPU-specific hardware features directly
+from C, C++, OpenCL C, and HIP. These builtins often map directly to a single machine instruction.
+
+.. _builtin-amdgcn-ds-bpermute:
+
+``__builtin_amdgcn_ds_bpermute``
+--------------------------------
+
+Performs a backward (pull) permutation of values within a wavefront. This builtin compiles to the
+``ds_bpermute_b32`` instruction and implements "read from lane" semantics using a **byte-based**
+address.
+
+**Syntax**
+
+.. code-block:: c++
+
+ T __builtin_amdgcn_ds_bpermute(int index, T src);
+
+**Summary**
+
+All active lanes in the current wavefront conceptually place their ``src`` payloads into an
+internal cross-lane buffer. Each lane then reads a 32-bit value from that buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return type.
+
+Availability
+------------
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+----------
+
+- ``index`` (``int``): Byte offset used to select the source lane. The hardware consumes only bits
+  ``[7:2]``. To read the 32-bit value from lane *i*, pass ``i * 4`` as the index.
+  Indices that select lanes outside the current wave size or lanes that are inactive at the call
+  site yield an unspecified value (commonly zero on current hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. This value is converted to a
+  32-bit payload, permuted, and then converted back to type ``T`` as described below.
+
+Type ``T`` and Conversions
+--------------------------
+
+The instruction uses a 32-bit payload. The builtin accepts ``T`` only if it can map to/from 32 bits.
+
+Accepted ``T``:
+
+- Scalar integer and floating-point types
+- Vectors with total size ≤ 32 bits
+- Pointers with representation size ≤ 32 bits
+- C++ classes with a user-defined conversion to a supported 32-bit type
+Conversion rules:
+
+- Built-in types of at most 32 bits (e.g., ``char``, ``short``, ``int``, ``uint32_t``, ``float``):
+  bitcast to/from i32.
+- ``double``: converted to ``float`` before permutation and back to ``double`` after (may warn).
+- Integers wider than 32 bits (e.g., ``long long``, ``__int128``): converted to i32 before
+  permutation and converted back after (may warn).
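
To make the byte-offset rule in the quoted text concrete, here is a minimal host-side C++ sketch of the exchange. It is only a CPU model under stated assumptions, not the builtin and not how Clang lowers it: a 64-lane wave is assumed, every lane is treated as active, and the names ``kWaveSize``, ``model_bpermute``, ``float_to_bits``, ``bits_to_float``, and ``rotate_by_one`` are hypothetical helpers introduced for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-side model only: this runs on the CPU and does not call the builtin.
// kWaveSize = 64 is an assumption (wave64); all names here are hypothetical.
constexpr std::size_t kWaveSize = 64;

using Payloads = std::array<std::uint32_t, kWaveSize>;
using Offsets = std::array<std::int32_t, kWaveSize>;

// Models one exchange with every lane active: lane i supplies index[i] (a byte
// offset) and src[i] (its 32-bit payload), and receives the payload of the
// lane selected by bits [7:2] of index[i].
Payloads model_bpermute(const Offsets& index, const Payloads& src) {
  Payloads dst{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    const std::uint32_t offset = static_cast<std::uint32_t>(index[lane]);
    const std::size_t from = (offset >> 2) & 0x3Fu;  // keep bits [7:2] only
    assert(from < kWaveSize);
    dst[lane] = src[from];
  }
  return dst;
}

// Bit-level reinterpretation of a float as a 32-bit payload and back.
std::uint32_t float_to_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

float bits_to_float(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Every lane reads the float held by the next lane, wrapping at the wave end.
std::array<float, kWaveSize> rotate_by_one(
    const std::array<float, kWaveSize>& v) {
  Offsets index{};
  Payloads src{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    // Lane number times 4 gives the byte offset of that lane's payload.
    index[lane] = static_cast<std::int32_t>(((lane + 1) % kWaveSize) * 4);
    src[lane] = float_to_bits(v[lane]);
  }
  const Payloads dst = model_bpermute(index, src);
  std::array<float, kWaveSize> out{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    out[lane] = bits_to_float(dst[lane]);
  }
  return out;
}
```

In this model an offset of 5 or of 4 + 256 selects lane 1, just as 4 does, because only bits ``[7:2]`` are kept. The model does not cover inactive lanes or wave32, where the quoted text says the result is unspecified.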
----------------
yxsamliu wrote:
will do
https://github.com/llvm/llvm-project/pull/153501