================
@@ -61,3 +61,111 @@ Predefined Macros
      - Defined if FP64 instruction is available (deprecated).

 Please note that the specific architecture and feature names will vary depending on the GPU. Also, some macros are deprecated and may be removed in future releases.
+
+AMDGPU Builtins
+===============
+
+Clang provides a set of builtins to access low-level, AMDGPU-specific hardware features directly from C, C++, OpenCL C, and HIP. These builtins often map directly to a single machine instruction.
+
+.. _builtin-amdgcn-ds-bpermute:
+
+``__builtin_amdgcn_ds_bpermute``
+--------------------------------
+
+Performs a backward (pull) permutation of values within a wavefront. This builtin compiles to the
+``ds_bpermute_b32`` instruction and implements a "read from lane" semantic using a **byte-based**
+address.
+
+**Syntax**
+
+.. code-block:: c++
+
+  T __builtin_amdgcn_ds_bpermute(int index, T src);
+
+**Summary**
+
+All active lanes in the current wavefront conceptually place their ``src`` payloads into an
+internal cross-lane buffer. Each lane then reads a 32-bit value from that buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return type.
+
+Availability
+------------
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+----------
+
+- ``index`` (``int``): Byte offset used to select the source lane. Hardware only consumes bits
+  ``[7:2]``. To read the 32-bit value from lane *i*, pass ``i * 4`` as the index.
+  Indices that select lanes outside the current wave size or lanes that are inactive at the call
+  site yield an unspecified value (commonly zero on current hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. This value is converted to a
+  32-bit payload, permuted, and then converted back to type ``T`` as described below.
+
+Type ``T`` and Conversions
+--------------------------
+
+The instruction uses a 32-bit payload. The builtin accepts ``T`` only if it can map to/from 32 bits.
+
+Accepted ``T``:
+- Scalar integers and floating point
+- Vectors with total size ≤ 32 bits
+- Pointers with representation size ≤ 32 bits
+- C++ classes with a user-defined conversion to a supported 32-bit type
+
+Conversion rules:
+- Builtins <= 32 bits (e.g., ``char``, ``short``, ``int``, ``uint32_t``, ``float``): bitcast to/from i32.
+- ``double``: convert to ``float`` before permutation; convert back to ``double`` after (may warn).
+- Long integers (> 32 bits, e.g., ``long long``, ``__int128``): convert to i32 before permutation; convert back after (may warn).
----------------
yxsamliu wrote:
will do

https://github.com/llvm/llvm-project/pull/153501
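
For readers following the quoted documentation, the index rule it describes (pass ``i * 4`` to read from lane *i*; only bits ``[7:2]`` of the byte offset are consumed) can be sketched as a small host-side model in plain C++. This is not the real builtin and does not run on a GPU. The names `model_ds_bpermute` and `lane_to_index` are hypothetical, and returning zero for inactive or out-of-range source lanes is an assumption: the quoted text only says the value is unspecified and commonly zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte offset that selects lane `lane` (lane * 4).
inline int lane_to_index(int lane) { return lane * 4; }

// Host-side model of the "read from lane" semantic. NOT the real builtin.
// Assumptions (not guaranteed by hardware): the wave has src.size() lanes,
// and reads from inactive or out-of-range lanes return 0.
std::vector<std::uint32_t> model_ds_bpermute(const std::vector<int>& index,
                                             const std::vector<std::uint32_t>& src,
                                             const std::vector<bool>& active) {
    assert(index.size() == src.size() && src.size() == active.size());
    const std::size_t wave_size = src.size();
    std::vector<std::uint32_t> result(wave_size, 0);

    for (std::size_t lane = 0; lane < wave_size; ++lane) {
        if (!active[lane]) continue;  // inactive lanes do not read anything

        // Only bits [7:2] of the byte offset select the source lane.
        const std::size_t source_lane =
            (static_cast<std::uint32_t>(index[lane]) >> 2) & 0x3Fu;

        if (source_lane < wave_size && active[source_lane]) {
            result[lane] = src[source_lane];
        }
    }
    return result;
}
```

In this model, a lane that passes `lane_to_index(3)` receives lane 3's `src` value, and an offset such as `13` selects the same lane because the two low bits are ignored.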