llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

<details>
<summary>Changes</summary>

Add comprehensive documentation for AMDGPU target-specific builtins
(`AMDGPUBuiltins.rst`) covering argument semantics, restrictions, and lowering
notes for all builtin families.

This documentation was generated by AI (Claude) by cross-referencing:

- `clang/include/clang/Basic/BuiltinsAMDGPU.td` (builtin definitions)
- `llvm/include/llvm/IR/IntrinsicsAMDGPU.td` (intrinsic definitions)
- `clang/lib/Sema/SemaAMDGPU.cpp` (argument validation/constraints)
- `clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp` (lowering logic)

I did my best to proofread the parts I'm familiar with, but it would be greatly
appreciated if more people could help review it as well.

---

Patch is 69.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/181193.diff

2 Files Affected:

- (added) clang/docs/AMDGPUBuiltins.rst (+1807)
- (modified) clang/docs/index.rst (+1)


``````````diff
diff --git a/clang/docs/AMDGPUBuiltins.rst b/clang/docs/AMDGPUBuiltins.rst
new file mode 100644
index 0000000000000..9ca9a2bf3bd32
--- /dev/null
+++ b/clang/docs/AMDGPUBuiltins.rst
@@ -0,0 +1,1807 @@
+===============
+AMDGPU Builtins
+===============
+
+.. contents::
+   :local:
+   :depth: 2
+
+This document describes the AMDGPU target-specific builtins available in Clang.
+Most of these builtins provide direct access to AMDGPU hardware instructions
+and intrinsics. They are defined in ``clang/include/clang/Basic/BuiltinsAMDGPU.td``
+and typically lower to LLVM intrinsics defined in
+``llvm/include/llvm/IR/IntrinsicsAMDGPU.td``.
+
+All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or ``__builtin_r600_``
+for R600 targets). Arguments marked ``_Constant`` must be compile-time
+constant expressions.
+
+ABI / Special Register Builtins
+===============================
+
+These builtins provide access to kernel dispatch metadata, work-item and
+workgroup identification, and other ABI-level information. They are available
+on all SI+ targets.
+
+Pointer Builtins
+----------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``void __constant * __builtin_amdgcn_dispatch_ptr()``
+     - Returns a pointer (in constant address space 4) to the dispatch packet
+       (``hsa_kernel_dispatch_packet_t``). Used internally to derive workgroup
+       size, grid size, and other dispatch parameters.
+   * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()``
+     - Returns a pointer to the beginning of the kernel argument segment.
+   * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()``
+     - Returns a pointer to the implicit arguments appended after explicit
+       kernel arguments. Layout depends on the code object version.
+   * - ``void __constant * __builtin_amdgcn_queue_ptr()``
+     - Returns a pointer to the ``hsa_queue_t`` object for the queue executing
+       the current kernel.
+
+Work-Item and Workgroup Identification
+--------------------------------------
+
+All of these are ``Const`` (pure) builtins that take no arguments and return
+``unsigned int`` (or ``unsigned short`` for workgroup size).
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 30 30
+
+   * - Builtin
+     - Return Type
+     - Description
+   * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()``
+     - ``unsigned int``
+     - Workgroup ID in the specified dimension.
+   * - ``__builtin_amdgcn_workitem_id_{x,y,z}()``
+     - ``unsigned int``
+     - Work-item (thread) ID within the workgroup.
+   * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()``
+     - ``unsigned short``
+     - Workgroup size in the specified dimension. Lowered via a load from the
+       dispatch or implicit argument pointer, not a dedicated instruction.
+   * - ``__builtin_amdgcn_grid_size_{x,y,z}()``
+     - ``unsigned int``
+     - Total grid size in the specified dimension. Lowered via a load from the
+       dispatch pointer.
+
+**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``):
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 50
+
+   * - Builtin
+     - Description
+   * - ``__builtin_amdgcn_cluster_id_{x,y,z}()``
+     - Cluster ID in the specified dimension.
+   * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()``
+     - Workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_flat_id()``
+     - Flat (linearized) workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()``
+     - Maximum workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()``
+     - Maximum flat workgroup ID within the cluster.
+
+Other ABI Builtins
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned int val)``
+     - Counts the number of set bits in ``mask`` for lanes lower than the
+       current lane within the lower 32 bits of the exec mask, adds ``val``.
+   * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned int val)``
+     - Same as ``mbcnt_lo`` but for the upper 32 bits of the exec mask.
+   * - ``uint64_t __builtin_amdgcn_s_memtime()``
+     - Returns a 64-bit timestamp counter. Requires ``s-memtime-inst``.
+
+Instruction Builtins
+====================
+
+Scalar Instruction Builtins
+---------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_s_getreg(_Constant int hwreg)``
+     - Reads a hardware register. ``hwreg`` is an encoded register specifier
+       (register ID, offset, and width packed into 16 bits).
+   * - ``void __builtin_amdgcn_s_setreg(_Constant int hwreg, unsigned int val)``
+     - Writes ``val`` to a hardware register. ``hwreg`` must be in
+       range [0, 65535].
+   * - ``uint64_t __builtin_amdgcn_s_getpc()``
+     - Returns the current program counter.
+   * - ``void __builtin_amdgcn_s_waitcnt(_Constant int cnt)``
+     - Inserts an ``s_waitcnt`` instruction with the encoded wait count.
+   * - ``void __builtin_amdgcn_s_sendmsg(_Constant int msg, unsigned int gsdata)``
+     - Sends message ``msg`` with GS data in ``gsdata``.
+   * - ``void __builtin_amdgcn_s_sendmsghalt(_Constant int msg, unsigned int gsdata)``
+     - Same as ``s_sendmsg`` but also halts the wavefront.
+   * - ``void __builtin_amdgcn_s_barrier()``
+     - Inserts a workgroup barrier.
+   * - ``void __builtin_amdgcn_s_ttracedata(int data)``
+     - Writes ``data`` to the thread trace buffer.
+   * - ``void __builtin_amdgcn_s_sleep(_Constant int duration)``
+     - Sleeps for approximately ``duration`` cycles.
+   * - ``void __builtin_amdgcn_s_incperflevel(_Constant int level)``
+     - Increments the performance counter level.
+   * - ``void __builtin_amdgcn_s_decperflevel(_Constant int level)``
+     - Decrements the performance counter level.
+   * - ``void __builtin_amdgcn_s_setprio(_Constant short prio)``
+     - Sets the wavefront priority.
+   * - ``void __builtin_amdgcn_s_dcache_inv()``
+     - Invalidates the scalar data cache.
+   * - ``void __builtin_amdgcn_buffer_wbinvl1()``
+     - Write-back and invalidate L1 buffer cache.
+   * - ``unsigned int __builtin_amdgcn_groupstaticsize()``
+     - Returns the size of static LDS allocation in the current workgroup.
+   * - ``unsigned int __builtin_amdgcn_wavefrontsize()``
+     - Returns the wavefront size (32 or 64).
+   * - ``void __builtin_amdgcn_wave_barrier()``
+     - Inserts a wave-level barrier hint.
+
+Division and Math Builtins
+--------------------------
+
+Division Support
+^^^^^^^^^^^^^^^^
+
+These builtins implement steps of the iterative double-precision division
+algorithm.
+
+``__builtin_amdgcn_div_scale`` / ``__builtin_amdgcn_div_scalef``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_scale(double numer, double denom, bool select_quotient, bool *flag_out);
+   float __builtin_amdgcn_div_scalef(float numer, float denom, bool select_quotient, bool *flag_out);
+
+Scales the numerator or denominator for a subsequent iterative division.
+
+- ``numer``: The numerator.
+- ``denom``: The denominator.
+- ``select_quotient``: If ``true``, selects the numerator for scaling; if
+  ``false``, selects the denominator.
+- ``flag_out``: Pointer to a ``bool`` where the overflow/underflow flag is
+  written.
+
+**Lowering note**: The underlying intrinsic returns ``{result, flag}`` as a
+struct. The builtin unpacks this, returning the result and storing the flag
+through the pointer.
+
+``__builtin_amdgcn_div_fmas`` / ``__builtin_amdgcn_div_fmasf``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_fmas(double a, double b, double c, bool vcc);
+   float __builtin_amdgcn_div_fmasf(float a, float b, float c, bool vcc);
+
+Fused multiply-add for division, with VCC flag input.
+
+- ``a``, ``b``, ``c``: FMA operands (computes ``a * b + c``).
+- ``vcc``: The flag from ``div_scale``.
+
+**Lowering note**: The integer ``vcc`` argument is converted to ``i1`` via
+``IsNotNull`` before passing to the intrinsic.
+
+``__builtin_amdgcn_div_fixup`` / ``__builtin_amdgcn_div_fixupf`` / ``__builtin_amdgcn_div_fixuph``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_fixup(double a, double b, double c);
+   float __builtin_amdgcn_div_fixupf(float a, float b, float c);
+   __fp16 __builtin_amdgcn_div_fixuph(__fp16 a, __fp16 b, __fp16 c); // requires 16-bit-insts
+
+Applies post-division fixup for special values (NaN, Inf, zero).
+
+Trigonometric Pre-operation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+   double __builtin_amdgcn_trig_preop(double src, int segment);
+   float __builtin_amdgcn_trig_preopf(float src, int segment);
+
+Looks up ``2.0 / pi`` with segment selector ``segment[4:0]`` for range
+reduction before trigonometric operations.
+
+Single-Argument Math Builtins
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These builtins compute hardware-precision math operations. The ``f32`` versions
+(e.g., ``sinf``, ``logf``) may not handle denormals correctly. The ``h``-suffixed
+variants require ``16-bit-insts``.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 25 25 25
+
+   * - Operation
+     - f64
+     - f32
+     - f16
+   * - Reciprocal
+     - ``__builtin_amdgcn_rcp``
+     - ``__builtin_amdgcn_rcpf``
+     - ``__builtin_amdgcn_rcph``
+   * - Square root
+     - ``__builtin_amdgcn_sqrt``
+     - ``__builtin_amdgcn_sqrtf``
+     - ``__builtin_amdgcn_sqrth``
+   * - Reciprocal sqrt
+     - ``__builtin_amdgcn_rsq``
+     - ``__builtin_amdgcn_rsqf``
+     - ``__builtin_amdgcn_rsqh``
+   * - Reciprocal sqrt clamp
+     - ``__builtin_amdgcn_rsq_clamp``
+     - ``__builtin_amdgcn_rsq_clampf``
+     -
+   * - Sine (input: turns)
+     -
+     - ``__builtin_amdgcn_sinf``
+     - ``__builtin_amdgcn_sinh``
+   * - Cosine (input: turns)
+     -
+     - ``__builtin_amdgcn_cosf``
+     - ``__builtin_amdgcn_cosh``
+   * - Log2
+     -
+     - ``__builtin_amdgcn_logf``
+     -
+   * - Log clamp
+     -
+     - ``__builtin_amdgcn_log_clampf``
+     -
+   * - Exp2
+     -
+     - ``__builtin_amdgcn_exp2f``
+     -
+   * - Fraction
+     - ``__builtin_amdgcn_fract``
+     - ``__builtin_amdgcn_fractf``
+     - ``__builtin_amdgcn_fracth``
+   * - Mantissa
+     - ``__builtin_amdgcn_frexp_mant``
+     - ``__builtin_amdgcn_frexp_mantf``
+     - ``__builtin_amdgcn_frexp_manth``
+   * - Exponent
+     - ``__builtin_amdgcn_frexp_exp``
+     - ``__builtin_amdgcn_frexp_expf``
+     - ``__builtin_amdgcn_frexp_exph``
+
+Note: ``sinf``/``cosf`` take input in **turns** (1.0 = full circle), not
+radians. ``logf`` performs ``log2``. ``exp2f`` performs ``2^x``. The ``frexp_exp``
+variants return ``int`` (or ``short`` for f16).
+
+Ldexp
+^^^^^
+
+.. code-block:: c
+
+   double __builtin_amdgcn_ldexp(double x, int exp);
+   float __builtin_amdgcn_ldexpf(float x, int exp);
+   __fp16 __builtin_amdgcn_ldexph(__fp16 x, int exp); // requires 16-bit-insts
+
+Computes ``x * 2^exp``. Lowered to the standard ``llvm.ldexp`` intrinsic.
+For the ``h`` variant, the exponent is truncated to ``i16``.
+
+FP Classify
+^^^^^^^^^^^
+
+.. code-block:: c
+
+   bool __builtin_amdgcn_class(double x, int mask);
+   bool __builtin_amdgcn_classf(float x, int mask);
+   bool __builtin_amdgcn_classh(__fp16 x, int mask); // requires 16-bit-insts
+
+Tests ``x`` against a bitmask of FP classes. Returns ``true`` if ``x`` matches
+any of the selected classes. The ``mask`` bits are:
+
+- Bit 0: Signaling NaN
+- Bit 1: Quiet NaN
+- Bit 2: Negative infinity
+- Bit 3: Negative normal
+- Bit 4: Negative denormal
+- Bit 5: Negative zero
+- Bit 6: Positive zero
+- Bit 7: Positive denormal
+- Bit 8: Positive normal
+- Bit 9: Positive infinity
+
+Median
+^^^^^^
+
+.. code-block:: c
+
+   float __builtin_amdgcn_fmed3f(float a, float b, float c);
+   __fp16 __builtin_amdgcn_fmed3h(__fp16 a, __fp16 b, __fp16 c); // requires gfx9-insts
+
+Returns the median (middle value) of three floating-point numbers.
+
+Cube Map Builtins
+^^^^^^^^^^^^^^^^^
+
+Require ``cube-insts``. All take three floats (x, y, z direction vector
+components) and return a float.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``__builtin_amdgcn_cubeid(x, y, z)``
+     - Returns the face ID (0-5) of the cube map.
+   * - ``__builtin_amdgcn_cubesc(x, y, z)``
+     - Returns the S coordinate for the cube face.
+   * - ``__builtin_amdgcn_cubetc(x, y, z)``
+     - Returns the T coordinate for the cube face.
+   * - ``__builtin_amdgcn_cubema(x, y, z)``
+     - Returns the major axis value.
+
+Data Sharing Builtins
+---------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``int __builtin_amdgcn_ds_swizzle(int data, _Constant int pattern)``
+     - Performs a data-parallel swizzle within the wavefront according to the
+       encoded ``pattern``.
+   * - ``int __builtin_amdgcn_ds_permute(int addr, int data)``
+     - Forward cross-lane permutation. Each lane writes its ``data`` value to
+       the lane specified by its ``addr / 4``.
+   * - ``int __builtin_amdgcn_ds_bpermute(int addr, int data)``
+     - Backward cross-lane permutation. Lane ``i`` reads from the lane
+       specified by ``addr / 4``.
+   * - ``int __builtin_amdgcn_ds_append(int __local *ptr)``
+     - Atomically increments the value at ``ptr`` and returns the old value.
+       The pointer must be in LDS (address space 3).
+   * - ``int __builtin_amdgcn_ds_consume(int __local *ptr)``
+     - Atomically decrements the value at ``ptr`` and returns the new value.
+
+DS Float Atomics
+^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+   float __builtin_amdgcn_ds_faddf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+   float __builtin_amdgcn_ds_fminf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+   float __builtin_amdgcn_ds_fmaxf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+
+Perform atomic float add/min/max on LDS memory. The ``ordering`` and ``scope``
+arguments are passed through but the operations are lowered to ``AtomicRMW``
+instructions.
+
+Lane Builtins
+-------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``int __builtin_amdgcn_readfirstlane(int val)``
+     - Returns the value of ``val`` from the first active lane.
+   * - ``int __builtin_amdgcn_readlane(int val, int lane)``
+     - Returns the value of ``val`` from the specified ``lane``.
+
+Bit Manipulation
+----------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_alignbit(unsigned int hi, unsigned int lo, unsigned int shift)``
+     - Concatenates ``hi:lo`` as a 64-bit value and extracts 32 bits starting
+       at bit ``shift``. Lowered to ``llvm.fshr``.
+   * - ``unsigned int __builtin_amdgcn_alignbyte(unsigned int hi, unsigned int lo, unsigned int shift)``
+     - Same as ``alignbit`` but ``shift`` is in bytes.
+   * - ``unsigned int __builtin_amdgcn_ubfe(unsigned int base, unsigned int offset, unsigned int width)``
+     - Unsigned bitfield extract from ``base`` starting at ``offset`` for
+       ``width`` bits.
+   * - ``unsigned int __builtin_amdgcn_sbfe(unsigned int base, unsigned int offset, unsigned int width)``
+     - Signed bitfield extract.
+   * - ``unsigned int __builtin_amdgcn_lerp(unsigned int a, unsigned int b, unsigned int c)``
+     - Per-byte unsigned linear interpolation. Requires ``lerp-inst``.
+   * - ``unsigned int __builtin_amdgcn_perm(unsigned int a, unsigned int b, unsigned int sel)``
+     - Byte permutation. ``sel`` encodes which byte of the ``a:b`` pair to
+       select for each byte of the result. Requires ``gfx8-insts``.
+
+Conversion Builtins
+-------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``half2 __builtin_amdgcn_cvt_pkrtz(float a, float b)``
+     - Converts two f32 values to packed f16 with round-to-zero.
+   * - ``short2 __builtin_amdgcn_cvt_pknorm_i16(float a, float b)``
+     - Converts two f32 values to packed normalized i16. Requires
+       ``cvt-pknorm-vop2-insts``.
+   * - ``ushort2 __builtin_amdgcn_cvt_pknorm_u16(float a, float b)``
+     - Converts two f32 values to packed normalized u16.
+   * - ``short2 __builtin_amdgcn_cvt_pk_i16(int a, int b)``
+     - Packs two i32 values into i16x2.
+   * - ``ushort2 __builtin_amdgcn_cvt_pk_u16(unsigned int a, unsigned int b)``
+     - Packs two u32 values into u16x2.
+   * - ``unsigned int __builtin_amdgcn_cvt_pk_u8_f32(float val, unsigned int bytesel, unsigned int old)``
+     - Converts ``val`` to u8 and inserts at byte ``bytesel`` in ``old``.
+   * - ``float __builtin_amdgcn_cvt_off_f32_i4(int val)``
+     - Converts a 4-bit integer offset to f32.
+
+SAD (Sum of Absolute Differences)
+---------------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_msad_u8(unsigned int a, unsigned int b, unsigned int c)``
+     - Masked sum of absolute differences of unsigned 8-bit values.
+   * - ``unsigned int __builtin_amdgcn_sad_u8(unsigned int a, unsigned int b, unsigned int c)``
+     - Sum of absolute differences of unsigned 8-bit values. Requires
+       ``sad-insts``.
+   * - ``unsigned int __builtin_amdgcn_sad_hi_u8(unsigned int a, unsigned int b, unsigned int c)``
+     - SAD with result in high 16 bits. Requires ``sad-insts``.
+   * - ``unsigned int __builtin_amdgcn_sad_u16(unsigned int a, unsigned int b, unsigned int c)``
+     - SAD of unsigned 16-bit values. Requires ``sad-insts``.
+   * - ``uint64_t __builtin_amdgcn_qsad_pk_u16_u8(uint64_t a, unsigned int b, uint64_t c)``
+     - Quad SAD packed. Requires ``qsad-insts``.
+   * - ``uint64_t __builtin_amdgcn_mqsad_pk_u16_u8(uint64_t a, unsigned int b, uint64_t c)``
+     - Masked quad SAD packed.
+   * - ``uint4 __builtin_amdgcn_mqsad_u32_u8(uint64_t a, unsigned int b, uint4 c)``
+     - Masked quad SAD returning 4x u32.
+
+Buffer Resource and Load/Store
+==============================
+
+make_buffer_rsrc
+----------------
+
+.. code-block:: c
+
+   __amdgpu_buffer_rsrc_t __builtin_amdgcn_make_buffer_rsrc(void *base, short stride, int64_t num_records, int flags);
+
+Constructs a buffer resource descriptor from the given fields:
+
+- ``base``: Base pointer.
+- ``stride``: Stride of structured buffer (0 for raw).
+- ``num_records``: Number of records (bytes for raw buffers).
+- ``flags``: SRD flags (DST_SEL, NUM_FORMAT, DATA_FORMAT, etc.).
+
+Raw Buffer Load/Store
+---------------------
+
+These builtins load/store data through a buffer resource descriptor.
+
+.. code-block:: c
+
+   // Stores
+   void __builtin_amdgcn_raw_buffer_store_b{8,16,32,64,96,128}(data, __amdgpu_buffer_rsrc_t rsrc, int offset, int soffset, _Constant int cachepolicy);
+   // Loads
+   T __builtin_amdgcn_raw_buffer_load_b{8,16,32,64,96,128}(__amdgpu_buffer_rsrc_t rsrc, int offset, int soffset, _Constant int cachepolicy);
+
+Arguments:
+
+- ``rsrc``: Buffer resource descriptor (128-bit SRD, typically SGPR).
+- ``offset``: Byte offset (VGPR or immediate). Included in bounds checking and
+  swizzling.
+- ``soffset``: Scalar byte offset (SGPR or immediate). Excluded from bounds
+  checking and swizzling.
+- ``cachepolicy``: Immediate bitfield controlling cache behavior:
+
+  - Pre-GFX12: bit 0 = GLC, bit 1 = SLC, bit 2 = DLC (gfx10/gfx11),
+    bit 3 = SWZ, bit 4 = SCC (gfx90a).
+  - GFX942: bit 0 = SC0, bit 1 = NT, bit 3 = SWZ, bit 4 = SC1.
+  - GFX12+: bits [0:2] = TH, bits [3:4] = scope, bit 6 = SWZ.
+  - All: bit 31 = volatile (stripped at lowering).
+
+The data types for each width are: ``b8`` = ``unsigned char``,
+``b16`` = ``unsigned short``, ``b32`` = ``unsigned int``,
+``b64`` = ``uint2``, ``b96`` = ``uint3``, ``b128`` = ``uint4``.
+
+Raw Ptr Buffer Atomics
+---... [truncated]
``````````

</details>

https://github.com/llvm/llvm-project/pull/181193

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
