[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_ds_bpermute argument types (PR #153501)

Yaxun Liu via cfe-commits Wed, 13 Aug 2025 18:42:28 -0700

https://github.com/yxsamliu updated 
https://github.com/llvm/llvm-project/pull/153501


>From 6619a7102e0c36d005487062cfd9a9071bad3c34 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun....@amd.com>
Date: Wed, 13 Aug 2025 09:18:36 -0400
Subject: [PATCH] [AMDGPU] Extend __builtin_amdgcn_ds_bpermute argument types

This patch enhances the __builtin_amdgcn_ds_bpermute builtin to support a wider 
variety of argument types, improving its usability and flexibility.

Previously, the builtin was restricted to int(int, int), forcing users to 
manually bit-cast arguments and results. This change makes the builtin 
polymorphic, so the return type now matches the type of the source (second) 
argument.

The builtin now accepts various scalar integer and floating-point types, 
pointers, and vector types up to 32 bits. It also supports C++ classes with 
user-defined conversions to a 32-bit type.

The implementation adds custom type-checking in Sema and a dedicated CodeGen 
handler to correctly coerce types to and from the underlying i32 intrinsic. New 
diagnostics are included to warn about truncation and reject oversized vector 
arguments. Comprehensive Sema and CodeGen tests are added to validate the new 
behavior.

This change introduces a new section in AMDGPUSupport.rst documenting Clang's 
AMDGPU-specific builtin functions, starting with __builtin_amdgcn_ds_bpermute.
The goal is to provide clear, centralized reference material for developers 
writing performance-critical GPU kernels, as demand for accessible 
documentation on AMDGPU builtins has grown. Future updates will extend this 
section to cover additional builtins and their semantics, usage patterns, and 
target requirements.
---
 clang/docs/AMDGPUSupport.rst                  | 108 ++++++++++++
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   2 +-
 .../clang/Basic/DiagnosticSemaKinds.td        |   7 +
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   | 117 +++++++++++++
 clang/lib/Sema/SemaAMDGPU.cpp                 | 128 ++++++++++++++
 clang/lib/Sema/SemaChecking.cpp               |   3 +-
 .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 163 ++++++++++++++++++
 .../SemaHIP/builtin-amdgcn-ds-bpermute.hip    |  33 ++++
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td      |   8 +-
 9 files changed, 563 insertions(+), 6 deletions(-)
 create mode 100644 clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
 create mode 100644 clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip

diff --git a/clang/docs/AMDGPUSupport.rst b/clang/docs/AMDGPUSupport.rst
index 3eada5f900613..b6006fe05c485 100644
--- a/clang/docs/AMDGPUSupport.rst
+++ b/clang/docs/AMDGPUSupport.rst
@@ -61,3 +61,111 @@ Predefined Macros
      - Defined if FP64 instruction is available (deprecated).
 
 Please note that the specific architecture and feature names will vary 
depending on the GPU. Also, some macros are deprecated and may be removed in 
future releases.
+
+AMDGPU Builtins
+===============
+
+Clang provides a set of builtins to access low-level, AMDGPU-specific hardware 
features directly from C, C++, OpenCL C, and HIP. These builtins often map 
directly to a single machine instruction.
+
+.. _builtin-amdgcn-ds-bpermute:
+
+``__builtin_amdgcn_ds_bpermute``
+--------------------------------
+
+Performs a backward (pull) permutation of values within a wavefront. This 
builtin compiles to the
+``ds_bpermute_b32`` instruction and implements a "read from lane" semantic 
using a **byte-based**
+address.
+
+**Syntax**
+
+.. code-block:: c++
+
+  T __builtin_amdgcn_ds_bpermute(int index, T src);
+
+**Summary**
+
+All active lanes in the current wavefront conceptually place their ``src`` 
payloads into an
+internal cross-lane buffer. Each lane then reads a 32-bit value from that 
buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS 
hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return 
type.
+
+Availability
+------------
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+----------
+
+- ``index`` (``int``): Byte offset used to select the source lane. Hardware 
only consumes bits
+  ``[7:2]``. To read the 32-bit value from lane *i*, pass ``i * 4`` as the 
index.
+  Indices that select lanes outside the current wave size or lanes that are 
inactive at the call
+  site yield an unspecified value (commonly zero on current hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. This value is 
converted to a
+  32-bit payload, permuted, and then converted back to type ``T`` as described 
below.
+
+Type ``T`` and Conversions
+--------------------------
+
+The instruction uses a 32-bit payload. The builtin accepts ``T`` only if it 
can map to/from 32 bits.
+
+Accepted ``T``:
+- Scalar integers and floating point
+- Vectors with total size ≤ 32 bits
+- Pointers with representation size ≤ 32 bits
+- C++ classes with a user-defined conversion to a supported 32-bit type
+
+Conversion rules:
+- Builtins <= 32 bits (e.g., ``char``, ``short``, ``int``, ``uint32_t``, 
``float``): bitcast to/from i32.
+- ``double``: convert to ``float`` before permutation; convert back to 
``double`` after (may warn).
+- Long integers (> 32 bits, e.g., ``long long``, ``__int128``): convert to i32 
before permutation; convert back after (may warn).
+- Vectors:
+  - Size <= 32 bits: bitcast to/from i32.
+  - Size > 32 bits: not allowed (error).
+- Pointers:
+  - Size <= 32 bits: bitcast to/from i32.
+  - Size > 32 bits: not allowed (error).
+- C++ classes: prefer user-defined conversion to a 32-bit type; otherwise use 
a conversion to i32 if available; otherwise reject.
+
+.. note::
+   Narrowing may lose information. For exact bit patterns, use a 32-bit 
trivially copyable type (e.g., ``uint32_t``).
+
+Semantics and Guarantees
+------------------------
+
+* **Active lane participation**: Only lanes active in the EXEC mask at the 
call site
+  contribute a payload. Reading from an inactive source lane produces an 
unspecified value.
+
+* **Index per lane**: ``index`` may vary across lanes. Only bits ``[7:2]`` are 
used for lane
+  selection. Bits outside this range are ignored by hardware.
+
+* **No synchronization**: The builtin does not synchronize lanes or waves and 
does not
+  order memory operations. It doesn't read or write user-visible LDS.
+
+* **Wave size**: Valid source lanes are ``0 .. warpSize-1`` (use 
``warpSize``/equivalent to
+  query 32 vs 64). Selecting lanes outside that range yields an unspecified 
value.
+
+Examples
+--------
+
+Reverse within a wavefront (handles wave32 or wave64):
+
+.. code-block:: c++
+
+  #include <hip/hip_runtime.h>
+
+  __global__ void wavefront_reverse(float* data, int n) {
+    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
+    if (tid >= n) return;
+
+    int lane    = threadIdx.x % warpSize;            // works for 32 or 64
+    int peer    = (warpSize - 1) - lane;             // reversed lane
+    int offset  = peer * 4;                          // byte address
+
+    float my_val      = data[tid];
+    float reversed    = __builtin_amdgcn_ds_bpermute(offset, my_val);
+    data[tid] = reversed;
+  }
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index f8f55772db8fe..8c1011d7bba61 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -121,7 +121,7 @@ BUILTIN(__builtin_amdgcn_s_decperflevel, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_setprio, "vIs", "n")
 BUILTIN(__builtin_amdgcn_ds_swizzle, "iiIi", "nc")
 BUILTIN(__builtin_amdgcn_ds_permute, "iii", "nc")
-BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
+BUILTIN(__builtin_amdgcn_ds_bpermute, "v.", "nct")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td 
b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 116341f4b66d5..d57a422f5fda5 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -13547,4 +13547,11 @@ def warn_acc_var_referenced_lacks_op
 // AMDGCN builtins diagnostics
 def err_amdgcn_load_lds_size_invalid_value : Error<"invalid size value">;
 def note_amdgcn_load_lds_size_valid_value : Note<"size must be %select{1, 2, 
or 4|1, 2, 4, 12 or 16}0">;
+def err_amdgcn_builtin_vector_pointer_arg_size
+    : Error<"the vector or pointer argument to %0 must have a total size of %1 
"
+            "bits or less, but type %2 has a size of %3 bits">;
+def warn_amdgcn_builtin_arg_truncation
+    : Warning<"the %0-bit argument will be truncated to %1 bits in this call "
+              "to %2">,
+      InGroup<DiagGroup<"amdgcn-builtin-arg-trunc">>;
 } // end of sema component.
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp 
b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index dad1f95ac710d..91a414418f6a6 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -159,6 +159,119 @@ Value *EmitAMDGPUGridSize(CodeGenFunction &CGF, unsigned 
Index) {
                   llvm::MDNode::get(CGF.getLLVMContext(), {}));
   return LD;
 }
+// Lowers __builtin_amdgcn_ds_bpermute to the corresponding LLVM intrinsic with
+// careful bit-level coercions of operands and result to match Clang types.
+llvm::Value *emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF,
+                                  const clang::CallExpr *Call) {
+  auto &Builder = CGF.Builder;
+  auto &CGM = CGF.CGM;
+  const llvm::DataLayout &DL = CGM.getDataLayout();
+
+  llvm::Type *I32Ty = Builder.getInt32Ty();
+
+  auto GetBitWidth = [&](llvm::Type *Ty) -> unsigned {
+    return DL.getTypeSizeInBits(Ty).getFixedValue();
+  };
+
+  // Coerces arbitrary scalar/vector/pointer to i32 by preserving value/bit
+  // semantics where applicable.
+  auto ToI32Bits = [&](llvm::Value *Val, clang::QualType Qt) -> llvm::Value * {
+    llvm::Type *Ty = Val->getType();
+
+    if (Ty->isIntegerTy()) {
+      unsigned BitWidth = Ty->getIntegerBitWidth();
+      if (BitWidth < 32) {
+        if (Qt->isSignedIntegerType())
+          return Builder.CreateSExt(Val, I32Ty);
+        else
+          return Builder.CreateZExt(Val, I32Ty);
+      } else
+        return Builder.CreateZExtOrTrunc(Val, I32Ty);
+    }
+
+    if (Ty->isPointerTy()) {
+      unsigned PtrBits = DL.getPointerSizeInBits(Ty->getPointerAddressSpace());
+      llvm::Type *IntPtrTy = Builder.getIntNTy(PtrBits);
+      llvm::Value *AsInt = Builder.CreatePtrToInt(Val, IntPtrTy);
+      return Builder.CreateZExtOrTrunc(AsInt, I32Ty);
+    }
+
+    unsigned Bits = GetBitWidth(Ty);
+    llvm::Type *IntN = Builder.getIntNTy(Bits);
+    llvm::Value *AsInt = Builder.CreateBitCast(Val, IntN);
+    return Builder.CreateZExtOrTrunc(AsInt, I32Ty);
+  };
+
+  // Bit-preserving resize/cast between arbitrary source and destination LLVM
+  // types.
+  auto BitCoerceTo = [&](llvm::Value *Val, llvm::Type *DstTy) -> llvm::Value * 
{
+    llvm::Type *SrcTy = Val->getType();
+    if (SrcTy == DstTy)
+      return Val;
+
+    unsigned SrcBits = DL.getTypeSizeInBits(SrcTy).getFixedValue();
+    unsigned DstBits = DL.getTypeSizeInBits(DstTy).getFixedValue();
+
+    if (SrcTy->isIntegerTy() && DstTy->isIntegerTy())
+      return Builder.CreateZExtOrTrunc(Val, DstTy);
+
+    if (SrcBits == DstBits)
+      return Builder.CreateBitCast(Val, DstTy);
+
+    llvm::Type *IntSrcTy = Builder.getIntNTy(SrcBits);
+    llvm::Value *AsInt = Val;
+    if (SrcTy->isPointerTy())
+      AsInt = Builder.CreatePtrToInt(Val, IntSrcTy);
+    else if (!SrcTy->isIntegerTy())
+      AsInt = Builder.CreateBitCast(Val, IntSrcTy);
+
+    llvm::Type *IntDstTy = Builder.getIntNTy(DstBits);
+    llvm::Value *Resized = Builder.CreateZExtOrTrunc(AsInt, IntDstTy);
+
+    if (DstTy->isPointerTy())
+      return Builder.CreateIntToPtr(Resized, DstTy);
+
+    return Builder.CreateBitCast(Resized, DstTy);
+  };
+
+  llvm::Value *Index = CGF.EmitScalarExpr(Call->getArg(0));
+  llvm::Value *Source = CGF.EmitScalarExpr(Call->getArg(1));
+
+  llvm::Type *ReturnTy = CGF.ConvertType(Call->getType());
+
+  llvm::Value *IndexI32 = ToI32Bits(Index, Call->getArg(0)->getType());
+
+  llvm::Value *SourceForIntrinsic;
+  llvm::Type *SourceTy = Source->getType();
+
+  if (SourceTy->isDoubleTy()) {
+    llvm::Value *AsFloat = Builder.CreateFPTrunc(Source, Builder.getFloatTy());
+    SourceForIntrinsic = Builder.CreateBitCast(AsFloat, I32Ty);
+  } else
+    SourceForIntrinsic = ToI32Bits(Source, Call->getArg(1)->getType());
+
+  llvm::Function *IntrinsicFn =
+      CGM.getIntrinsic(llvm::Intrinsic::amdgcn_ds_bpermute);
+
+  llvm::Value *Result =
+      Builder.CreateCall(IntrinsicFn, {IndexI32, SourceForIntrinsic});
+
+  if (ReturnTy->isDoubleTy()) {
+    llvm::Value *AsFloat = Builder.CreateBitCast(Result, Builder.getFloatTy());
+    return Builder.CreateFPExt(AsFloat, ReturnTy);
+  }
+
+  if (ReturnTy->isIntegerTy() && ReturnTy->getIntegerBitWidth() > 32) {
+    clang::QualType SourceQt = Call->getArg(1)->getType();
+    if (SourceQt->isSignedIntegerType())
+      return Builder.CreateSExt(Result, ReturnTy);
+    else
+      return Builder.CreateZExt(Result, ReturnTy);
+  }
+
+  return BitCoerceTo(Result, ReturnTy);
+}
+
 } // namespace
 
 // Generates the IR for __builtin_read_exec_*.
@@ -341,6 +454,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_ds_swizzle:
     return emitBuiltinWithOneOverloadedType<2>(*this, E,
                                                Intrinsic::amdgcn_ds_swizzle);
+
+  case AMDGPU::BI__builtin_amdgcn_ds_bpermute:
+    return emitAMDGCNDsBpermute(*this, E);
+
   case AMDGPU::BI__builtin_amdgcn_mov_dpp8:
   case AMDGPU::BI__builtin_amdgcn_mov_dpp:
   case AMDGPU::BI__builtin_amdgcn_update_dpp: {
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index 1913bb830ccd0..b5b976d2d3a01 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -18,6 +18,132 @@
 #include "llvm/Support/AtomicOrdering.h"
 #include <cstdint>
 
+namespace {
+
+using llvm::StringRef;
+using namespace clang;
+
+/// Attempts to apply a user-defined conversion on Arg at ArgIndex to a
+/// 32-bit-compatible type. If successful, updates TheCall's argument. Returns
+/// true if a suitable conversion was applied.
+bool tryUserDefinedConversion32Bit(Sema &SemaRef, Expr *Arg, CallExpr *TheCall,
+                                   unsigned ArgIndex) {
+  const CXXRecordDecl *RecordDecl = Arg->getType()->getAsCXXRecordDecl();
+  if (!RecordDecl)
+    return false;
+
+  // Iterate over class conversion operators and pick the first that yields a
+  // 32-bit type.
+  for (auto *MethodDecl : RecordDecl->methods()) {
+    if (auto *ConversionDecl = dyn_cast<CXXConversionDecl>(MethodDecl)) {
+      QualType ConvType = ConversionDecl->getConversionType();
+
+      bool Is32Bit = false;
+      auto SizeIs32 = [&](QualType T) {
+        return SemaRef.Context.getTypeSize(T) == 32;
+      };
+
+      // Classify 32-bit-compatible target types with target-dependent size
+      // checks where needed.
+      if (const auto *BT = ConvType->getAs<BuiltinType>())
+        Is32Bit = (BT->getKind() == BuiltinType::Float) ||
+                  ((BT->getKind() == BuiltinType::Int ||
+                    BT->getKind() == BuiltinType::UInt) &&
+                   SizeIs32(ConvType));
+      else if (ConvType->isPointerType() || ConvType->isVectorType())
+        Is32Bit = SizeIs32(ConvType);
+
+      if (Is32Bit) {
+        ExprResult ConvResult = SemaRef.PerformImplicitConversion(
+            Arg, ConvType, AssignmentAction::Converting);
+        if (ConvResult.isInvalid())
+          return false;
+
+        TheCall->setArg(ArgIndex, ConvResult.get());
+        return true;
+      }
+    }
+  }
+
+  return false;
+}
+
+/// Handles and coerces a "payload" argument at ArgIndex to a 32-bit-compatible
+/// type. On success, also sets the call result type to the argument's 
resulting
+/// type. Returns true on error (diagnosed), false on success.
+bool handle32BitPayloadArg(Sema &SemaRef, CallExpr *TheCall, unsigned ArgIndex,
+                           StringRef BuiltinName) {
+  ASTContext &AstContext = SemaRef.getASTContext();
+  Expr *Arg = TheCall->getArg(ArgIndex);
+  QualType Type = Arg->getType();
+  QualType Int32Ty = AstContext.IntTy;
+
+  if (Type->isVectorType() || Type->isPointerType()) {
+    uint64_t Size = AstContext.getTypeSize(Type);
+    if (Size > 32) {
+      SemaRef.Diag(Arg->getBeginLoc(),
+                   diag::err_amdgcn_builtin_vector_pointer_arg_size)
+          << BuiltinName << 32 << Type << Size << Arg->getSourceRange();
+      return true;
+    }
+  } else if (Type->isScalarType()) {
+    uint64_t Size = AstContext.getTypeSize(Type);
+    if (Size > 32)
+      SemaRef.Diag(Arg->getBeginLoc(), 
diag::warn_amdgcn_builtin_arg_truncation)
+          << Size << 32 << BuiltinName << Arg->getSourceRange();
+  } else {
+    // Prefer user-defined conversion operators that yield a 32-bit-compatible
+    // type. If none apply, fall back to an implicit int32 conversion.
+    if (!tryUserDefinedConversion32Bit(SemaRef, Arg, TheCall, ArgIndex)) {
+      ExprResult ConvResult = SemaRef.PerformImplicitConversion(
+          Arg, Int32Ty, AssignmentAction::Converting);
+      if (ConvResult.isInvalid())
+        return true;
+      TheCall->setArg(ArgIndex, ConvResult.get());
+    }
+  }
+
+  TheCall->setType(TheCall->getArg(ArgIndex)->getType());
+  return false;
+}
+
+/// Validates and coerces the arguments to __builtin_amdgcn_ds_bpermute.
+/// Ensures arg0 is int32 (with truncation warning as needed) and applies the
+/// 32-bit payload handler to arg1.
+bool checkDsBpermuteFunctionCall(Sema &SemaRef, CallExpr *TheCall) {
+  if (SemaRef.checkArgCount(TheCall, 2))
+    return true;
+
+  ASTContext &AstContext = SemaRef.getASTContext();
+
+  const FunctionDecl *FuncDecl = TheCall->getDirectCallee();
+  StringRef BuiltinName = FuncDecl ? FuncDecl->getName()
+                                   : StringRef("__builtin_amdgcn_ds_bpermute");
+
+  Expr *IndexArg = TheCall->getArg(0);
+  QualType Int32Ty = AstContext.IntTy;
+
+  if (AstContext.getTypeSize(IndexArg->getType()) > 32 &&
+      !IndexArg->getType()->isRecordType())
+    SemaRef.Diag(IndexArg->getBeginLoc(),
+                 diag::warn_amdgcn_builtin_arg_truncation)
+        << AstContext.getTypeSize(IndexArg->getType()) << 32 << BuiltinName
+        << IndexArg->getSourceRange();
+
+  ExprResult ConvResult = SemaRef.PerformImplicitConversion(
+      IndexArg, Int32Ty, AssignmentAction::Converting);
+  if (ConvResult.isInvalid())
+    return true;
+  TheCall->setArg(0, ConvResult.get());
+
+  if (handle32BitPayloadArg(SemaRef, TheCall, /*ArgIndex=*/1, BuiltinName))
+    return true;
+
+  return false;
+}
+
+} // anonymous namespace
+
 namespace clang {
 
 SemaAMDGPU::SemaAMDGPU(Sema &S) : SemaBase(S) {}
@@ -100,6 +226,8 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned 
BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_cvt_scale_pk16_f32_fp6:
   case AMDGPU::BI__builtin_amdgcn_cvt_scale_pk16_f32_bf6:
     return SemaRef.BuiltinConstantArgRange(TheCall, 2, 0, 7);
+  case AMDGPU::BI__builtin_amdgcn_ds_bpermute:
+    return checkDsBpermuteFunctionCall(SemaRef, TheCall);
   default:
     return false;
   }
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index 907740374dbfe..921f0b037b9f5 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -2078,7 +2078,8 @@ bool Sema::CheckTSBuiltinFunctionCall(const TargetInfo 
&TI, unsigned BuiltinID,
   case llvm::Triple::spirv64:
     if (TI.getTriple().getOS() != llvm::Triple::OSType::AMDHSA)
       return SPIRV().CheckSPIRVBuiltinFunctionCall(TI, BuiltinID, TheCall);
-    return false;
+    else
+      return AMDGPU().CheckAMDGCNBuiltinFunctionCall(BuiltinID, TheCall);
   case llvm::Triple::systemz:
     return SystemZ().CheckSystemZBuiltinFunctionCall(BuiltinID, TheCall);
   case llvm::Triple::x86:
diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip 
b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
new file mode 100644
index 0000000000000..00c02ccc7638f
--- /dev/null
+++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
@@ -0,0 +1,163 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1200 -aux-triple 
x86_64-pc-linux-gnu \
+// RUN:    -fcuda-is-device -emit-llvm -o - %s | FileCheck %s
+
+typedef short short2 __attribute__((vector_size(4)));
+
+#define __device__ __attribute__((device))
+
+// CHECK-LABEL: define{{.*}}@test_index_i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 %1)
+extern "C" __device__ int test_index_i32(int a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_long
+// CHECK: [[TRUNC:%.*]] = trunc i64 %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[TRUNC]], i32 %1)
+extern "C" __device__ int test_index_long(long a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_short
+// CHECK: [[EXT:%.*]] = sext i16 %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1)
+extern "C" __device__ int test_index_short(short a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_float
+// CHECK: [[CONV:%.*]] = fptosi float %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CONV]], i32 %1)
+extern "C" __device__ int test_index_float(float a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_struct
+// CHECK: [[CALL:%.*]] = call noundef i32 @_ZNK11ConvertiblecviEv(
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CALL]], i32 %{{[0-9]+}})
+struct Convertible {
+  int value;
+  __device__ operator int() const { return value; }
+};
+
+extern "C" __device__ int test_index_struct(Convertible a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+template<typename T>
+__device__ int test_template(T a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_template_short
+// CHECK: [[EXT:%.*]] = sext i16 %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1)
+extern "C" __device__ int test_template_short(short a, int c) {
+   return test_template<short>(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_float
+// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[BITCAST]])
+// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float
+// CHECK: ret float [[RESULT]]
+extern "C" __device__ float test_source_float(int a, float c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_v2i16
+// CHECK: [[BITCAST:%.*]] = bitcast <2 x i16> %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[BITCAST]])
+// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to <2 x i16>
+// CHECK: ret <2 x i16> [[RESULT]]
+extern "C" __device__ short2 test_source_v2i16(int a, short2 c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_float16
+// CHECK: [[BITCAST:%.*]] = bitcast half %1 to i16
+// CHECK: [[ZEXT:%.*]] = zext i16 [[BITCAST]] to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[ZEXT]])
+// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16
+// CHECK: [[RESULT:%.*]] = bitcast i16 [[TRUNC]] to half
+// CHECK: ret half [[RESULT]]
+extern "C" __device__ _Float16 test_source_float16(int a, _Float16 c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_short
+// CHECK: [[SEXT:%.*]] = sext i16 %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[SEXT]])
+// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16
+// CHECK: ret i16 [[TRUNC]]
+extern "C" __device__ short test_source_short(int a, short c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_ushort
+// CHECK: [[ZEXT:%.*]] = zext i16 %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[ZEXT]])
+// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16
+// CHECK: ret i16 [[TRUNC]]
+extern "C" __device__ unsigned short test_source_ushort(int a, unsigned short 
c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_long
+// CHECK: [[TRUNC:%.*]] = trunc i64 %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[TRUNC]])
+// CHECK: [[SEXT:%.*]] = sext i32 [[CALL]] to i64
+// CHECK: ret i64 [[SEXT]]
+extern "C" __device__ long test_source_long(int a, long c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_ulong
+// CHECK: [[TRUNC:%.*]] = trunc i64 %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[TRUNC]])
+// CHECK: [[ZEXT:%.*]] = zext i32 [[CALL]] to i64
+// CHECK: ret i64 [[ZEXT]]
+extern "C" __device__ unsigned long test_source_ulong(int a, unsigned long c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_double
+// CHECK: [[FPTRUNC:%.*]] = fptrunc contract double %1 to float
+// CHECK: [[BITCAST:%.*]] = bitcast float [[FPTRUNC]] to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[BITCAST]])
+// CHECK: [[BITCAST2:%.*]] = bitcast i32 [[CALL]] to float
+// CHECK: [[FPEXT:%.*]] = fpext contract float [[BITCAST2]] to double
+// CHECK: ret double [[FPEXT]]
+extern "C" __device__ double test_source_double(int a, double c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_template_float_src
+// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[BITCAST]])
+// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float
+// CHECK: ret float [[RESULT]]
+template<typename T>
+__device__ T test_template_src(int a, T c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+extern "C" __device__ float test_template_float_src(int a, float c) {
+   return test_template_src<float>(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_struct_float
+// CHECK: [[CALL:%.*]] = call {{.*}} float @_ZNK16FloatConvertiblecvfEv(
+// CHECK: [[CONV:%.*]] = bitcast float [[CALL]] to i32
+// CHECK: [[RESULT:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 
[[CONV]])
+// CHECK: [[CONV2:%.*]] = bitcast i32 [[RESULT]] to float
+// CHECK: ret float [[CONV2]]
+struct FloatConvertible {
+  float value;
+  __device__ operator float() const { return value; }
+};
+
+extern "C" __device__ float test_source_struct_float(int a, FloatConvertible 
c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip 
b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
new file mode 100644
index 0000000000000..937d28a117af8
--- /dev/null
+++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
@@ -0,0 +1,33 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -fsyntax-only -verify -triple amdgcn-amd-amdhsa -aux-triple 
x86_64-pc-linux-gnu -Wno-unused-value %s
+// RUN: %clang_cc1 -fsyntax-only -verify -triple x86_64-pc-linux-gnu 
-aux-triple amdgcn -Wno-unused-value %s
+#define __device__ __attribute__((device))
+
+typedef short short2 __attribute__((ext_vector_type(2)));
+typedef int int2 __attribute__((ext_vector_type(2)));
+
+struct A { int a; };
+
+__device__ void test_invalid_num_args(int a, int b, int c) {
+   __builtin_amdgcn_ds_bpermute(a, b, c); // expected-error {{too many 
arguments to function call, expected 2, have 3}}
+}
+
+__device__ void test_invalid_index(short2 a, int b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'short2' 
(vector of 2 'short' values) to incompatible type 'int'}}
+}
+
+__device__ void test_invalid_vector_src(int a, int2 b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or 
pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 
bits or less, but type 'int2' (vector of 2 'int' values) has a size of 64 bits}}
+}
+
+__device__ void test_warn_long_src(int a, long b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit 
argument will be truncated to 32 bits in this call to 
__builtin_amdgcn_ds_bpermute}}
+}
+
+__device__ void test_warn_pointer_src(int a, void* b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or 
pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 
bits or less, but type 'void *' has a size of 64 bits}}
+}
+
+__device__ void test_invalid_struct_src(int a, A b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'A' to 
incompatible type 'int'}}
+}
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index cf82f7f06a693..e7e360ca21879 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2754,10 +2754,10 @@ def int_amdgcn_ds_permute :
     [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // llvm.amdgcn.ds.bpermute <index> <src>
-def int_amdgcn_ds_bpermute :
-  ClangBuiltin<"__builtin_amdgcn_ds_bpermute">,
-  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
-     [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+def int_amdgcn_ds_bpermute
+    : Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [
+      IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree
+    ]>;
 
 // llvm.amdgcn.perm <src0> <src1> <selector>
 def int_amdgcn_perm :

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_ds_bpermute argument types (PR #153501)

Reply via email to