https://github.com/yxsamliu updated https://github.com/llvm/llvm-project/pull/153501
>From 8fef1b8dd9d91513c705c460e868ac5507eb4aea Mon Sep 17 00:00:00 2001 From: "Yaxun (Sam) Liu" <yaxun....@amd.com> Date: Wed, 13 Aug 2025 09:18:36 -0400 Subject: [PATCH 1/7] [AMDGPU] Extend __builtin_amdgcn_ds_bpermute argument types This patch enhances the __builtin_amdgcn_ds_bpermute builtin to support a wider variety of argument types, improving its usability and flexibility. Previously, the builtin was restricted to int(int, int), forcing users to manually bit-cast arguments and results. This change makes the builtin polymorphic, so the return type now matches the type of the source (second) argument. The builtin now accepts various scalar integer and floating-point types, pointers, and vector types up to 32 bits. It also supports C++ classes with user-defined conversions to a 32-bit type. The implementation adds custom type-checking in Sema and a dedicated CodeGen handler to correctly coerce types to and from the underlying i32 intrinsic. New diagnostics are included to warn about truncation and reject oversized vector arguments. Comprehensive Sema and CodeGen tests are added to validate the new behavior. This change introduces a new section in AMDGPUSupport.rst documenting Clang's AMDGPU-specific builtin functions, starting with __builtin_amdgcn_ds_bpermute. The goal is to provide clear, centralized reference material for developers writing performance-critical GPU kernels, as demand for accessible documentation on AMDGPU builtins has grown. Future updates will extend this section to cover additional builtins and their semantics, usage patterns, and target requirements. --- clang/docs/AMDGPUSupport.rst | 108 ++++++++++++ clang/include/clang/Basic/BuiltinsAMDGPU.def | 2 +- .../clang/Basic/DiagnosticSemaKinds.td | 7 + clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp | 117 +++++++++++++ clang/lib/Sema/SemaAMDGPU.cpp | 128 ++++++++++++++ clang/lib/Sema/SemaChecking.cpp | 3 +- .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 163 ++++++++++++++++++ .../SemaHIP/builtin-amdgcn-ds-bpermute.hip | 33 ++++ llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 8 +- 9 files changed, 563 insertions(+), 6 deletions(-) create mode 100644 clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip create mode 100644 clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip diff --git a/clang/docs/AMDGPUSupport.rst b/clang/docs/AMDGPUSupport.rst index 3eada5f900613..b6006fe05c485 100644 --- a/clang/docs/AMDGPUSupport.rst +++ b/clang/docs/AMDGPUSupport.rst @@ -61,3 +61,111 @@ Predefined Macros - Defined if FP64 instruction is available (deprecated). Please note that the specific architecture and feature names will vary depending on the GPU. Also, some macros are deprecated and may be removed in future releases. + +AMDGPU Builtins +=============== + +Clang provides a set of builtins to access low-level, AMDGPU-specific hardware features directly from C, C++, OpenCL C, and HIP. These builtins often map directly to a single machine instruction. + +.. _builtin-amdgcn-ds-bpermute: + +``__builtin_amdgcn_ds_bpermute`` +-------------------------------- + +Performs a backward (pull) permutation of values within a wavefront. This builtin compiles to the +``ds_bpermute_b32`` instruction and implements a "read from lane" semantic using a **byte-based** +address. + +**Syntax** + +.. code-block:: c++ + + T __builtin_amdgcn_ds_bpermute(int index, T src); + +**Summary** + +All active lanes in the current wavefront conceptually place their ``src`` payloads into an +internal cross-lane buffer. 
Each lane then reads a 32-bit value from that buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return type.
+
+Availability
+~~~~~~~~~~~~
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+~~~~~~~~~~
+
+- ``index`` (``int``): Byte offset used to select the source lane. Hardware only consumes bits
+  ``[7:2]``. To read the 32-bit value from lane *i*, pass ``i * 4`` as the index.
+  Indices that select lanes outside the current wave size or lanes that are inactive at the call
+  site yield an unspecified value (commonly zero on current hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. This value is converted to a
+  32-bit payload, permuted, and then converted back to type ``T`` as described below.
+
+Type ``T`` and Conversions
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The instruction uses a 32-bit payload. The builtin accepts ``T`` only if it can be mapped to and
+from 32 bits.
+
+Accepted ``T``:
+
+- Scalar integer and floating-point types
+- Vectors with total size ≤ 32 bits
+- Pointers with representation size ≤ 32 bits
+- C++ classes with a user-defined conversion to a supported 32-bit type
+
+Conversion rules:
+
+- Built-in scalar types of 32 bits or less (e.g., ``char``, ``short``, ``int``, ``uint32_t``,
+  ``float``): bitcast to/from i32.
+- ``double``: convert to ``float`` before permutation; convert back to ``double`` after (may warn).
+- Integers wider than 32 bits (e.g., ``long long``, ``__int128``): convert to i32 before
+  permutation; convert back after (may warn).
+- Vectors:
+
+  - Size ≤ 32 bits: bitcast to/from i32.
+  - Size > 32 bits: not allowed (error).
+
+- Pointers:
+
+  - Size ≤ 32 bits: bitcast to/from i32.
+  - Size > 32 bits: not allowed (error).
+
+- C++ classes: prefer a user-defined conversion to a 32-bit type; otherwise use a conversion to
+  i32 if available; otherwise reject.
+
+.. note::
+   Narrowing may lose information. For exact bit patterns, use a 32-bit trivially copyable type
+   (e.g., ``uint32_t``).
+
+Semantics and Guarantees
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Active lane participation**: Only lanes active in the EXEC mask at the call site
+  contribute a payload. Reading from an inactive source lane produces an unspecified value.
+
+* **Index per lane**: ``index`` may vary across lanes. Only bits ``[7:2]`` are used for lane
+  selection. Bits outside this range are ignored by hardware.
+
+* **No synchronization**: The builtin does not synchronize lanes or waves and does not
+  order memory operations. It does not read or write user-visible LDS.
+
+* **Wave size**: Valid source lanes are ``0 .. warpSize-1`` (use ``warpSize`` or an equivalent
+  query to determine whether the wave size is 32 or 64). Selecting lanes outside that range
+  yields an unspecified value.
+
+Examples
+~~~~~~~~
+
+Reverse within a wavefront (handles wave32 or wave64):
+
+.. 
code-block:: c++ + + #include <hip/hip_runtime.h> + + __global__ void wavefront_reverse(float* data, int n) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + if (tid >= n) return; + + int lane = threadIdx.x % warpSize; // works for 32 or 64 + int peer = (warpSize - 1) - lane; // reversed lane + int offset = peer * 4; // byte address + + float my_val = data[tid]; + float reversed = __builtin_amdgcn_ds_bpermute(offset, my_val); + data[tid] = reversed; + } diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index f8f55772db8fe..8c1011d7bba61 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -121,7 +121,7 @@ BUILTIN(__builtin_amdgcn_s_decperflevel, "vIi", "n") BUILTIN(__builtin_amdgcn_s_setprio, "vIs", "n") BUILTIN(__builtin_amdgcn_ds_swizzle, "iiIi", "nc") BUILTIN(__builtin_amdgcn_ds_permute, "iii", "nc") -BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc") +BUILTIN(__builtin_amdgcn_ds_bpermute, "v.", "nct") BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc") BUILTIN(__builtin_amdgcn_readlane, "iii", "nc") BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc") diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td index 116341f4b66d5..d57a422f5fda5 100644 --- a/clang/include/clang/Basic/DiagnosticSemaKinds.td +++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td @@ -13547,4 +13547,11 @@ def warn_acc_var_referenced_lacks_op // AMDGCN builtins diagnostics def err_amdgcn_load_lds_size_invalid_value : Error<"invalid size value">; def note_amdgcn_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">; +def err_amdgcn_builtin_vector_pointer_arg_size + : Error<"the vector or pointer argument to %0 must have a total size of %1 " + "bits or less, but type %2 has a size of %3 bits">; +def warn_amdgcn_builtin_arg_truncation + : Warning<"the %0-bit argument will be truncated to %1 bits in this call " + "to %2">, + InGroup<DiagGroup<"amdgcn-builtin-arg-trunc">>; } // end of sema component. diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index dad1f95ac710d..91a414418f6a6 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -159,6 +159,119 @@ Value *EmitAMDGPUGridSize(CodeGenFunction &CGF, unsigned Index) { llvm::MDNode::get(CGF.getLLVMContext(), {})); return LD; } +// Lowers __builtin_amdgcn_ds_bpermute to the corresponding LLVM intrinsic with +// careful bit-level coercions of operands and result to match Clang types. +llvm::Value *emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF, + const clang::CallExpr *Call) { + auto &Builder = CGF.Builder; + auto &CGM = CGF.CGM; + const llvm::DataLayout &DL = CGM.getDataLayout(); + + llvm::Type *I32Ty = Builder.getInt32Ty(); + + auto GetBitWidth = [&](llvm::Type *Ty) -> unsigned { + return DL.getTypeSizeInBits(Ty).getFixedValue(); + }; + + // Coerces arbitrary scalar/vector/pointer to i32 by preserving value/bit + // semantics where applicable. 
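+  // For example, a signed 'short' is sign-extended to i32, an unsigned
+  // narrow integer is zero-extended, a 'float' is bitcast to i32, and a
+  // pointer goes through ptrtoint before being resized to 32 bits.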
+ auto ToI32Bits = [&](llvm::Value *Val, clang::QualType Qt) -> llvm::Value * { + llvm::Type *Ty = Val->getType(); + + if (Ty->isIntegerTy()) { + unsigned BitWidth = Ty->getIntegerBitWidth(); + if (BitWidth < 32) { + if (Qt->isSignedIntegerType()) + return Builder.CreateSExt(Val, I32Ty); + else + return Builder.CreateZExt(Val, I32Ty); + } else + return Builder.CreateZExtOrTrunc(Val, I32Ty); + } + + if (Ty->isPointerTy()) { + unsigned PtrBits = DL.getPointerSizeInBits(Ty->getPointerAddressSpace()); + llvm::Type *IntPtrTy = Builder.getIntNTy(PtrBits); + llvm::Value *AsInt = Builder.CreatePtrToInt(Val, IntPtrTy); + return Builder.CreateZExtOrTrunc(AsInt, I32Ty); + } + + unsigned Bits = GetBitWidth(Ty); + llvm::Type *IntN = Builder.getIntNTy(Bits); + llvm::Value *AsInt = Builder.CreateBitCast(Val, IntN); + return Builder.CreateZExtOrTrunc(AsInt, I32Ty); + }; + + // Bit-preserving resize/cast between arbitrary source and destination LLVM + // types. + auto BitCoerceTo = [&](llvm::Value *Val, llvm::Type *DstTy) -> llvm::Value * { + llvm::Type *SrcTy = Val->getType(); + if (SrcTy == DstTy) + return Val; + + unsigned SrcBits = DL.getTypeSizeInBits(SrcTy).getFixedValue(); + unsigned DstBits = DL.getTypeSizeInBits(DstTy).getFixedValue(); + + if (SrcTy->isIntegerTy() && DstTy->isIntegerTy()) + return Builder.CreateZExtOrTrunc(Val, DstTy); + + if (SrcBits == DstBits) + return Builder.CreateBitCast(Val, DstTy); + + llvm::Type *IntSrcTy = Builder.getIntNTy(SrcBits); + llvm::Value *AsInt = Val; + if (SrcTy->isPointerTy()) + AsInt = Builder.CreatePtrToInt(Val, IntSrcTy); + else if (!SrcTy->isIntegerTy()) + AsInt = Builder.CreateBitCast(Val, IntSrcTy); + + llvm::Type *IntDstTy = Builder.getIntNTy(DstBits); + llvm::Value *Resized = Builder.CreateZExtOrTrunc(AsInt, IntDstTy); + + if (DstTy->isPointerTy()) + return Builder.CreateIntToPtr(Resized, DstTy); + + return Builder.CreateBitCast(Resized, DstTy); + }; + + llvm::Value *Index = CGF.EmitScalarExpr(Call->getArg(0)); + llvm::Value *Source = CGF.EmitScalarExpr(Call->getArg(1)); + + llvm::Type *ReturnTy = CGF.ConvertType(Call->getType()); + + llvm::Value *IndexI32 = ToI32Bits(Index, Call->getArg(0)->getType()); + + llvm::Value *SourceForIntrinsic; + llvm::Type *SourceTy = Source->getType(); + + if (SourceTy->isDoubleTy()) { + llvm::Value *AsFloat = Builder.CreateFPTrunc(Source, Builder.getFloatTy()); + SourceForIntrinsic = Builder.CreateBitCast(AsFloat, I32Ty); + } else + SourceForIntrinsic = ToI32Bits(Source, Call->getArg(1)->getType()); + + llvm::Function *IntrinsicFn = + CGM.getIntrinsic(llvm::Intrinsic::amdgcn_ds_bpermute); + + llvm::Value *Result = + Builder.CreateCall(IntrinsicFn, {IndexI32, SourceForIntrinsic}); + + if (ReturnTy->isDoubleTy()) { + llvm::Value *AsFloat = Builder.CreateBitCast(Result, Builder.getFloatTy()); + return Builder.CreateFPExt(AsFloat, ReturnTy); + } + + if (ReturnTy->isIntegerTy() && ReturnTy->getIntegerBitWidth() > 32) { + clang::QualType SourceQt = Call->getArg(1)->getType(); + if (SourceQt->isSignedIntegerType()) + return Builder.CreateSExt(Result, ReturnTy); + else + return Builder.CreateZExt(Result, ReturnTy); + } + + return BitCoerceTo(Result, ReturnTy); +} + } // namespace // Generates the IR for __builtin_read_exec_*. 
@@ -341,6 +454,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, case AMDGPU::BI__builtin_amdgcn_ds_swizzle: return emitBuiltinWithOneOverloadedType<2>(*this, E, Intrinsic::amdgcn_ds_swizzle); + + case AMDGPU::BI__builtin_amdgcn_ds_bpermute: + return emitAMDGCNDsBpermute(*this, E); + case AMDGPU::BI__builtin_amdgcn_mov_dpp8: case AMDGPU::BI__builtin_amdgcn_mov_dpp: case AMDGPU::BI__builtin_amdgcn_update_dpp: { diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp index 1913bb830ccd0..e29e8e7fbc3d4 100644 --- a/clang/lib/Sema/SemaAMDGPU.cpp +++ b/clang/lib/Sema/SemaAMDGPU.cpp @@ -18,6 +18,132 @@ #include "llvm/Support/AtomicOrdering.h" #include <cstdint> +namespace { + +using llvm::StringRef; +using namespace clang; + +/// Attempts to apply a user-defined conversion on Arg at ArgIndex to a +/// 32-bit-compatible type. If successful, updates TheCall's argument. Returns +/// true if a suitable conversion was applied. +bool tryUserDefinedConversion32Bit(Sema &SemaRef, Expr *Arg, CallExpr *TheCall, + unsigned ArgIndex) { + const CXXRecordDecl *RecordDecl = Arg->getType()->getAsCXXRecordDecl(); + if (!RecordDecl) + return false; + + // Iterate over class conversion operators and pick the first that yields a + // 32-bit type. + for (CXXMethodDecl *MethodDecl : RecordDecl->methods()) { + if (auto *ConversionDecl = dyn_cast<CXXConversionDecl>(MethodDecl)) { + QualType ConvType = ConversionDecl->getConversionType(); + + bool Is32Bit = false; + auto SizeIs32 = [&](QualType T) { + return SemaRef.Context.getTypeSize(T) == 32; + }; + + // Classify 32-bit-compatible target types with target-dependent size + // checks where needed. + if (const auto *BT = ConvType->getAs<BuiltinType>()) + Is32Bit = (BT->getKind() == BuiltinType::Float) || + ((BT->getKind() == BuiltinType::Int || + BT->getKind() == BuiltinType::UInt) && + SizeIs32(ConvType)); + else if (ConvType->isPointerType() || ConvType->isVectorType()) + Is32Bit = SizeIs32(ConvType); + + if (Is32Bit) { + ExprResult ConvResult = SemaRef.PerformImplicitConversion( + Arg, ConvType, AssignmentAction::Converting); + if (ConvResult.isInvalid()) + return false; + + TheCall->setArg(ArgIndex, ConvResult.get()); + return true; + } + } + } + + return false; +} + +/// Handles and coerces a "payload" argument at ArgIndex to a 32-bit-compatible +/// type. On success, also sets the call result type to the argument's resulting +/// type. Returns true on error (diagnosed), false on success. +bool handle32BitPayloadArg(Sema &SemaRef, CallExpr *TheCall, unsigned ArgIndex, + StringRef BuiltinName) { + ASTContext &AstContext = SemaRef.getASTContext(); + Expr *Arg = TheCall->getArg(ArgIndex); + QualType Type = Arg->getType(); + QualType Int32Ty = AstContext.IntTy; + + if (Type->isVectorType() || Type->isPointerType()) { + uint64_t Size = AstContext.getTypeSize(Type); + if (Size > 32) { + SemaRef.Diag(Arg->getBeginLoc(), + diag::err_amdgcn_builtin_vector_pointer_arg_size) + << BuiltinName << 32 << Type << Size << Arg->getSourceRange(); + return true; + } + } else if (Type->isScalarType()) { + uint64_t Size = AstContext.getTypeSize(Type); + if (Size > 32) + SemaRef.Diag(Arg->getBeginLoc(), diag::warn_amdgcn_builtin_arg_truncation) + << Size << 32 << BuiltinName << Arg->getSourceRange(); + } else { + // Prefer user-defined conversion operators that yield a 32-bit-compatible + // type. If none apply, fall back to an implicit int32 conversion. 
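+    // For instance, a (hypothetical) class such as
+    //   struct P { float v; __device__ operator float() const { return v; } };
+    // converts via 'operator float' and is accepted, while an aggregate with
+    // no conversion operator fails the int32 fallback below and is diagnosed.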
+ if (!tryUserDefinedConversion32Bit(SemaRef, Arg, TheCall, ArgIndex)) { + ExprResult ConvResult = SemaRef.PerformImplicitConversion( + Arg, Int32Ty, AssignmentAction::Converting); + if (ConvResult.isInvalid()) + return true; + TheCall->setArg(ArgIndex, ConvResult.get()); + } + } + + TheCall->setType(TheCall->getArg(ArgIndex)->getType()); + return false; +} + +/// Validates and coerces the arguments to __builtin_amdgcn_ds_bpermute. +/// Ensures arg0 is int32 (with truncation warning as needed) and applies the +/// 32-bit payload handler to arg1. +bool checkDsBpermuteFunctionCall(Sema &SemaRef, CallExpr *TheCall) { + if (SemaRef.checkArgCount(TheCall, 2)) + return true; + + ASTContext &AstContext = SemaRef.getASTContext(); + + const FunctionDecl *FuncDecl = TheCall->getDirectCallee(); + StringRef BuiltinName = FuncDecl ? FuncDecl->getName() + : StringRef("__builtin_amdgcn_ds_bpermute"); + + Expr *IndexArg = TheCall->getArg(0); + QualType Int32Ty = AstContext.IntTy; + + if (AstContext.getTypeSize(IndexArg->getType()) > 32 && + !IndexArg->getType()->isRecordType()) + SemaRef.Diag(IndexArg->getBeginLoc(), + diag::warn_amdgcn_builtin_arg_truncation) + << AstContext.getTypeSize(IndexArg->getType()) << 32 << BuiltinName + << IndexArg->getSourceRange(); + + ExprResult ConvResult = SemaRef.PerformImplicitConversion( + IndexArg, Int32Ty, AssignmentAction::Converting); + if (ConvResult.isInvalid()) + return true; + TheCall->setArg(0, ConvResult.get()); + + if (handle32BitPayloadArg(SemaRef, TheCall, /*ArgIndex=*/1, BuiltinName)) + return true; + + return false; +} + +} // anonymous namespace + namespace clang { SemaAMDGPU::SemaAMDGPU(Sema &S) : SemaBase(S) {} @@ -100,6 +226,8 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID, case AMDGPU::BI__builtin_amdgcn_cvt_scale_pk16_f32_fp6: case AMDGPU::BI__builtin_amdgcn_cvt_scale_pk16_f32_bf6: return SemaRef.BuiltinConstantArgRange(TheCall, 2, 0, 7); + case AMDGPU::BI__builtin_amdgcn_ds_bpermute: + return checkDsBpermuteFunctionCall(SemaRef, TheCall); default: return false; } diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp index 907740374dbfe..921f0b037b9f5 100644 --- a/clang/lib/Sema/SemaChecking.cpp +++ b/clang/lib/Sema/SemaChecking.cpp @@ -2078,7 +2078,8 @@ bool Sema::CheckTSBuiltinFunctionCall(const TargetInfo &TI, unsigned BuiltinID, case llvm::Triple::spirv64: if (TI.getTriple().getOS() != llvm::Triple::OSType::AMDHSA) return SPIRV().CheckSPIRVBuiltinFunctionCall(TI, BuiltinID, TheCall); - return false; + else + return AMDGPU().CheckAMDGCNBuiltinFunctionCall(BuiltinID, TheCall); case llvm::Triple::systemz: return SystemZ().CheckSystemZBuiltinFunctionCall(BuiltinID, TheCall); case llvm::Triple::x86: diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip new file mode 100644 index 0000000000000..00c02ccc7638f --- /dev/null +++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip @@ -0,0 +1,163 @@ +// REQUIRES: amdgpu-registered-target +// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1200 -aux-triple x86_64-pc-linux-gnu \ +// RUN: -fcuda-is-device -emit-llvm -o - %s | FileCheck %s + +typedef short short2 __attribute__((vector_size(4))); + +#define __device__ __attribute__((device)) + +// CHECK-LABEL: define{{.*}}@test_index_i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 %1) +extern "C" __device__ int test_index_i32(int a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: 
define{{.*}}@test_index_long +// CHECK: [[TRUNC:%.*]] = trunc i64 %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[TRUNC]], i32 %1) +extern "C" __device__ int test_index_long(long a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_index_short +// CHECK: [[EXT:%.*]] = sext i16 %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1) +extern "C" __device__ int test_index_short(short a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_index_float +// CHECK: [[CONV:%.*]] = fptosi float %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CONV]], i32 %1) +extern "C" __device__ int test_index_float(float a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_index_struct +// CHECK: [[CALL:%.*]] = call noundef i32 @_ZNK11ConvertiblecviEv( +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CALL]], i32 %{{[0-9]+}}) +struct Convertible { + int value; + __device__ operator int() const { return value; } +}; + +extern "C" __device__ int test_index_struct(Convertible a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +template<typename T> +__device__ int test_template(T a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_template_short +// CHECK: [[EXT:%.*]] = sext i16 %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1) +extern "C" __device__ int test_template_short(short a, int c) { + return test_template<short>(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_float +// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float +// CHECK: ret float [[RESULT]] +extern "C" __device__ float test_source_float(int a, float c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_v2i16 +// CHECK: [[BITCAST:%.*]] = bitcast <2 x i16> %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to <2 x i16> +// CHECK: ret <2 x i16> [[RESULT]] +extern "C" __device__ short2 test_source_v2i16(int a, short2 c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_float16 +// CHECK: [[BITCAST:%.*]] = bitcast half %1 to i16 +// CHECK: [[ZEXT:%.*]] = zext i16 [[BITCAST]] to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]]) +// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16 +// CHECK: [[RESULT:%.*]] = bitcast i16 [[TRUNC]] to half +// CHECK: ret half [[RESULT]] +extern "C" __device__ _Float16 test_source_float16(int a, _Float16 c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_short +// CHECK: [[SEXT:%.*]] = sext i16 %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[SEXT]]) +// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16 +// CHECK: ret i16 [[TRUNC]] +extern "C" __device__ short test_source_short(int a, short c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_ushort +// CHECK: [[ZEXT:%.*]] = zext i16 %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]]) +// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16 +// CHECK: ret i16 [[TRUNC]] +extern "C" __device__ unsigned short 
test_source_ushort(int a, unsigned short c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_long +// CHECK: [[TRUNC:%.*]] = trunc i64 %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[TRUNC]]) +// CHECK: [[SEXT:%.*]] = sext i32 [[CALL]] to i64 +// CHECK: ret i64 [[SEXT]] +extern "C" __device__ long test_source_long(int a, long c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_ulong +// CHECK: [[TRUNC:%.*]] = trunc i64 %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[TRUNC]]) +// CHECK: [[ZEXT:%.*]] = zext i32 [[CALL]] to i64 +// CHECK: ret i64 [[ZEXT]] +extern "C" __device__ unsigned long test_source_ulong(int a, unsigned long c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_double +// CHECK: [[FPTRUNC:%.*]] = fptrunc contract double %1 to float +// CHECK: [[BITCAST:%.*]] = bitcast float [[FPTRUNC]] to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[BITCAST2:%.*]] = bitcast i32 [[CALL]] to float +// CHECK: [[FPEXT:%.*]] = fpext contract float [[BITCAST2]] to double +// CHECK: ret double [[FPEXT]] +extern "C" __device__ double test_source_double(int a, double c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_template_float_src +// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float +// CHECK: ret float [[RESULT]] +template<typename T> +__device__ T test_template_src(int a, T c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +extern "C" __device__ float test_template_float_src(int a, float c) { + return test_template_src<float>(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_struct_float +// CHECK: [[CALL:%.*]] = call {{.*}} float @_ZNK16FloatConvertiblecvfEv( +// CHECK: [[CONV:%.*]] = bitcast float [[CALL]] to i32 +// CHECK: [[RESULT:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[CONV]]) +// CHECK: [[CONV2:%.*]] = bitcast i32 [[RESULT]] to float +// CHECK: ret float [[CONV2]] +struct FloatConvertible { + float value; + __device__ operator float() const { return value; } +}; + +extern "C" __device__ float test_source_struct_float(int a, FloatConvertible c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip new file mode 100644 index 0000000000000..937d28a117af8 --- /dev/null +++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip @@ -0,0 +1,33 @@ +// REQUIRES: amdgpu-registered-target +// RUN: %clang_cc1 -fsyntax-only -verify -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -Wno-unused-value %s +// RUN: %clang_cc1 -fsyntax-only -verify -triple x86_64-pc-linux-gnu -aux-triple amdgcn -Wno-unused-value %s +#define __device__ __attribute__((device)) + +typedef short short2 __attribute__((ext_vector_type(2))); +typedef int int2 __attribute__((ext_vector_type(2))); + +struct A { int a; }; + +__device__ void test_invalid_num_args(int a, int b, int c) { + __builtin_amdgcn_ds_bpermute(a, b, c); // expected-error {{too many arguments to function call, expected 2, have 3}} +} + +__device__ void test_invalid_index(short2 a, int b) { + __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'short2' (vector of 2 
'short' values) to incompatible type 'int'}} +} + +__device__ void test_invalid_vector_src(int a, int2 b) { + __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 bits or less, but type 'int2' (vector of 2 'int' values) has a size of 64 bits}} +} + +__device__ void test_warn_long_src(int a, long b) { + __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit argument will be truncated to 32 bits in this call to __builtin_amdgcn_ds_bpermute}} +} + +__device__ void test_warn_pointer_src(int a, void* b) { + __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 bits or less, but type 'void *' has a size of 64 bits}} +} + +__device__ void test_invalid_struct_src(int a, A b) { + __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'A' to incompatible type 'int'}} +} diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index cf82f7f06a693..e7e360ca21879 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td @@ -2754,10 +2754,10 @@ def int_amdgcn_ds_permute : [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>; // llvm.amdgcn.ds.bpermute <index> <src> -def int_amdgcn_ds_bpermute : - ClangBuiltin<"__builtin_amdgcn_ds_bpermute">, - Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], - [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>; +def int_amdgcn_ds_bpermute + : Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [ + IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree + ]>; // llvm.amdgcn.perm <src0> <src1> <selector> def int_amdgcn_perm : >From 2b58c0cb5a4d6e28f82b5ba129b8d0df18ab84a2 Mon Sep 17 00:00:00 2001 From: "Yaxun (Sam) Liu" <yaxun....@amd.com> Date: Thu, 21 Aug 2025 14:18:38 -0400 Subject: [PATCH 2/7] handle types wider than 32 bit by chunks --- clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp | 291 ++++++++++++++------ 1 file changed, 213 insertions(+), 78 deletions(-) diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index 91a414418f6a6..e61db0bffec47 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -159,119 +159,254 @@ Value *EmitAMDGPUGridSize(CodeGenFunction &CGF, unsigned Index) { llvm::MDNode::get(CGF.getLLVMContext(), {})); return LD; } -// Lowers __builtin_amdgcn_ds_bpermute to the corresponding LLVM intrinsic with -// careful bit-level coercions of operands and result to match Clang types. -llvm::Value *emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF, - const clang::CallExpr *Call) { - auto &Builder = CGF.Builder; + +// Lower __builtin_amdgcn_ds_bpermute to llvm.amdgcn.ds.bpermute for arbitrary +// builtin, vector, and aggregate (struct/array/complex) source types. +// Assumptions: +// - Return type equals source type (frontend/Sema should enforce). +// - Semantics are on the object representation (raw bits), including padding. +// - For payloads > 32 bits, split into 32-bit words, permute each with the same index, +// and reassemble. +// - First-class scalar/vector values whose total size is a multiple of 32 bits use a +// register-only path by bitcasting to <N x i32>. Aggregates or odd sizes use a +// memory-backed path. 
+// - <= 32-bit scalars (char/short/int/float/half) follow a fast i32 path for performance.
+llvm::Value *
+emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF,
+                     const clang::CallExpr *Call) {
+  auto &B = CGF.Builder;
   auto &CGM = CGF.CGM;
   const llvm::DataLayout &DL = CGM.getDataLayout();
-  llvm::Type *I32Ty = Builder.getInt32Ty();
+  llvm::Type *I8 = B.getInt8Ty();
+  llvm::Type *I32 = B.getInt32Ty();
+  llvm::Type *I64 = B.getInt64Ty();
+
+  auto c32 = [&](uint32_t v) { return llvm::ConstantInt::get(I32, v); };
+  auto c64 = [&](uint64_t v) { return llvm::ConstantInt::get(I64, v); };
 
-  auto GetBitWidth = [&](llvm::Type *Ty) -> unsigned {
+  // Returns the DL-based bit width of a type.
+  auto getBitWidth = [&](llvm::Type *Ty) -> unsigned {
     return DL.getTypeSizeInBits(Ty).getFixedValue();
   };
 
-  // Coerces arbitrary scalar/vector/pointer to i32 by preserving value/bit
-  // semantics where applicable.
-  auto ToI32Bits = [&](llvm::Value *Val, clang::QualType Qt) -> llvm::Value * {
-    llvm::Type *Ty = Val->getType();
+  // Coerces the index to i32 (value semantics).
+  // - Integers: zext/trunc to i32.
+  // - Pointers: ptrtoint to intptr, then zext/trunc to i32.
+  // - Other first-class: bitcast to intN then zext/trunc to i32.
+  auto toI32Index = [&](llvm::Value *IdxVal, clang::QualType IdxQT) -> llvm::Value * {
+    (void)IdxQT; // signedness not relevant for index
+    llvm::Type *Ty = IdxVal->getType();
+    if (Ty->isIntegerTy())
+      return B.CreateZExtOrTrunc(IdxVal, I32);
+    if (Ty->isPointerTy()) {
+      unsigned PtrBits = DL.getPointerSizeInBits(Ty->getPointerAddressSpace());
+      return B.CreateZExtOrTrunc(B.CreatePtrToInt(IdxVal, B.getIntNTy(PtrBits)), I32);
+    }
+    unsigned Bits = getBitWidth(Ty);
+    return B.CreateZExtOrTrunc(B.CreateBitCast(IdxVal, B.getIntNTy(Bits)), I32);
+  };
+
+  // Coerces an arbitrary <= 32-bit scalar payload to i32.
+  // - Integers: extend to i32 honoring signedness if narrower; zext/trunc otherwise.
+  // - Pointers: ptrtoint to intptr, then zext/trunc to i32.
+  // - Other first-class scalars (e.g., float, half): bitcast to intN then zext/trunc to i32.
+  auto coercePayloadToI32 = [&](llvm::Value *Val, clang::QualType SrcQT) -> llvm::Value * {
+    llvm::Type *Ty = Val->getType();
     if (Ty->isIntegerTy()) {
-      unsigned BitWidth = Ty->getIntegerBitWidth();
-      if (BitWidth < 32) {
-        if (Qt->isSignedIntegerType())
-          return Builder.CreateSExt(Val, I32Ty);
-        else
-          return Builder.CreateZExt(Val, I32Ty);
-      } else
-        return Builder.CreateZExtOrTrunc(Val, I32Ty);
+      unsigned BW = Ty->getIntegerBitWidth();
+      if (BW < 32) {
+        if (SrcQT->isSignedIntegerType())
+          return B.CreateSExt(Val, I32);
+        return B.CreateZExt(Val, I32);
+      }
+      return B.CreateZExtOrTrunc(Val, I32);
     }
-
     if (Ty->isPointerTy()) {
       unsigned PtrBits = DL.getPointerSizeInBits(Ty->getPointerAddressSpace());
-      llvm::Type *IntPtrTy = Builder.getIntNTy(PtrBits);
-      llvm::Value *AsInt = Builder.CreatePtrToInt(Val, IntPtrTy);
-      return Builder.CreateZExtOrTrunc(AsInt, I32Ty);
+      return B.CreateZExtOrTrunc(B.CreatePtrToInt(Val, B.getIntNTy(PtrBits)), I32);
    }
+    unsigned Bits = getBitWidth(Ty);
+    return B.CreateZExtOrTrunc(B.CreateBitCast(Val, B.getIntNTy(Bits)), I32);
+  };
 
-    unsigned Bits = GetBitWidth(Ty);
-    llvm::Type *IntN = Builder.getIntNTy(Bits);
-    llvm::Value *AsInt = Builder.CreateBitCast(Val, IntN);
-    return Builder.CreateZExtOrTrunc(AsInt, I32Ty);
+  // Converts an i32 result back to an arbitrary <= 32-bit destination type.
+  // - Integer <= 32 bits: zext/sext/trunc appropriately using source signedness for narrow types.
+  // - Pointer <= 32 bits: zext/trunc to pointer width and inttoptr.
+  // - Other first-class types:
+  //   - If 32 bits: bitcast i32 to destination type.
+  //   - If narrower than 32 bits (e.g., half = 16): first trunc i32 to iN, then bitcast iN to DstTy.
+  auto coerceFromI32ToType = [&](llvm::Value *I32Val,
+                                 llvm::Type *DstTy,
+                                 clang::QualType SrcQT) -> llvm::Value * {
+    if (DstTy->isIntegerTy()) {
+      unsigned DW = DstTy->getIntegerBitWidth();
+      if (DW < 32) {
+        if (SrcQT->isSignedIntegerType())
+          return B.CreateTrunc(B.CreateSExt(I32Val, B.getIntNTy(32)), DstTy);
+        return B.CreateTrunc(B.CreateZExt(I32Val, B.getIntNTy(32)), DstTy);
+      }
+      if (DW == 32)
+        return B.CreateZExtOrTrunc(I32Val, DstTy);
+    }
+    if (DstTy->isPointerTy()) {
+      unsigned PW = DL.getPointerSizeInBits(DstTy->getPointerAddressSpace());
+      llvm::Value *AsInt = I32Val;
+      if (PW != 32)
+        AsInt = B.CreateZExtOrTrunc(I32Val, B.getIntNTy(PW));
+      return B.CreateIntToPtr(AsInt, DstTy);
+    }
+    unsigned BW = getBitWidth(DstTy);
+    if (BW == 32)
+      return B.CreateBitCast(I32Val, DstTy);
+    // General non-integer, non-pointer narrower-than-32 case (e.g. half = 16).
+    llvm::Type *IntBW = B.getIntNTy(BW);
+    llvm::Value *Tr = I32Val;
+    if (BW < 32)
+      Tr = B.CreateTrunc(I32Val, IntBW);
+    else if (BW > 32)
+      Tr = B.CreateZExt(I32Val, IntBW); // should not happen in the fast 32-bit path
+    return B.CreateBitCast(Tr, DstTy);
   };
 
-  // Bit-preserving resize/cast between arbitrary source and destination LLVM
-  // types.
-  auto BitCoerceTo = [&](llvm::Value *Val, llvm::Type *DstTy) -> llvm::Value * {
-    llvm::Type *SrcTy = Val->getType();
-    if (SrcTy == DstTy)
-      return Val;
+  // Returns {wordCount, tailBytes} for a payload size in bits.
+  auto wordCountAndTail = [&](unsigned totalBits) -> std::pair<unsigned, unsigned> {
+    unsigned bytes = totalBits / 8;
+    return {bytes / 4, bytes % 4};
   };
 
-  unsigned SrcBits = DL.getTypeSizeInBits(SrcTy).getFixedValue();
-  unsigned DstBits = DL.getTypeSizeInBits(DstTy).getFixedValue();
+  // Index as i32
+  llvm::Value *IndexI32 = toI32Index(CGF.EmitScalarExpr(Call->getArg(0)),
+                                     Call->getArg(0)->getType());
+
+  // Underlying intrinsic
+  llvm::Function *Bperm = CGM.getIntrinsic(llvm::Intrinsic::amdgcn_ds_bpermute);
+
+  clang::QualType RetQT = Call->getType();
+  clang::QualType SrcQT = Call->getArg(1)->getType();
+  llvm::Type *RetTy = CGF.ConvertType(RetQT);
+
+  // Check for aggregates (struct/array/complex) from Clang's perspective.
+  bool IsAggregate = RetQT->isAggregateType() || RetQT->isAnyComplexType();
+
+  // Fast path A: <= 32-bit scalar payloads (e.g., char/short/int/float/half).
+  // Keep everything in registers.
+  if (!IsAggregate) {
+    llvm::Value *SrcVal = CGF.EmitScalarExpr(Call->getArg(1));
+    unsigned totalBits = getBitWidth(SrcVal->getType());
+    if (totalBits <= 32) {
+      llvm::Value *SrcI32 = coercePayloadToI32(SrcVal, SrcQT);
+      llvm::SmallVector<llvm::Value *, 2> ArgsA{IndexI32, SrcI32};
+      llvm::Value *ResI32 = B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsA);
+      llvm::Value *Res = coerceFromI32ToType(ResI32, RetTy, SrcQT);
+      return Res;
+    }
+  }
 
-  if (SrcTy->isIntegerTy() && DstTy->isIntegerTy())
-    return Builder.CreateZExtOrTrunc(Val, DstTy);
+  // Fast path B: First-class scalar/vector whose total size is a multiple of 32 bits.
+  // Bitcast to <N x i32>, permute each lane, bitcast back. Register-only; no memory.
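+  // For example, a 64-bit 'long' or 'double' is viewed as <2 x i32>; each
+  // word is permuted with the same byte index and the pair is bitcast back.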
+ if (!IsAggregate) { + llvm::Value *SrcVal = CGF.EmitScalarExpr(Call->getArg(1)); + unsigned totalBits = getBitWidth(SrcVal->getType()); + auto [words, tail] = wordCountAndTail(totalBits); + if (words > 0 && tail == 0) { + llvm::Type *I32VecTy = llvm::FixedVectorType::get(I32, words); + llvm::Value *AsI32Vec = B.CreateBitCast(SrcVal, I32VecTy); + + llvm::Value *ResVec = llvm::UndefValue::get(I32VecTy); + for (unsigned i = 0; i < words; ++i) { + llvm::Value *Lane = B.CreateExtractElement(AsI32Vec, c32(i)); + llvm::SmallVector<llvm::Value *, 2> ArgsB{IndexI32, Lane}; + llvm::Value *Perm = B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsB); + ResVec = B.CreateInsertElement(ResVec, Perm, c32(i)); + } - if (SrcBits == DstBits) - return Builder.CreateBitCast(Val, DstTy); + llvm::Value *Res = B.CreateBitCast(ResVec, RetTy); + return Res; + } + } - llvm::Type *IntSrcTy = Builder.getIntNTy(SrcBits); - llvm::Value *AsInt = Val; - if (SrcTy->isPointerTy()) - AsInt = Builder.CreatePtrToInt(Val, IntSrcTy); - else if (!SrcTy->isIntegerTy()) - AsInt = Builder.CreateBitCast(Val, IntSrcTy); + // General aggregate/odd-size path: + // - Works for structs/arrays/complex and any total size. + // - Materialize source to a temp, process 4-byte words (unaligned loads/stores), + // handle tail bytes by packing/unpacking into an i32, and return loaded Value*. + auto emitAggregatePath = [&]() -> llvm::Value * { + clang::QualType SrcQTLocal = Call->getArg(1)->getType(); + llvm::Type *SrcTy = CGF.ConvertType(SrcQTLocal); - llvm::Type *IntDstTy = Builder.getIntNTy(DstBits); - llvm::Value *Resized = Builder.CreateZExtOrTrunc(AsInt, IntDstTy); + clang::CodeGen::Address SrcAddr = CGF.CreateMemTemp(SrcQTLocal, "dsbperm.src"); + clang::CodeGen::Address DstAddr = CGF.CreateMemTemp(RetQT, "dsbperm.dst"); - if (DstTy->isPointerTy()) - return Builder.CreateIntToPtr(Resized, DstTy); + CGF.EmitAnyExprToMem(Call->getArg(1), SrcAddr, SrcQTLocal.getQualifiers(), /*IsInit*/true); - return Builder.CreateBitCast(Resized, DstTy); - }; + // i8 views of the buffers (as Address). + clang::CodeGen::Address SrcI8Addr = SrcAddr.withElementType(I8); + clang::CodeGen::Address DstI8Addr = DstAddr.withElementType(I8); - llvm::Value *Index = CGF.EmitScalarExpr(Call->getArg(0)); - llvm::Value *Source = CGF.EmitScalarExpr(Call->getArg(1)); + auto CU = [&](uint64_t N) { return clang::CharUnits::fromQuantity(N); }; - llvm::Type *ReturnTy = CGF.ConvertType(Call->getType()); + uint64_t sizeBytes = DL.getTypeAllocSize(SrcTy); + uint64_t words = sizeBytes / 4; + uint64_t tail = sizeBytes % 4; - llvm::Value *IndexI32 = ToI32Bits(Index, Call->getArg(0)->getType()); + for (uint64_t i = 0; i < words; ++i) { + uint64_t off = i * 4; - llvm::Value *SourceForIntrinsic; - llvm::Type *SourceTy = Source->getType(); + // Byte GEP, then retag to i32 for word load/store. 
+ clang::CodeGen::Address SrcWordI8Addr = + B.CreateConstInBoundsByteGEP(SrcI8Addr, CU(off)); + clang::CodeGen::Address DstWordI8Addr = + B.CreateConstInBoundsByteGEP(DstI8Addr, CU(off)); - if (SourceTy->isDoubleTy()) { - llvm::Value *AsFloat = Builder.CreateFPTrunc(Source, Builder.getFloatTy()); - SourceForIntrinsic = Builder.CreateBitCast(AsFloat, I32Ty); - } else - SourceForIntrinsic = ToI32Bits(Source, Call->getArg(1)->getType()); + clang::CodeGen::Address SrcWordI32Addr = + SrcWordI8Addr.withElementType(I32); + clang::CodeGen::Address DstWordI32Addr = + DstWordI8Addr.withElementType(I32); - llvm::Function *IntrinsicFn = - CGM.getIntrinsic(llvm::Intrinsic::amdgcn_ds_bpermute); + auto *Ld = B.CreateLoad(SrcWordI32Addr); - llvm::Value *Result = - Builder.CreateCall(IntrinsicFn, {IndexI32, SourceForIntrinsic}); + llvm::SmallVector<llvm::Value *, 2> ArgsWord{IndexI32, Ld}; + llvm::Value *Perm = B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsWord); - if (ReturnTy->isDoubleTy()) { - llvm::Value *AsFloat = Builder.CreateBitCast(Result, Builder.getFloatTy()); - return Builder.CreateFPExt(AsFloat, ReturnTy); - } + (void)B.CreateStore(Perm, DstWordI32Addr); + } - if (ReturnTy->isIntegerTy() && ReturnTy->getIntegerBitWidth() > 32) { - clang::QualType SourceQt = Call->getArg(1)->getType(); - if (SourceQt->isSignedIntegerType()) - return Builder.CreateSExt(Result, ReturnTy); - else - return Builder.CreateZExt(Result, ReturnTy); - } + if (tail) { + uint64_t off = words * 4; + + llvm::Value *Pack = llvm::ConstantInt::get(I32, 0); + for (uint64_t b = 0; b < tail; ++b) { + clang::CodeGen::Address ByteAddr = + B.CreateConstInBoundsByteGEP(SrcI8Addr, CU(off + b)); + auto *Lb = B.CreateLoad(ByteAddr); + + llvm::Value *Z = B.CreateZExt(Lb, I32); + if (b != 0) + Z = B.CreateShl(Z, c32(8 * b)); + Pack = B.CreateOr(Pack, Z); + } - return BitCoerceTo(Result, ReturnTy); + llvm::SmallVector<llvm::Value *, 2> ArgsTail{IndexI32, Pack}; + llvm::Value *Perm = B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsTail); + + for (uint64_t b = 0; b < tail; ++b) { + llvm::Value *Byte = B.CreateTrunc(B.CreateLShr(Perm, c32(8 * b)), I8); + clang::CodeGen::Address ByteAddr = + B.CreateConstInBoundsByteGEP(DstI8Addr, CU(off + b)); + (void)B.CreateStore(Byte, ByteAddr); + } + } + + // Load the final result from the destination temporary and return it as a Value*. + auto *Res = B.CreateLoad(DstAddr); + return Res; + }; + + return emitAggregatePath(); } + + } // namespace // Generates the IR for __builtin_read_exec_*. 
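A minimal HIP-level sketch of what the chunked lowering above computes for a 64-bit payload (illustrative only, not part of the patch; the helper name is invented): the polymorphic call is equivalent to two independent 32-bit bpermutes over the halves, both using the same byte index.

  #include <hip/hip_runtime.h>

  // By-hand expansion of __builtin_amdgcn_ds_bpermute(index, v) for a
  // 64-bit 'double', mirroring the <2 x i32> register path in this patch.
  __device__ double bpermute_double_by_hand(int index, double v) {
    unsigned long long bits = __builtin_bit_cast(unsigned long long, v);
    unsigned lo = (unsigned)bits;
    unsigned hi = (unsigned)(bits >> 32);
    // Each 32-bit word is permuted independently with the same byte index.
    lo = (unsigned)__builtin_amdgcn_ds_bpermute(index, (int)lo);
    hi = (unsigned)__builtin_amdgcn_ds_bpermute(index, (int)hi);
    bits = ((unsigned long long)hi << 32) | (unsigned long long)lo;
    return __builtin_bit_cast(double, bits);
  }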
>From 5a771d54e71b6ed6cb9138ddbb0ac86a4a958dd1 Mon Sep 17 00:00:00 2001 From: "Yaxun (Sam) Liu" <yaxun....@amd.com> Date: Thu, 21 Aug 2025 21:29:56 -0400 Subject: [PATCH 3/7] update codegen test --- .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 41 ++++++++++++------- 1 file changed, 27 insertions(+), 14 deletions(-) diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip index 00c02ccc7638f..ff5357310f212 100644 --- a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip +++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip @@ -105,30 +105,43 @@ extern "C" __device__ unsigned short test_source_ushort(int a, unsigned short c) } // CHECK-LABEL: define{{.*}}@test_source_long -// CHECK: [[TRUNC:%.*]] = trunc i64 %1 to i32 -// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[TRUNC]]) -// CHECK: [[SEXT:%.*]] = sext i32 [[CALL]] to i64 -// CHECK: ret i64 [[SEXT]] +// CHECK: [[BC:%.*]] = bitcast i64 {{.*}} to <2 x i32> +// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0 +// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1 +// CHECK: [[RES:%.*]] = bitcast <2 x i32> [[V1]] to i64 +// CHECK: ret i64 [[RES]] extern "C" __device__ long test_source_long(int a, long c) { return __builtin_amdgcn_ds_bpermute(a, c); } // CHECK-LABEL: define{{.*}}@test_source_ulong -// CHECK: [[TRUNC:%.*]] = trunc i64 %1 to i32 -// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[TRUNC]]) -// CHECK: [[ZEXT:%.*]] = zext i32 [[CALL]] to i64 -// CHECK: ret i64 [[ZEXT]] +// CHECK: [[BC:%.*]] = bitcast i64 {{.*}} to <2 x i32> +// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0 +// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1 +// CHECK: [[RES:%.*]] = bitcast <2 x i32> [[V1]] to i64 +// CHECK: ret i64 [[RES]] extern "C" __device__ unsigned long test_source_ulong(int a, unsigned long c) { return __builtin_amdgcn_ds_bpermute(a, c); } // CHECK-LABEL: define{{.*}}@test_source_double -// CHECK: [[FPTRUNC:%.*]] = fptrunc contract double %1 to float -// CHECK: [[BITCAST:%.*]] = bitcast float [[FPTRUNC]] to i32 -// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) -// CHECK: [[BITCAST2:%.*]] = bitcast i32 [[CALL]] to float -// CHECK: [[FPEXT:%.*]] = fpext contract float [[BITCAST2]] to double -// CHECK: ret double [[FPEXT]] +// CHECK: [[BC:%.*]] = bitcast double {{.*}} to <2 x i32> +// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0 +// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1 +// CHECK: [[RES:%.*]] = bitcast <2 x i32> 
[[V1]] to double +// CHECK: ret double [[RES]] extern "C" __device__ double test_source_double(int a, double c) { return __builtin_amdgcn_ds_bpermute(a, c); } >From f2a9f0717ba0ab28ef07dacc94add863e8b4cfdb Mon Sep 17 00:00:00 2001 From: "Yaxun (Sam) Liu" <yaxun....@amd.com> Date: Thu, 21 Aug 2025 21:53:41 -0400 Subject: [PATCH 4/7] remove warning about source arg truncation --- clang/lib/Sema/SemaAMDGPU.cpp | 7 +------ clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip | 6 +++++- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp index e29e8e7fbc3d4..5780fe62dc6b9 100644 --- a/clang/lib/Sema/SemaAMDGPU.cpp +++ b/clang/lib/Sema/SemaAMDGPU.cpp @@ -86,12 +86,7 @@ bool handle32BitPayloadArg(Sema &SemaRef, CallExpr *TheCall, unsigned ArgIndex, << BuiltinName << 32 << Type << Size << Arg->getSourceRange(); return true; } - } else if (Type->isScalarType()) { - uint64_t Size = AstContext.getTypeSize(Type); - if (Size > 32) - SemaRef.Diag(Arg->getBeginLoc(), diag::warn_amdgcn_builtin_arg_truncation) - << Size << 32 << BuiltinName << Arg->getSourceRange(); - } else { + } else if (!Type->isScalarType()) { // Prefer user-defined conversion operators that yield a 32-bit-compatible // type. If none apply, fall back to an implicit int32 conversion. if (!tryUserDefinedConversion32Bit(SemaRef, Arg, TheCall, ArgIndex)) { diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip index 937d28a117af8..78912b56f709e 100644 --- a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip +++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip @@ -16,12 +16,16 @@ __device__ void test_invalid_index(short2 a, int b) { __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'short2' (vector of 2 'short' values) to incompatible type 'int'}} } +__device__ void test_warn_long_index(long a, int b) { + __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit argument will be truncated to 32 bits in this call to __builtin_amdgcn_ds_bpermute}} +} + __device__ void test_invalid_vector_src(int a, int2 b) { __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 bits or less, but type 'int2' (vector of 2 'int' values) has a size of 64 bits}} } __device__ void test_warn_long_src(int a, long b) { - __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit argument will be truncated to 32 bits in this call to __builtin_amdgcn_ds_bpermute}} + __builtin_amdgcn_ds_bpermute(a, b); } __device__ void test_warn_pointer_src(int a, void* b) { >From 9d2033b56c5e9a7aa4a460f9d48196516b000cfc Mon Sep 17 00:00:00 2001 From: "Yaxun (Sam) Liu" <yaxun....@amd.com> Date: Thu, 21 Aug 2025 22:37:29 -0400 Subject: [PATCH 5/7] handle 64 bit pointer --- clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp | 19 +++++++++++++++---- .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 16 ++++++++++++++++ .../SemaHIP/builtin-amdgcn-ds-bpermute.hip | 10 +++++----- 3 files changed, 36 insertions(+), 9 deletions(-) diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index e61db0bffec47..09e1311c4a023 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -311,17 +311,28 @@ emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF, auto [words, tail] = wordCountAndTail(totalBits); if (words > 0 && tail == 0) { llvm::Type 
*I32VecTy = llvm::FixedVectorType::get(I32, words); - llvm::Value *AsI32Vec = B.CreateBitCast(SrcVal, I32VecTy); + + // Handle pointers by going through intptr first + llvm::Value *AsIntN = SrcVal; + if (SrcVal->getType()->isPointerTy()) { + unsigned PW = DL.getPointerSizeInBits(SrcVal->getType()->getPointerAddressSpace()); + AsIntN = B.CreatePtrToInt(SrcVal, B.getIntNTy(PW)); + } + + llvm::Value *AsI32Vec = B.CreateBitCast(AsIntN, I32VecTy); llvm::Value *ResVec = llvm::UndefValue::get(I32VecTy); for (unsigned i = 0; i < words; ++i) { llvm::Value *Lane = B.CreateExtractElement(AsI32Vec, c32(i)); - llvm::SmallVector<llvm::Value *, 2> ArgsB{IndexI32, Lane}; - llvm::Value *Perm = B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsB); + llvm::Value *Perm = B.CreateCall(Bperm->getFunctionType(), Bperm, {IndexI32, Lane}); ResVec = B.CreateInsertElement(ResVec, Perm, c32(i)); } - llvm::Value *Res = B.CreateBitCast(ResVec, RetTy); + llvm::Value *ResIntN = B.CreateBitCast(ResVec, AsIntN->getType()); + llvm::Value *Res = ResIntN; + if (RetTy->isPointerTy()) + Res = B.CreateIntToPtr(ResIntN, RetTy); + return Res; } } diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip index ff5357310f212..f588284bf75db 100644 --- a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip +++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip @@ -146,6 +146,22 @@ extern "C" __device__ double test_source_double(int a, double c) { return __builtin_amdgcn_ds_bpermute(a, c); } +// CHECK-LABEL: define{{.*}}@test_source_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr {{.*}} to i64 +// CHECK: [[BC:%.*]] = bitcast i64 [[P2I]] to <2 x i32> +// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0 +// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1 +// CHECK: [[I64RES:%.*]] = bitcast <2 x i32> [[V1]] to i64 +// CHECK: [[PRES:%.*]] = inttoptr i64 [[I64RES]] to ptr +// CHECK: ret ptr [[PRES]] +extern "C" __device__ void* test_source_ptr(int a, void* c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + // CHECK-LABEL: define{{.*}}@test_template_float_src // CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32 // CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip index 78912b56f709e..9adf23fb9692a 100644 --- a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip +++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip @@ -20,16 +20,16 @@ __device__ void test_warn_long_index(long a, int b) { __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit argument will be truncated to 32 bits in this call to __builtin_amdgcn_ds_bpermute}} } -__device__ void test_invalid_vector_src(int a, int2 b) { - __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 bits or less, but type 'int2' (vector of 2 'int' values) has a size of 64 bits}} +__device__ void test_vector_src(int a, int2 b) { + __builtin_amdgcn_ds_bpermute(a, b); } -__device__ void test_warn_long_src(int a, long b) { +__device__ void test_long_src(int a, long b) { 
__builtin_amdgcn_ds_bpermute(a, b); } -__device__ void test_warn_pointer_src(int a, void* b) { - __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{the vector or pointer argument to __builtin_amdgcn_ds_bpermute must have a total size of 32 bits or less, but type 'void *' has a size of 64 bits}} +__device__ void test_pointer_src(int a, void* b) { + __builtin_amdgcn_ds_bpermute(a, b); } __device__ void test_invalid_struct_src(int a, A b) { >From b72bf35f7562f78bab59b539eda282ae78625e18 Mon Sep 17 00:00:00 2001 From: "Yaxun (Sam) Liu" <yaxun....@amd.com> Date: Thu, 21 Aug 2025 23:17:20 -0400 Subject: [PATCH 6/7] allow builtin type wider than 32 bit --- clang/lib/Sema/SemaAMDGPU.cpp | 11 ++-------- .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 21 +++++++++++++++++++ 2 files changed, 23 insertions(+), 9 deletions(-) diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp index 5780fe62dc6b9..da40a9a68583f 100644 --- a/clang/lib/Sema/SemaAMDGPU.cpp +++ b/clang/lib/Sema/SemaAMDGPU.cpp @@ -78,15 +78,8 @@ bool handle32BitPayloadArg(Sema &SemaRef, CallExpr *TheCall, unsigned ArgIndex, QualType Type = Arg->getType(); QualType Int32Ty = AstContext.IntTy; - if (Type->isVectorType() || Type->isPointerType()) { - uint64_t Size = AstContext.getTypeSize(Type); - if (Size > 32) { - SemaRef.Diag(Arg->getBeginLoc(), - diag::err_amdgcn_builtin_vector_pointer_arg_size) - << BuiltinName << 32 << Type << Size << Arg->getSourceRange(); - return true; - } - } else if (!Type->isScalarType()) { + if (!Type->isVectorType() && !Type->isPointerType() + && !Type->isScalarType()) { // Prefer user-defined conversion operators that yield a 32-bit-compatible // type. If none apply, fall back to an implicit int32 conversion. if (!tryUserDefinedConversion32Bit(SemaRef, Arg, TheCall, ArgIndex)) { diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip index f588284bf75db..8c297eee8c101 100644 --- a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip +++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip @@ -3,6 +3,7 @@ // RUN: -fcuda-is-device -emit-llvm -o - %s | FileCheck %s typedef short short2 __attribute__((vector_size(4))); +typedef double double2 __attribute__((ext_vector_type(2))); #define __device__ __attribute__((device)) @@ -162,6 +163,26 @@ extern "C" __device__ void* test_source_ptr(int a, void* c) { return __builtin_amdgcn_ds_bpermute(a, c); } +// CHECK-LABEL: define{{.*}}@test_source_double2 +// CHECK: [[BC:%.*]] = bitcast <2 x double> {{.*}} to <4 x i32> +// CHECK: [[E0:%.*]] = extractelement <4 x i32> [[BC]], i32 0 +// CHECK: [[R0:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E0]]) +// CHECK: [[V0:%.*]] = insertelement <4 x i32> undef, i32 [[R0]], i32 0 +// CHECK: [[E1:%.*]] = extractelement <4 x i32> [[BC]], i32 1 +// CHECK: [[R1:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E1]]) +// CHECK: [[V1:%.*]] = insertelement <4 x i32> [[V0]], i32 [[R1]], i32 1 +// CHECK: [[E2:%.*]] = extractelement <4 x i32> [[BC]], i32 2 +// CHECK: [[R2:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E2]]) +// CHECK: [[V2:%.*]] = insertelement <4 x i32> [[V1]], i32 [[R2]], i32 2 +// CHECK: [[E3:%.*]] = extractelement <4 x i32> [[BC]], i32 3 +// CHECK: [[R3:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E3]]) +// CHECK: [[V3:%.*]] = insertelement <4 x i32> [[V2]], i32 [[R3]], i32 3 +// CHECK: [[RES:%.*]] = bitcast <4 x i32> [[V3]] to <2 x double> +// CHECK: ret <2 x double> [[RES]] +extern "C" 
>From 4a6867340ae00548fc4c690ae390e1f1223d15c0 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun....@amd.com>
Date: Fri, 22 Aug 2025 01:18:20 -0400
Subject: [PATCH 7/7] Attempt to support arbitrary aggregates

---
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |  6 +-
 clang/lib/Sema/SemaAMDGPU.cpp                 | 11 ++--
 .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 64 +++++++++++++++----
 .../SemaHIP/builtin-amdgcn-ds-bpermute.hip    |  2 +-
 4 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 09e1311c4a023..8b7e419a1a602 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -409,7 +409,11 @@ emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF,
   }
 
   // Load the final result from the destination temporary and return it as a
   // Value*.
-  auto *Res = B.CreateLoad(DstAddr);
+  llvm::Value *Res = B.CreateLoad(DstAddr);
+  // For aggregates (struct/array/union), ensure determinism by freezing the value.
+  // freeze turns any undef/poison in padding into a fixed but arbitrary value.
+  if (Res->getType()->isAggregateType())
+    Res = B.CreateFreeze(Res);
   return Res;
 };
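In source terms the freeze matters because the word-by-word copy drags struct padding along, and padding bytes are undef at the IR level. A sketch of the hazard, using an illustrative type that is not part of the tests:

  // Padded has 3 padding bytes between c and i. Those bytes travel through
  // the permute with the real fields; without the freeze, each use of the
  // returned aggregate could observe a different bit pattern in the padding.
  struct Padded { char c; int i; };

  __device__ int sum_fields(int idx, Padded p) {
    Padded r = __builtin_amdgcn_ds_bpermute(idx, p);
    return r.c + r.i;
  }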
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index da40a9a68583f..aeddcf4596e66 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -77,7 +77,7 @@ bool handle32BitPayloadArg(Sema &SemaRef, CallExpr *TheCall, unsigned ArgIndex,
   Expr *Arg = TheCall->getArg(ArgIndex);
   QualType Type = Arg->getType();
   QualType Int32Ty = AstContext.IntTy;
-
+#if 0
   if (!Type->isVectorType() && !Type->isPointerType() &&
       !Type->isScalarType()) {
     // Prefer user-defined conversion operators that yield a 32-bit-compatible
@@ -85,12 +85,13 @@ bool handle32BitPayloadArg(Sema &SemaRef, CallExpr *TheCall, unsigned ArgIndex,
     if (!tryUserDefinedConversion32Bit(SemaRef, Arg, TheCall, ArgIndex)) {
       ExprResult ConvResult = SemaRef.PerformImplicitConversion(
           Arg, Int32Ty, AssignmentAction::Converting);
-      if (ConvResult.isInvalid())
-        return true;
-      TheCall->setArg(ArgIndex, ConvResult.get());
+      if (!ConvResult.isInvalid()) {
+        TheCall->setArg(ArgIndex, ConvResult.get());
+        return false;
+      }
     }
   }
-
+#endif
   TheCall->setType(TheCall->getArg(ArgIndex)->getType());
   return false;
 }
diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
index 8c297eee8c101..f2d4af670796c 100644
--- a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
+++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
@@ -5,6 +5,11 @@ typedef short short2 __attribute__((vector_size(4)));
 typedef double double2 __attribute__((ext_vector_type(2)));
 
+struct Inner { short a; char b; };
+struct Outer { int x; struct Inner y; char z; };
+
+union U { int i; char c; };
+
 #define __device__ __attribute__((device))
 
 // CHECK-LABEL: define{{.*}}@test_index_i32
@@ -197,17 +202,50 @@ extern "C" __device__ float test_template_float_src(int a, float c) {
   return test_template_src<float>(a, c);
 }
 
-// CHECK-LABEL: define{{.*}}@test_source_struct_float
-// CHECK: [[CALL:%.*]] = call {{.*}} float @_ZNK16FloatConvertiblecvfEv(
-// CHECK: [[CONV:%.*]] = bitcast float [[CALL]] to i32
-// CHECK: [[RESULT:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[CONV]])
-// CHECK: [[CONV2:%.*]] = bitcast i32 [[RESULT]] to float
-// CHECK: ret float [[CONV2]]
-struct FloatConvertible {
-  float value;
-  __device__ operator float() const { return value; }
-};
-
-extern "C" __device__ float test_source_struct_float(int a, FloatConvertible c) {
-  return __builtin_amdgcn_ds_bpermute(a, c);
+// CHECK-LABEL: define{{.*}}@test_source_nested(
+// CHECK: entry:
+// CHECK: %retval = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK: %src = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK: %idx.addr = alloca i32, align 4, addrspace(5)
+// CHECK: %dsbperm.src = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK: %dsbperm.dst = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK: %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
+// CHECK: %src1 = addrspacecast ptr addrspace(5) %src to ptr
+// CHECK: %idx.addr.ascast = addrspacecast ptr addrspace(5) %idx.addr to ptr
+// CHECK: %dsbperm.src.ascast = addrspacecast ptr addrspace(5) %dsbperm.src to ptr
+// CHECK: %dsbperm.dst.ascast = addrspacecast ptr addrspace(5) %dsbperm.dst to ptr
+// CHECK: store i32 %idx, ptr %idx.addr.ascast, align 4
+// CHECK: %[[IDX:.*]] = load i32, ptr %idx.addr.ascast, align 4
+// CHECK: call void @llvm.memcpy.p0.p0.i64(ptr align 4 %dsbperm.src.ascast, ptr align 4 %src1, i64 12, i1 false)
+
+// First 4-byte word at offset 0
+// CHECK: %[[SRC0:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 0
+// CHECK: %[[DST0:.*]] = getelementptr inbounds i8, ptr %dsbperm.dst.ascast, i64 0
+// CHECK: %[[LD0:.*]] = load i32, ptr %[[SRC0]], align 4
+// CHECK: %[[P0:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD0]])
+// CHECK: store i32 %[[P0]], ptr %[[DST0]], align 4
+
+// Second 4-byte word at offset 4
+// CHECK: %[[SRC1:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 4
+// CHECK: %[[DST1:.*]] = getelementptr inbounds i8, ptr %dsbperm.dst.ascast, i64 4
+// CHECK: %[[LD1:.*]] = load i32, ptr %[[SRC1]], align 4
+// CHECK: %[[P1:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD1]])
+// CHECK: store i32 %[[P1]], ptr %[[DST1]], align 4
+
+// Third 4-byte word at offset 8 (size is 12 bytes total)
+// CHECK: %[[SRC2:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 8
+// CHECK: %[[DST2:.*]] = getelementptr inbounds i8, ptr %dsbperm.dst.ascast, i64 8
+// CHECK: %[[LD2:.*]] = load i32, ptr %[[SRC2]], align 4
+// CHECK: %[[P2:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD2]])
+// CHECK: store i32 %[[P2]], ptr %[[DST2]], align 4
+
+// Load result and return
+// CHECK: %[[RES:.*]] = load %struct.Outer, ptr %dsbperm.dst.ascast, align 4
+// CHECK: ret %struct.Outer
+extern "C" __device__ Outer test_source_nested(int idx, Outer src) {
+  return __builtin_amdgcn_ds_bpermute(idx, src);
+}
+
+extern "C" __device__ U test_source_union(int idx, U src) {
+  return __builtin_amdgcn_ds_bpermute(idx, src);
 }
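As a mental model for the unrolled CHECK sequence above, the aggregate path behaves like the following hand-written helper. Sketch only: the helper name is hypothetical, and it assumes T is trivially copyable with sizeof(T) a multiple of 4, as holds for the 12-byte Outer:

  // Copy the payload into 32-bit words, permute each word with the same
  // index, then copy the words back out as a T.
  template <typename T>
  __device__ T bpermute_by_words(int idx, T src) {
    int words[sizeof(T) / 4];
    __builtin_memcpy(words, &src, sizeof(T));
    for (unsigned i = 0; i != sizeof(T) / 4; ++i)
      words[i] = __builtin_amdgcn_ds_bpermute(idx, words[i]);
    T dst;
    __builtin_memcpy(&dst, words, sizeof(T));
    return dst;
  }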
diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
index 9adf23fb9692a..e44245f5ec408 100644
--- a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
+++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
@@ -33,5 +33,5 @@ __device__ void test_pointer_src(int a, void* b) {
 }
 
 __device__ void test_invalid_struct_src(int a, A b) {
-  __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'A' to incompatible type 'int'}}
+  __builtin_amdgcn_ds_bpermute(a, b);
 }