https://github.com/yxsamliu updated https://github.com/llvm/llvm-project/pull/153501

>From ed45edfdf6b027e9d2fa873e2242e4f3ff60157c Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun....@amd.com>
Date: Sat, 23 Aug 2025 14:55:50 -0400
Subject: [PATCH] [AMDGPU] Extend __builtin_amdgcn_ds_bpermute argument types

This change makes __builtin_amdgcn_ds_bpermute polymorphic over arbitrary
trivially copyable types: the source (second) argument may be any such type,
and the return type matches it. CodeGen treats the value as a sequence of
32-bit words: each word is permuted independently, and any tail bytes are
packed into a final word, permuted once, and unpacked. Integers honor
signedness, pointers go through ptrtoint/inttoptr, and other non-integer
types are bitcast per DataLayout. The implementation establishes a general
CodeGen approach for byte-wise lane operations over arbitrarily sized and
typed payloads, intended to be reusable for other byte-wise Clang builtins.

This patch also introduces a new AMDGPUSupport.rst section documenting Clang's
AMDGPU-specific builtins, starting with __builtin_amdgcn_ds_bpermute. As demand
for AMDGPU builtin documentation grows, the goal is to provide clear,
centralized reference material for developers writing performance-critical GPU
kernels; future updates will extend this section with additional builtins,
their semantics, usage patterns, and target requirements.
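
For illustration, a minimal device-code sketch of the polymorphic form (the
struct and function names are hypothetical, HIP's threadIdx/warpSize are
assumed to be in scope, and this is not code from the patch):

  struct Pixel { float r, g, b; };                    // trivially copyable, 12 bytes

  __device__ Pixel rotate_left(Pixel p) {
    int lane = threadIdx.x % warpSize;
    int src  = (lane + 1) % warpSize;                 // pull from the next lane
    return __builtin_amdgcn_ds_bpermute(src * 4, p);  // byte index = lane * 4
  }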
---
 clang/docs/AMDGPUSupport.rst                  |  91 +++++
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   2 +-
 .../clang/Basic/DiagnosticSemaKinds.td        |   4 +
 clang/lib/CodeGen/CGBuiltin.cpp               |  13 +-
 clang/lib/CodeGen/CodeGenFunction.h           |   3 +-
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   | 245 ++++++++++++-
 clang/lib/Sema/SemaAMDGPU.cpp                 |  44 +++
 clang/lib/Sema/SemaChecking.cpp               |   3 +-
 .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 347 ++++++++++++++++++
 .../SemaHIP/builtin-amdgcn-ds-bpermute.hip    |  37 ++
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td      |   8 +-
 11 files changed, 785 insertions(+), 12 deletions(-)
 create mode 100644 clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
 create mode 100644 clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip

diff --git a/clang/docs/AMDGPUSupport.rst b/clang/docs/AMDGPUSupport.rst
index 3eada5f900613..3124d71c13a79 100644
--- a/clang/docs/AMDGPUSupport.rst
+++ b/clang/docs/AMDGPUSupport.rst
@@ -61,3 +61,94 @@ Predefined Macros
      - Defined if FP64 instruction is available (deprecated).
 
 Please note that the specific architecture and feature names will vary depending on the GPU. Also, some macros are deprecated and may be removed in future releases.
+
+AMDGPU Builtins
+===============
+
+Clang provides a set of builtins to access low-level, AMDGPU-specific hardware
+features directly from C, C++, OpenCL C, and HIP. These builtins often map
+directly to a single machine instruction.
+
+.. _builtin-amdgcn-ds-bpermute:
+
+``__builtin_amdgcn_ds_bpermute``
+--------------------------------
+
+Performs a backward (pull) permutation of values within a wavefront. This builtin compiles to the
+``ds_bpermute_b32`` instruction and implements a "read from lane" semantic using a **byte-based**
+address.
+**Syntax**
+
+.. code-block:: c++
+
+  T __builtin_amdgcn_ds_bpermute(int index, T src);
+
+**Summary**
+
+All active lanes in the current wavefront conceptually place their ``src`` payloads into an
+internal cross-lane buffer. Each lane then reads a 32-bit value from that buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return type.
+
+Availability
+^^^^^^^^^^^^
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+^^^^^^^^^^
+
+- ``index`` (``int``): Byte offset used to select the source lane. Hardware only consumes bits
+  ``[7:2]``. To read the 32-bit value from lane *i*, pass ``i * 4`` as the index.
+  Indices that select lanes outside the current wave size or lanes that are inactive at the call
+  site yield an unspecified value (commonly zero on current hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. Supported ``T`` includes scalar
+  integer and floating-point types, pointers, vectors, aggregates (structs/unions/complex),
+  and trivially copyable C++ classes. The operation is purely byte-wise and trivially copies
+  memory representations; it does not invoke or respect C++ copy/move constructors or
+  assignment operators. For types whose total size is greater than 32 bits, the value is
+  treated as a sequence of 4-byte words (plus up to 3 tail bytes). Each 4-byte word is
+  independently permuted via ``ds_bpermute_b32`` using the same index. Any tail bytes are
+  packed into a 32-bit word, permuted once, and then unpacked. The result is reassembled
+  and returned as type ``T`` (see the sketch below).
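+
+As an illustrative sketch (the struct name here is hypothetical, not part of
+the builtin's interface), a 12-byte aggregate is exchanged as three independent
+word permutes:
+
+.. code-block:: c++
+
+  struct RGB { float r, g, b; };  // 12 bytes = 3 words, no tail
+
+  __device__ RGB read_from_lane(int lane, RGB v) {
+    // Lowered to three ds_bpermute_b32 calls, one per 4-byte word.
+    return __builtin_amdgcn_ds_bpermute(lane * 4, v);
+  }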
+
+Semantics and Guarantees
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+* **Active lane participation**: Only lanes active in the EXEC mask at the call site
+  contribute a payload. Reading from an inactive source lane produces an unspecified value.
+
+* **Index per lane**: ``index`` may vary across lanes. Only bits ``[7:2]`` are used for lane
+  selection. Bits outside this range are ignored by hardware.
+
+* **No synchronization**: The builtin does not synchronize lanes or waves and does not
+  order memory operations. It does not read or write user-visible LDS.
+
+* **Wave size**: Valid source lanes are ``0 .. warpSize-1`` (use ``warpSize`` or an
+  equivalent query to distinguish wave32 from wave64). Selecting lanes outside that
+  range yields an unspecified value.
+
+Examples
+^^^^^^^^
+
+Reverse within a wavefront (handles wave32 or wave64):
+
+.. code-block:: c++
+
+  #include <hip/hip_runtime.h>
+
+  __global__ void wavefront_reverse(float* data, int n) {
+    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
+    if (tid >= n) return;
+
+    int lane    = threadIdx.x % warpSize;            // works for 32 or 64
+    int peer    = (warpSize - 1) - lane;             // reversed lane
+    int offset  = peer * 4;                          // byte address
+
+    float my_val      = data[tid];
+    float reversed    = __builtin_amdgcn_ds_bpermute(offset, my_val);
+    data[tid] = reversed;
+  }
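+
+Broadcast the value held by lane 0 to every lane of the wave (a minimal
+sketch; the kernel name and payload are illustrative):
+
+.. code-block:: c++
+
+  __global__ void broadcast_lane0(int* out) {
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int v   = tid;                                 // per-lane payload
+    // Every lane reads the word contributed by lane 0 (byte index 0).
+    out[tid] = __builtin_amdgcn_ds_bpermute(0, v);
+  }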
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index f8f55772db8fe..8c1011d7bba61 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -121,7 +121,7 @@ BUILTIN(__builtin_amdgcn_s_decperflevel, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_setprio, "vIs", "n")
 BUILTIN(__builtin_amdgcn_ds_swizzle, "iiIi", "nc")
 BUILTIN(__builtin_amdgcn_ds_permute, "iii", "nc")
-BUILTIN(__builtin_amdgcn_ds_bpermute, "iii", "nc")
+BUILTIN(__builtin_amdgcn_ds_bpermute, "v.", "nct")
 BUILTIN(__builtin_amdgcn_readfirstlane, "ii", "nc")
 BUILTIN(__builtin_amdgcn_readlane, "iii", "nc")
 BUILTIN(__builtin_amdgcn_fmed3f, "ffff", "nc")
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 4c98e0f8b04ed..f7e22f0ce954c 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -13594,4 +13594,8 @@ def warn_acc_var_referenced_lacks_op
 // AMDGCN builtins diagnostics
 def err_amdgcn_load_lds_size_invalid_value : Error<"invalid size value">;
 def note_amdgcn_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">;
+def warn_amdgcn_builtin_arg_truncation
+    : Warning<"the %0-bit argument will be truncated to %1 bits in this call "
+              "to %2">,
+      InGroup<DiagGroup<"amdgcn-builtin-arg-trunc">>;
 } // end of sema component.
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index d9cc37d123fb4..dc6a3057349de 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -108,7 +108,7 @@ static Value *EmitTargetArchBuiltinExpr(CodeGenFunction *CGF,
     return CGF->EmitPPCBuiltinExpr(BuiltinID, E);
   case llvm::Triple::r600:
   case llvm::Triple::amdgcn:
-    return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E);
+    return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E, ReturnValue);
   case llvm::Triple::systemz:
     return CGF->EmitSystemZBuiltinExpr(BuiltinID, E);
   case llvm::Triple::nvptx:
@@ -125,7 +125,7 @@ static Value *EmitTargetArchBuiltinExpr(CodeGenFunction *CGF,
   case llvm::Triple::spirv32:
   case llvm::Triple::spirv64:
     if (CGF->getTarget().getTriple().getOS() == llvm::Triple::OSType::AMDHSA)
-      return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E);
+      return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E, ReturnValue);
     [[fallthrough]];
   case llvm::Triple::spirv:
     return CGF->EmitSPIRVBuiltinExpr(BuiltinID, E);
@@ -6500,7 +6500,8 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
   // ReturnValue to be non-null, so that the target-specific emission code can
   // always just emit into it.
   TypeEvaluationKind EvalKind = getEvaluationKind(E->getType());
-  if (EvalKind == TEK_Aggregate && ReturnValue.isNull()) {
+  if ((EvalKind == TEK_Aggregate || EvalKind == TEK_Complex) &&
+      ReturnValue.isNull()) {
     Address DestPtr = CreateMemTemp(E->getType(), "agg.tmp");
     ReturnValue = ReturnValueSlot(DestPtr, false);
   }
@@ -6516,7 +6517,11 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
       return RValue::getAggregate(ReturnValue.getAddress(),
                                   ReturnValue.isVolatile());
     case TEK_Complex:
-      llvm_unreachable("No current target builtin returns complex");
+      // Build an LValue for the provided return slot and load the complex
+      // result.
+      LValue LV = MakeAddrLValue(ReturnValue.getAddress(), E->getType());
+      ComplexPairTy C = EmitLoadOfComplex(LV, E->getExprLoc());
+      return RValue::getComplex(C);
     }
     llvm_unreachable("Bad evaluation kind in EmitBuiltinExpr");
   }
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index fc65199a0f154..327b03804a2ef 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -4850,7 +4850,8 @@ class CodeGenFunction : public CodeGenTypeCache {
   llvm::Value *BuildVector(ArrayRef<llvm::Value *> Ops);
   llvm::Value *EmitX86BuiltinExpr(unsigned BuiltinID, const CallExpr *E);
   llvm::Value *EmitPPCBuiltinExpr(unsigned BuiltinID, const CallExpr *E);
-  llvm::Value *EmitAMDGPUBuiltinExpr(unsigned BuiltinID, const CallExpr *E);
+  llvm::Value *EmitAMDGPUBuiltinExpr(unsigned BuiltinID, const CallExpr *E,
+                                     ReturnValueSlot ReturnValue);
   llvm::Value *EmitHLSLBuiltinExpr(unsigned BuiltinID, const CallExpr *E,
                                    ReturnValueSlot ReturnValue);
 
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index dad1f95ac710d..809cd1dee8ef8 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -159,6 +159,244 @@ Value *EmitAMDGPUGridSize(CodeGenFunction &CGF, unsigned Index) {
                   llvm::MDNode::get(CGF.getLLVMContext(), {}));
   return LD;
 }
+
+// Emits LLVM IR to lower a generic AMDGCN ds_bpermute over arbitrary payload
+// types. Assumes DataLayout is accurate; index is coerced to i32; payload is
+// split/coerced to 32-bit words.
+llvm::Value *emitAMDGCNDsBpermute(clang::CodeGen::CodeGenFunction &CGF,
+                                  const clang::CallExpr *Call,
+                                  ReturnValueSlot Dest) {
+  auto &B = CGF.Builder;
+  auto &CGM = CGF.CGM;
+  const llvm::DataLayout &DL = CGM.getDataLayout();
+
+  llvm::Type *I8 = B.getInt8Ty();
+  llvm::Type *I32 = B.getInt32Ty();
+  llvm::Type *I64 = B.getInt64Ty();
+
+  auto C32 = [&](uint32_t V) { return llvm::ConstantInt::get(I32, V); };
+  auto C64 = [&](uint64_t V) { return llvm::ConstantInt::get(I64, V); };
+
+  // Size/bitwidth and coercion helpers for arbitrary first-class types.
+  auto GetBitWidth = [&](llvm::Type *Ty) -> unsigned {
+    return DL.getTypeSizeInBits(Ty).getFixedValue();
+  };
+
+  auto ToI32Index = [&](llvm::Value *IdxVal,
+                        clang::QualType IdxQT) -> llvm::Value * {
+    (void)IdxQT;
+    llvm::Type *Ty = IdxVal->getType();
+    if (Ty->isIntegerTy())
+      return B.CreateZExtOrTrunc(IdxVal, I32);
+    if (Ty->isPointerTy()) {
+      unsigned PtrBits = DL.getPointerSizeInBits(Ty->getPointerAddressSpace());
+      return B.CreateZExtOrTrunc(B.CreatePtrToInt(IdxVal, B.getIntNTy(PtrBits)),
+                                 I32);
+    }
+    unsigned Bits = GetBitWidth(Ty);
+    return B.CreateZExtOrTrunc(B.CreateBitCast(IdxVal, B.getIntNTy(Bits)), I32);
+  };
+
+  auto CoercePayloadToI32 = [&](llvm::Value *Val,
+                                clang::QualType SrcQT) -> llvm::Value * {
+    llvm::Type *Ty = Val->getType();
+    if (Ty->isIntegerTy()) {
+      unsigned BW = Ty->getIntegerBitWidth();
+      if (BW < 32) {
+        if (SrcQT->isSignedIntegerType())
+          return B.CreateSExt(Val, I32);
+        return B.CreateZExt(Val, I32);
+      }
+      return B.CreateZExtOrTrunc(Val, I32);
+    }
+    if (Ty->isPointerTy()) {
+      unsigned PtrBits = DL.getPointerSizeInBits(Ty->getPointerAddressSpace());
+      return B.CreateZExtOrTrunc(B.CreatePtrToInt(Val, B.getIntNTy(PtrBits)),
+                                 I32);
+    }
+    unsigned Bits = GetBitWidth(Ty);
+    return B.CreateZExtOrTrunc(B.CreateBitCast(Val, B.getIntNTy(Bits)), I32);
+  };
+
+  auto CoerceFromI32ToType = [&](llvm::Value *I32Val, llvm::Type *DstTy,
+                                 clang::QualType SrcQT) -> llvm::Value * {
+    if (DstTy->isIntegerTy()) {
+      unsigned DW = DstTy->getIntegerBitWidth();
+      if (DW < 32) {
+        if (SrcQT->isSignedIntegerType())
+          return B.CreateTrunc(B.CreateSExt(I32Val, B.getIntNTy(32)), DstTy);
+        return B.CreateTrunc(B.CreateZExt(I32Val, B.getIntNTy(32)), DstTy);
+      }
+      if (DW == 32)
+        return B.CreateZExtOrTrunc(I32Val, DstTy);
+    }
+    if (DstTy->isPointerTy()) {
+      unsigned PW = DL.getPointerSizeInBits(DstTy->getPointerAddressSpace());
+      llvm::Value *AsInt = I32Val;
+      if (PW != 32)
+        AsInt = B.CreateZExtOrTrunc(I32Val, B.getIntNTy(PW));
+      return B.CreateIntToPtr(AsInt, DstTy);
+    }
+    unsigned BW = GetBitWidth(DstTy);
+    if (BW == 32)
+      return B.CreateBitCast(I32Val, DstTy);
+    llvm::Type *IntBW = B.getIntNTy(BW);
+    llvm::Value *Tr = I32Val;
+    if (BW < 32)
+      Tr = B.CreateTrunc(I32Val, IntBW);
+    else if (BW > 32)
+      Tr = B.CreateZExt(I32Val, IntBW);
+    return B.CreateBitCast(Tr, DstTy);
+  };
+
+  auto WordCountAndTail =
+      [&](unsigned TotalBits) -> std::pair<unsigned, unsigned> {
+    unsigned Bytes = TotalBits / 8;
+    return {Bytes / 4, Bytes % 4};
+  };
+
+  llvm::Value *IndexI32 = ToI32Index(CGF.EmitScalarExpr(Call->getArg(0)),
+                                     Call->getArg(0)->getType());
+
+  llvm::Function *Bperm = CGM.getIntrinsic(llvm::Intrinsic::amdgcn_ds_bpermute);
+
+  clang::QualType RetQT = Call->getType();
+  clang::QualType SrcQT = Call->getArg(1)->getType();
+  llvm::Type *RetTy = CGF.ConvertType(RetQT);
+
+  bool IsAggregate = RetQT->isAggregateType() || RetQT->isAnyComplexType();
+
+  // Fast path: <=32-bit scalar payloads kept entirely in registers.
+  if (!IsAggregate) {
+    llvm::Value *SrcVal = CGF.EmitScalarExpr(Call->getArg(1));
+    unsigned TotalBits = GetBitWidth(SrcVal->getType());
+    if (TotalBits <= 32) {
+      llvm::Value *SrcI32 = CoercePayloadToI32(SrcVal, SrcQT);
+      llvm::SmallVector<llvm::Value *, 2> ArgsA{IndexI32, SrcI32};
+      llvm::Value *ResI32 =
+          B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsA);
+      llvm::Value *Res = CoerceFromI32ToType(ResI32, RetTy, SrcQT);
+      return Res;
+    }
+  }
+
+  // Fast path: non-aggregate with size being a multiple of 32 bits; bitcast to
+  // <N x i32> and permute per word.
+  if (!IsAggregate) {
+    llvm::Value *SrcVal = CGF.EmitScalarExpr(Call->getArg(1));
+    unsigned TotalBits = GetBitWidth(SrcVal->getType());
+    auto [Words, Tail] = WordCountAndTail(TotalBits);
+    if (Words > 0 && Tail == 0) {
+      llvm::Type *I32VecTy = llvm::FixedVectorType::get(I32, Words);
+
+      llvm::Value *AsIntN = SrcVal;
+      if (SrcVal->getType()->isPointerTy()) {
+        unsigned PW = DL.getPointerSizeInBits(
+            SrcVal->getType()->getPointerAddressSpace());
+        AsIntN = B.CreatePtrToInt(SrcVal, B.getIntNTy(PW));
+      }
+
+      llvm::Value *AsI32Vec = B.CreateBitCast(AsIntN, I32VecTy);
+
+      llvm::Value *ResVec = llvm::UndefValue::get(I32VecTy);
+      for (unsigned WordIndex = 0; WordIndex < Words; ++WordIndex) {
+        llvm::Value *Lane = B.CreateExtractElement(AsI32Vec, C32(WordIndex));
+        llvm::Value *Perm =
+            B.CreateCall(Bperm->getFunctionType(), Bperm, {IndexI32, Lane});
+        ResVec = B.CreateInsertElement(ResVec, Perm, C32(WordIndex));
+      }
+
+      llvm::Value *ResIntN = B.CreateBitCast(ResVec, AsIntN->getType());
+      llvm::Value *Res = ResIntN;
+      if (RetTy->isPointerTy())
+        Res = B.CreateIntToPtr(ResIntN, RetTy);
+
+      return Res;
+    }
+  }
+
+  // General path: handle aggregates or odd sizes by materializing to memory,
+  // permuting 4-byte words, and packing/unpacking tail bytes.
+  auto EmitAggregatePath = [&]() -> llvm::Value * {
+    clang::QualType SrcQTLocal = Call->getArg(1)->getType();
+    llvm::Type *SrcTy = CGF.ConvertType(SrcQTLocal);
+
+    clang::CodeGen::Address SrcAddr =
+        CGF.CreateMemTemp(SrcQTLocal, "dsbperm.src");
+    CGF.EmitAnyExprToMem(Call->getArg(1), SrcAddr, SrcQTLocal.getQualifiers(),
+                         true);
+
+    clang::CodeGen::Address DestAddr = Dest.getAddress();
+
+    clang::CodeGen::Address SrcI8Addr = SrcAddr.withElementType(I8);
+    clang::CodeGen::Address DstI8Addr = DestAddr.withElementType(I8);
+
+    auto CU = [&](uint64_t N) { return clang::CharUnits::fromQuantity(N); };
+
+    uint64_t SizeBytes = DL.getTypeAllocSize(SrcTy);
+    uint64_t Words = SizeBytes / 4;
+    uint64_t Tail = SizeBytes % 4;
+
+    for (uint64_t WordIndex = 0; WordIndex < Words; ++WordIndex) {
+      uint64_t Off = WordIndex * 4;
+
+      clang::CodeGen::Address SrcWordI8Addr =
+          B.CreateConstInBoundsByteGEP(SrcI8Addr, CU(Off));
+      clang::CodeGen::Address DstWordI8Addr =
+          B.CreateConstInBoundsByteGEP(DstI8Addr, CU(Off));
+
+      clang::CodeGen::Address SrcWordI32Addr =
+          SrcWordI8Addr.withElementType(I32);
+      clang::CodeGen::Address DstWordI32Addr =
+          DstWordI8Addr.withElementType(I32);
+
+      auto *Ld = B.CreateLoad(SrcWordI32Addr);
+
+      llvm::SmallVector<llvm::Value *, 2> ArgsWord{IndexI32, Ld};
+      llvm::Value *Perm =
+          B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsWord);
+
+      auto *St = B.CreateStore(Perm, DstWordI32Addr);
+      if (Dest.isVolatile())
+        St->setVolatile(true);
+    }
+
+    if (Tail) {
+      uint64_t Off = Words * 4;
+
+      llvm::Value *Pack = llvm::ConstantInt::get(I32, 0);
+      for (uint64_t ByteIndex = 0; ByteIndex < Tail; ++ByteIndex) {
+        clang::CodeGen::Address ByteAddr =
+            B.CreateConstInBoundsByteGEP(SrcI8Addr, CU(Off + ByteIndex));
+        auto *Lb = B.CreateLoad(ByteAddr);
+
+        llvm::Value *Z = B.CreateZExt(Lb, I32);
+        if (ByteIndex != 0)
+          Z = B.CreateShl(Z, C32(8 * ByteIndex));
+        Pack = B.CreateOr(Pack, Z);
+      }
+
+      llvm::SmallVector<llvm::Value *, 2> ArgsTail{IndexI32, Pack};
+      llvm::Value *Perm =
+          B.CreateCall(Bperm->getFunctionType(), Bperm, ArgsTail);
+
+      for (uint64_t ByteIndex = 0; ByteIndex < Tail; ++ByteIndex) {
+        llvm::Value *Byte =
+            B.CreateTrunc(B.CreateLShr(Perm, C32(8 * ByteIndex)), I8);
+        clang::CodeGen::Address ByteAddr =
+            B.CreateConstInBoundsByteGEP(DstI8Addr, CU(Off + ByteIndex));
+        auto *St = B.CreateStore(Byte, ByteAddr);
+        if (Dest.isVolatile())
+          St->setVolatile(true);
+      }
+    }
+
+    return CGF.Builder.getTrue();
+  };
+
+  return EmitAggregatePath();
+}
+
 } // namespace
 
 // Generates the IR for __builtin_read_exec_*.
@@ -296,7 +534,8 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
 }
 
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
-                                              const CallExpr *E) {
+                                              const CallExpr *E,
+                                              ReturnValueSlot ReturnValue) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
   llvm::SyncScope::ID SSID;
   switch (BuiltinID) {
@@ -341,6 +580,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_ds_swizzle:
     return emitBuiltinWithOneOverloadedType<2>(*this, E,
                                                Intrinsic::amdgcn_ds_swizzle);
+
+  case AMDGPU::BI__builtin_amdgcn_ds_bpermute:
+    return emitAMDGCNDsBpermute(*this, E, ReturnValue);
+
   case AMDGPU::BI__builtin_amdgcn_mov_dpp8:
   case AMDGPU::BI__builtin_amdgcn_mov_dpp:
   case AMDGPU::BI__builtin_amdgcn_update_dpp: {
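
A note for readers following the lowering above: tail bytes are packed
little-endian into one i32, permuted once, and unpacked. A standalone host-side
C++ sketch of the same packing scheme (illustrative only, not code from this
patch):

  #include <cstdint>

  // Pack up to 3 tail bytes into a 32-bit word, mirroring the shl/or
  // sequence emitted in EmitAggregatePath.
  uint32_t packTail(const uint8_t *Bytes, unsigned Tail) {
    uint32_t Pack = 0;
    for (unsigned I = 0; I < Tail; ++I)
      Pack |= uint32_t(Bytes[I]) << (8 * I);
    return Pack;
  }

  // Unpack the permuted word back into tail bytes, mirroring lshr/trunc.
  void unpackTail(uint32_t Word, uint8_t *Bytes, unsigned Tail) {
    for (unsigned I = 0; I < Tail; ++I)
      Bytes[I] = uint8_t(Word >> (8 * I));
  }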
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index 1913bb830ccd0..cc7d991c5b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -18,6 +18,48 @@
 #include "llvm/Support/AtomicOrdering.h"
 #include <cstdint>
 
+namespace {
+
+using llvm::StringRef;
+using namespace clang;
+
+/// Validates and coerces the arguments to __builtin_amdgcn_ds_bpermute.
+/// Ensures the first argument (index) is int32 (emitting a truncation warning
+/// as needed) and sets the return type to match the second argument (source).
+bool checkDsBpermuteFunctionCall(Sema &SemaRef, CallExpr *TheCall) {
+  if (SemaRef.checkArgCount(TheCall, 2))
+    return true;
+
+  ASTContext &AstContext = SemaRef.getASTContext();
+
+  const FunctionDecl *FuncDecl = TheCall->getDirectCallee();
+  StringRef BuiltinName = FuncDecl ? FuncDecl->getName()
+                                   : StringRef("__builtin_amdgcn_ds_bpermute");
+
+  Expr *IndexArg = TheCall->getArg(0);
+  QualType Int32Ty = AstContext.IntTy;
+
+  if (AstContext.getTypeSize(IndexArg->getType()) > 32 &&
+      !IndexArg->getType()->isRecordType())
+    SemaRef.Diag(IndexArg->getBeginLoc(),
+                 diag::warn_amdgcn_builtin_arg_truncation)
+        << AstContext.getTypeSize(IndexArg->getType()) << 32 << BuiltinName
+        << IndexArg->getSourceRange();
+
+  ExprResult ConvResult = SemaRef.PerformImplicitConversion(
+      IndexArg, Int32Ty, AssignmentAction::Converting);
+  if (ConvResult.isInvalid())
+    return true;
+
+  TheCall->setArg(0, ConvResult.get());
+  TheCall->setType(TheCall->getArg(1)->getType());
+
+  return false;
+}
+
+} // anonymous namespace
+
 namespace clang {
 
 SemaAMDGPU::SemaAMDGPU(Sema &S) : SemaBase(S) {}
@@ -100,6 +142,8 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_cvt_scale_pk16_f32_fp6:
   case AMDGPU::BI__builtin_amdgcn_cvt_scale_pk16_f32_bf6:
     return SemaRef.BuiltinConstantArgRange(TheCall, 2, 0, 7);
+  case AMDGPU::BI__builtin_amdgcn_ds_bpermute:
+    return checkDsBpermuteFunctionCall(SemaRef, TheCall);
   default:
     return false;
   }
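
In practice the check above means the index argument follows the usual implicit
conversions to int, with a warning for wider integer types; a brief sketch under
those rules (function name illustrative, behavior per the Sema test below):

  __device__ void idx_conversions(long l, float f, int payload) {
    __builtin_amdgcn_ds_bpermute(l, payload); // warns: 64-bit index truncated
    __builtin_amdgcn_ds_bpermute(f, payload); // OK: float converts to int
  }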
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index 6e777fb9aec8e..4794e2dd14c14 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -2079,7 +2079,8 @@ bool Sema::CheckTSBuiltinFunctionCall(const TargetInfo &TI, unsigned BuiltinID,
   case llvm::Triple::spirv64:
     if (TI.getTriple().getOS() != llvm::Triple::OSType::AMDHSA)
       return SPIRV().CheckSPIRVBuiltinFunctionCall(TI, BuiltinID, TheCall);
-    return false;
+    else
+      return AMDGPU().CheckAMDGCNBuiltinFunctionCall(BuiltinID, TheCall);
   case llvm::Triple::systemz:
     return SystemZ().CheckSystemZBuiltinFunctionCall(BuiltinID, TheCall);
   case llvm::Triple::x86:
diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
new file mode 100644
index 0000000000000..97b722e46b439
--- /dev/null
+++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
@@ -0,0 +1,347 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1200 -aux-triple x86_64-pc-linux-gnu \
+// RUN:    -fcuda-is-device -emit-llvm -o - %s | FileCheck %s
+
+typedef short short2 __attribute__((vector_size(4)));
+typedef double double2 __attribute__((ext_vector_type(2)));
+
+struct Inner { short a; char b; };
+struct Outer { int x; struct Inner y; char z; };
+
+union U { int i; char c; };
+
+#define __device__ __attribute__((device))
+
+// CHECK-LABEL: define{{.*}}@test_index_i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 %1)
+extern "C" __device__ int test_index_i32(int a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_long
+// CHECK: [[TRUNC:%.*]] = trunc i64 %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[TRUNC]], i32 %1)
+extern "C" __device__ int test_index_long(long a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_short
+// CHECK: [[EXT:%.*]] = sext i16 %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1)
+extern "C" __device__ int test_index_short(short a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_float
+// CHECK: [[CONV:%.*]] = fptosi float %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CONV]], i32 %1)
+extern "C" __device__ int test_index_float(float a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_index_struct
+// CHECK: [[CALL:%.*]] = call noundef i32 @_ZNK11ConvertiblecviEv(
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CALL]], i32 %{{[0-9]+}})
+struct Convertible {
+  int value;
+  __device__ operator int() const { return value; }
+};
+
+extern "C" __device__ int test_index_struct(Convertible a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+template<typename T>
+__device__ int test_template(T a, int c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_template_short
+// CHECK: [[EXT:%.*]] = sext i16 %0 to i32
+// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1)
+extern "C" __device__ int test_template_short(short a, int c) {
+   return test_template<short>(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_float
+// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]])
+// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float
+// CHECK: ret float [[RESULT]]
+extern "C" __device__ float test_source_float(int a, float c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_v2i16
+// CHECK: [[BITCAST:%.*]] = bitcast <2 x i16> %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]])
+// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to <2 x i16>
+// CHECK: ret <2 x i16> [[RESULT]]
+extern "C" __device__ short2 test_source_v2i16(int a, short2 c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_float16
+// CHECK: [[BITCAST:%.*]] = bitcast half %1 to i16
+// CHECK: [[ZEXT:%.*]] = zext i16 [[BITCAST]] to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]])
+// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16
+// CHECK: [[RESULT:%.*]] = bitcast i16 [[TRUNC]] to half
+// CHECK: ret half [[RESULT]]
+extern "C" __device__ _Float16 test_source_float16(int a, _Float16 c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_short
+// CHECK: [[SEXT:%.*]] = sext i16 %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[SEXT]])
+// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16
+// CHECK: ret i16 [[TRUNC]]
+extern "C" __device__ short test_source_short(int a, short c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_ushort
+// CHECK: [[ZEXT:%.*]] = zext i16 %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]])
+// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16
+// CHECK: ret i16 [[TRUNC]]
+extern "C" __device__ unsigned short test_source_ushort(int a, unsigned short c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_long
+// CHECK: [[BC:%.*]] = bitcast i64 {{.*}} to <2 x i32>
+// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0
+// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]])
+// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0
+// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1
+// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]])
+// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1
+// CHECK: [[RES:%.*]] = bitcast <2 x i32> [[V1]] to i64
+// CHECK: ret i64 [[RES]]
+extern "C" __device__ long test_source_long(int a, long c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_ulong
+// CHECK: [[BC:%.*]] = bitcast i64 {{.*}} to <2 x i32>
+// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0
+// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]])
+// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0
+// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1
+// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]])
+// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1
+// CHECK: [[RES:%.*]] = bitcast <2 x i32> [[V1]] to i64
+// CHECK: ret i64 [[RES]]
+extern "C" __device__ unsigned long test_source_ulong(int a, unsigned long c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_double
+// CHECK: [[BC:%.*]] = bitcast double {{.*}} to <2 x i32>
+// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0
+// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]])
+// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0
+// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1
+// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]])
+// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1
+// CHECK: [[RES:%.*]] = bitcast <2 x i32> [[V1]] to double
+// CHECK: ret double [[RES]]
+extern "C" __device__ double test_source_double(int a, double c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_ptr
+// CHECK: [[P2I:%.*]] = ptrtoint ptr {{.*}} to i64
+// CHECK: [[BC:%.*]] = bitcast i64 [[P2I]] to <2 x i32>
+// CHECK: [[LO:%.*]] = extractelement <2 x i32> [[BC]], i32 0
+// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]])
+// CHECK: [[V0:%.*]] = insertelement <2 x i32> undef, i32 [[RLO]], i32 0
+// CHECK: [[HI:%.*]] = extractelement <2 x i32> [[BC]], i32 1
+// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]])
+// CHECK: [[V1:%.*]] = insertelement <2 x i32> [[V0]], i32 [[RHI]], i32 1
+// CHECK: [[I64RES:%.*]] = bitcast <2 x i32> [[V1]] to i64
+// CHECK: [[PRES:%.*]] = inttoptr i64 [[I64RES]] to ptr
+// CHECK: ret ptr [[PRES]]
+extern "C" __device__ void* test_source_ptr(int a, void* c) {
+  return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_double2
+// CHECK: [[BC:%.*]] = bitcast <2 x double> {{.*}} to <4 x i32>
+// CHECK: [[E0:%.*]] = extractelement <4 x i32> [[BC]], i32 0
+// CHECK: [[R0:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E0]])
+// CHECK: [[V0:%.*]] = insertelement <4 x i32> undef, i32 [[R0]], i32 0
+// CHECK: [[E1:%.*]] = extractelement <4 x i32> [[BC]], i32 1
+// CHECK: [[R1:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E1]])
+// CHECK: [[V1:%.*]] = insertelement <4 x i32> [[V0]], i32 [[R1]], i32 1
+// CHECK: [[E2:%.*]] = extractelement <4 x i32> [[BC]], i32 2
+// CHECK: [[R2:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E2]])
+// CHECK: [[V2:%.*]] = insertelement <4 x i32> [[V1]], i32 [[R2]], i32 2
+// CHECK: [[E3:%.*]] = extractelement <4 x i32> [[BC]], i32 3
+// CHECK: [[R3:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[E3]])
+// CHECK: [[V3:%.*]] = insertelement <4 x i32> [[V2]], i32 [[R3]], i32 3
+// CHECK: [[RES:%.*]] = bitcast <4 x i32> [[V3]] to <2 x double>
+// CHECK: ret <2 x double> [[RES]]
+extern "C" __device__ double2 test_source_double2(int a, double2 c) {
+  return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_template_float_src
+// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32
+// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]])
+// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float
+// CHECK: ret float [[RESULT]]
+template<typename T>
+__device__ T test_template_src(int a, T c) {
+   return __builtin_amdgcn_ds_bpermute(a, c);
+}
+
+extern "C" __device__ float test_template_float_src(int a, float c) {
+   return test_template_src<float>(a, c);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_nested(
+// CHECK: entry:
+// CHECK:   %retval = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK:   %src = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK:   %idx.addr = alloca i32, align 4, addrspace(5)
+// CHECK:   %dsbperm.src = alloca %struct.Outer, align 4, addrspace(5)
+// CHECK:   %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
+// CHECK:   %src1 = addrspacecast ptr addrspace(5) %src to ptr
+// CHECK:   %idx.addr.ascast = addrspacecast ptr addrspace(5) %idx.addr to ptr
+// CHECK:   %dsbperm.src.ascast = addrspacecast ptr addrspace(5) %dsbperm.src to ptr
+// Materialize src aggregate from coerced pieces
+// CHECK:   %[[SRC0_GEP:.*]] = getelementptr inbounds nuw %struct.Outer, ptr %src1, i32 0, i32 0
+// CHECK:   store i32 %src.coerce0, ptr %[[SRC0_GEP]], align 4
+// CHECK:   %[[SRC1_GEP:.*]] = getelementptr inbounds nuw %struct.Outer, ptr %src1, i32 0, i32 1
+// CHECK:   store %struct.Inner %src.coerce1, ptr %[[SRC1_GEP]], align 4
+// CHECK:   %[[SRC2_GEP:.*]] = getelementptr inbounds nuw %struct.Outer, ptr %src1, i32 0, i32 2
+// CHECK:   store i8 %src.coerce2, ptr %[[SRC2_GEP]], align 4
+// CHECK:   store i32 %idx, ptr %idx.addr.ascast, align 4
+// CHECK:   %[[IDX:.*]] = load i32, ptr %idx.addr.ascast, align 4
+// Forward src bytes to source buffer
+// CHECK:   call void @llvm.memcpy.p0.p0.i64(ptr align 4 %dsbperm.src.ascast, ptr align 4 %src1, i64 12, i1 false)
+
+// First 4-byte word at offset 0 -> write directly to retval
+// CHECK:   %[[SRC0:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 0
+// CHECK:   %[[DST0:.*]] = getelementptr inbounds i8, ptr %retval.ascast, i64 0
+// CHECK:   %[[LD0:.*]] = load i32, ptr %[[SRC0]], align 4
+// CHECK:   %[[P0:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD0]])
+// CHECK:   store i32 %[[P0]], ptr %[[DST0]], align 4
+
+// Second 4-byte word at offset 4
+// CHECK:   %[[SRC1B:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 4
+// CHECK:   %[[DST1:.*]] = getelementptr inbounds i8, ptr %retval.ascast, i64 4
+// CHECK:   %[[LD1:.*]] = load i32, ptr %[[SRC1B]], align 4
+// CHECK:   %[[P1:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD1]])
+// CHECK:   store i32 %[[P1]], ptr %[[DST1]], align 4
+
+// Third 4-byte word at offset 8 (size is 12 bytes total)
+// CHECK:   %[[SRC2B:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 8
+// CHECK:   %[[DST2:.*]] = getelementptr inbounds i8, ptr %retval.ascast, i64 8
+// CHECK:   %[[LD2:.*]] = load i32, ptr %[[SRC2B]], align 4
+// CHECK:   %[[P2:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD2]])
+// CHECK:   store i32 %[[P2]], ptr %[[DST2]], align 4
+
+// Return the aggregate from retval.ascast
+// CHECK:   %[[RES:.*]] = load %struct.Outer, ptr %retval.ascast, align 4
+// CHECK:   ret %struct.Outer %[[RES]]
+extern "C" __device__ Outer test_source_nested(int idx, Outer src) {
+  return __builtin_amdgcn_ds_bpermute(idx, src);
+}
+
+// CHECK-LABEL: define{{.*}}@test_source_union(
+// CHECK: entry:
+// CHECK:   %retval = alloca %union.U, align 4, addrspace(5)
+// CHECK:   %src = alloca %union.U, align 4, addrspace(5)
+// CHECK:   %idx.addr = alloca i32, align 4, addrspace(5)
+// CHECK:   %dsbperm.src = alloca %union.U, align 4, addrspace(5)
+// CHECK:   %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
+// CHECK:   %src1 = addrspacecast ptr addrspace(5) %src to ptr
+// CHECK:   %idx.addr.ascast = addrspacecast ptr addrspace(5) %idx.addr to ptr
+// CHECK:   %dsbperm.src.ascast = addrspacecast ptr addrspace(5) %dsbperm.src to ptr
+// Materialize src union from coerced piece
+// CHECK:   %[[COERCE_DST:.*]] = getelementptr inbounds nuw %union.U, ptr %src1, i32 0, i32 0
+// CHECK:   store i32 %src.coerce, ptr %[[COERCE_DST]], align 4
+// CHECK:   store i32 %idx, ptr %idx.addr.ascast, align 4
+// CHECK:   %[[IDX:.*]] = load i32, ptr %idx.addr.ascast, align 4
+// Forward src bytes to source buffer
+// CHECK:   call void @llvm.memcpy.p0.p0.i64(ptr align 4 %dsbperm.src.ascast, ptr align 4 %src1, i64 4, i1 false)
+// Single 4-byte word -> write directly to retval
+// CHECK:   %[[SRC0:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 0
+// CHECK:   %[[DST0:.*]] = getelementptr inbounds i8, ptr %retval.ascast, i64 0
+// CHECK:   %[[LD0:.*]] = load i32, ptr %[[SRC0]], align 4
+// CHECK:   %[[P0:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LD0]])
+// CHECK:   store i32 %[[P0]], ptr %[[DST0]], align 4
+// Coerce return from retval
+// CHECK:   %[[COERCE_OUT:.*]] = getelementptr inbounds{{.*}} %union.U, ptr %retval.ascast, i32 0, i32 0
+// CHECK:   %[[RES:.*]] = load i32, ptr %[[COERCE_OUT]], align 4
+// CHECK:   ret i32 %[[RES]]
+extern "C" __device__ U test_source_union(int idx, U src) {
+  return __builtin_amdgcn_ds_bpermute(idx, src);
+}
+
+// CHECK-LABEL: define{{.*}}{ double, double } @test_source_cdouble(i32 {{[^,]*}}, double noundef %src.coerce0, double noundef %src.coerce1)
+// Materialize the coerced _Complex double argument into the local aggregate
+// CHECK:   %retval = alloca { double, double }, align 8, addrspace(5)
+// CHECK:   %src = alloca { double, double }, align 8, addrspace(5)
+// CHECK:   %idx.addr = alloca i32, align 4, addrspace(5)
+// CHECK:   %agg.tmp = alloca { double, double }, align 8, addrspace(5)
+// CHECK:   %dsbperm.src = alloca { double, double }, align 8, addrspace(5)
+// CHECK:   %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
+// CHECK:   %src1 = addrspacecast ptr addrspace(5) %src to ptr
+// CHECK:   %idx.addr.ascast = addrspacecast ptr addrspace(5) %idx.addr to ptr
+// CHECK:   %agg.tmp.ascast = addrspacecast ptr addrspace(5) %agg.tmp to ptr
+// CHECK:   %dsbperm.src.ascast = addrspacecast ptr addrspace(5) %dsbperm.src to ptr
+// CHECK:   %[[SRC0_GEP:.*]] = getelementptr inbounds nuw { double, double }, ptr %src1, i32 0, i32 0
+// CHECK:   store double %src.coerce0, ptr %[[SRC0_GEP]], align 8
+// CHECK:   %[[SRC1_GEP:.*]] = getelementptr inbounds nuw { double, double }, ptr %src1, i32 0, i32 1
+// CHECK:   store double %src.coerce1, ptr %[[SRC1_GEP]], align 8
+// Load the real/imag parts and forward into the source buffer for permutation
+// CHECK:   store i32 %idx, ptr %idx.addr.ascast, align 4
+// CHECK:   %[[IDX:.*]] = load i32, ptr %idx.addr.ascast, align 4
+// CHECK:   %src1.realp = getelementptr inbounds nuw { double, double }, ptr %src1, i32 0, i32 0
+// CHECK:   %[[SRC_REAL:.*]] = load double, ptr %src1.realp, align 8
+// CHECK:   %src1.imagp = getelementptr inbounds nuw { double, double }, ptr %src1, i32 0, i32 1
+// CHECK:   %[[SRC_IMAG:.*]] = load double, ptr %src1.imagp, align 8
+// CHECK:   %[[SRCBUF_REALP:.*]] = getelementptr inbounds nuw { double, double }, ptr %dsbperm.src.ascast, i32 0, i32 0
+// CHECK:   %[[SRCBUF_IMAGP:.*]] = getelementptr inbounds nuw { double, double }, ptr %dsbperm.src.ascast, i32 0, i32 1
+// CHECK:   store double %[[SRC_REAL]], ptr %[[SRCBUF_REALP]], align 8
+// CHECK:   store double %[[SRC_IMAG]], ptr %[[SRCBUF_IMAGP]], align 8
+// Split the complex double into 32-bit words and pass each to the intrinsic; write into agg.tmp
+// CHECK:   %[[SRC_I8_0:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 0
+// CHECK:   %[[DST_I8_0:.*]] = getelementptr inbounds i8, ptr %agg.tmp.ascast, i64 0
+// CHECK:   %[[LDW0:.*]] = load i32, ptr %[[SRC_I8_0]], align 8
+// CHECK:   %[[P0:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LDW0]])
+// CHECK:   store i32 %[[P0]], ptr %[[DST_I8_0]], align 8
+// CHECK:   %[[SRC_I8_4:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 4
+// CHECK:   %[[DST_I8_4:.*]] = getelementptr inbounds i8, ptr %agg.tmp.ascast, i64 4
+// CHECK:   %[[LDW1:.*]] = load i32, ptr %[[SRC_I8_4]], align 4
+// CHECK:   %[[P1:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LDW1]])
+// CHECK:   store i32 %[[P1]], ptr %[[DST_I8_4]], align 4
+// CHECK:   %[[SRC_I8_8:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 8
+// CHECK:   %[[DST_I8_8:.*]] = getelementptr inbounds i8, ptr %agg.tmp.ascast, i64 8
+// CHECK:   %[[LDW2:.*]] = load i32, ptr %[[SRC_I8_8]], align 8
+// CHECK:   %[[P2:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LDW2]])
+// CHECK:   store i32 %[[P2]], ptr %[[DST_I8_8]], align 8
+// CHECK:   %[[SRC_I8_12:.*]] = getelementptr inbounds i8, ptr %dsbperm.src.ascast, i64 12
+// CHECK:   %[[DST_I8_12:.*]] = getelementptr inbounds i8, ptr %agg.tmp.ascast, i64 12
+// CHECK:   %[[LDW3:.*]] = load i32, ptr %[[SRC_I8_12]], align 4
+// CHECK:   %[[P3:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %[[IDX]], i32 %[[LDW3]])
+// CHECK:   store i32 %[[P3]], ptr %[[DST_I8_12]], align 4
+// Reconstruct the complex double into retval and return it
+// CHECK:   %[[AGG_REALP:.*]] = getelementptr inbounds nuw { double, double }, ptr %agg.tmp.ascast, i32 0, i32 0
+// CHECK:   %[[AGG_REAL:.*]] = load double, ptr %[[AGG_REALP]], align 8
+// CHECK:   %[[AGG_IMAGP:.*]] = getelementptr inbounds nuw { double, double }, ptr %agg.tmp.ascast, i32 0, i32 1
+// CHECK:   %[[AGG_IMAG:.*]] = load double, ptr %[[AGG_IMAGP]], align 8
+// CHECK:   %[[RET_REALP:.*]] = getelementptr inbounds nuw { double, double }, ptr %retval.ascast, i32 0, i32 0
+// CHECK:   %[[RET_IMAGP:.*]] = getelementptr inbounds nuw { double, double }, ptr %retval.ascast, i32 0, i32 1
+// CHECK:   store double %[[AGG_REAL]], ptr %[[RET_REALP]], align 8
+// CHECK:   store double %[[AGG_IMAG]], ptr %[[RET_IMAGP]], align 8
+// CHECK:   %[[RETVAL_AGG:.*]] = load { double, double }, ptr %retval.ascast, align 8
+// CHECK:   ret { double, double } %[[RETVAL_AGG]]
+extern "C" __device__ _Complex double test_source_cdouble(int idx, _Complex double src) {
+  return __builtin_amdgcn_ds_bpermute(idx, src);
+}
diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
new file mode 100644
index 0000000000000..9823bc1dc6719
--- /dev/null
+++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
@@ -0,0 +1,37 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -fsyntax-only -verify -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -Wno-unused-value %s
+// RUN: %clang_cc1 -fsyntax-only -verify -triple x86_64-pc-linux-gnu -aux-triple amdgcn -Wno-unused-value %s
+#define __device__ __attribute__((device))
+
+typedef short short2 __attribute__((ext_vector_type(2)));
+typedef int int2 __attribute__((ext_vector_type(2)));
+
+struct A { int a; };
+
+__device__ void test_invalid_num_args(int a, int b, int c) {
+   __builtin_amdgcn_ds_bpermute(a, b, c); // expected-error {{too many arguments to function call, expected 2, have 3}}
+}
+
+__device__ void test_invalid_index(short2 a, int b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'short2' (vector of 2 'short' values) to incompatible type 'int'}}
+}
+
+__device__ void test_warn_long_index(long a, int b) {
+   __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit argument will be truncated to 32 bits in this call to __builtin_amdgcn_ds_bpermute}}
+}
+
+__device__ void test_vector_src(int a, int2 b) {
+   __builtin_amdgcn_ds_bpermute(a, b);
+}
+
+__device__ void test_long_src(int a, long b) {
+   __builtin_amdgcn_ds_bpermute(a, b);
+}
+
+__device__ void test_pointer_src(int a, void* b) {
+   __builtin_amdgcn_ds_bpermute(a, b);
+}
+
+__device__ void test_struct_src(int a, A b) {
+   __builtin_amdgcn_ds_bpermute(a, b);
+}
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index abd83c7c4d4a7..d0b82cdab46ab 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2766,10 +2766,10 @@ def int_amdgcn_ds_permute :
     [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // llvm.amdgcn.ds.bpermute <index> <src>
-def int_amdgcn_ds_bpermute :
-  ClangBuiltin<"__builtin_amdgcn_ds_bpermute">,
-  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
-     [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+def int_amdgcn_ds_bpermute
+    : Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], [
+      IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree
+    ]>;
 
 // llvm.amdgcn.perm <src0> <src1> <selector>
 def int_amdgcn_perm :
