https://github.com/yxsamliu updated https://github.com/llvm/llvm-project/pull/153501
From a63b24e006a81981cbeab8cf6b80c6e40e41ee3b Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Sat, 23 Aug 2025 14:55:50 -0400
Subject: [PATCH] [AMDGPU] Extend __builtin_amdgcn_ds_bpermute argument types

This change makes __builtin_amdgcn_ds_bpermute polymorphic over arbitrary
trivially copyable types: the source (second) argument may be any such type,
and the return type matches it. CodeGen converts the source value to an
integer of the same bit width, zero-extends it to the next multiple of 32
bits, and permutes each 32-bit word independently via the ds_bpermute_b32
intrinsic. The result is reassembled, truncated back to the original bit
width, and converted back to the source type. Scalars and vectors use
bitcast/ptrtoint; aggregates use a store-as-aggregate + load-as-integer
pattern that SROA can optimize away.

This patch also introduces a new AMDGPUSupport.rst section documenting
Clang's AMDGPU-specific builtins, starting with __builtin_amdgcn_ds_bpermute.
---
 clang/docs/AMDGPUSupport.rst                  |  91 +++++
 clang/include/clang/Basic/BuiltinsAMDGPU.td   |   2 +-
 .../clang/Basic/DiagnosticSemaKinds.td        |   5 +
 clang/lib/CodeGen/CGBuiltin.cpp               |  13 +-
 clang/lib/CodeGen/CodeGenFunction.h           |   3 +-
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   | 108 +++++-
 clang/lib/Sema/SemaAMDGPU.cpp                 |  44 +++
 clang/lib/Sema/SemaChecking.cpp               |   3 +-
 .../CodeGenHIP/builtin-amdgcn-ds-bpermute.hip | 344 ++++++++++++++++++
 .../builtin-amdgcn-ds-bpermute.cl             |  72 ++++
 .../SemaHIP/builtin-amdgcn-ds-bpermute.hip    |  37 ++
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td      |   1 -
 12 files changed, 714 insertions(+), 9 deletions(-)
 create mode 100644 clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip
 create mode 100644 clang/test/CodeGenOpenCL/builtin-amdgcn-ds-bpermute.cl
 create mode 100644 clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip

diff --git a/clang/docs/AMDGPUSupport.rst b/clang/docs/AMDGPUSupport.rst
index 18e3de8abe92a..4aef02416681b 100644
--- a/clang/docs/AMDGPUSupport.rst
+++ b/clang/docs/AMDGPUSupport.rst
@@ -57,3 +57,94 @@ Predefined Macros
    - Defined if FP64 instruction is available (deprecated).
 Please note that the specific architecture and feature names will vary depending on the GPU.
 Also, some macros are deprecated and may be removed in future releases.
+
+AMDGPU Builtins
+===============
+
+Clang provides a set of builtins to access low-level, AMDGPU-specific hardware
+features directly from C, C++, OpenCL C, and HIP. These builtins often map
+directly to a single machine instruction.
+
+.. _builtin-amdgcn-ds-bpermute:
+
+``__builtin_amdgcn_ds_bpermute``
+--------------------------------
+
+Performs a backward (pull) permutation of values within a wavefront. This builtin compiles to the
+``ds_bpermute_b32`` instruction and implements a "read from lane" semantic using a **byte-based**
+address.
+
+**Syntax**
+
+.. code-block:: c++
+
+   T __builtin_amdgcn_ds_bpermute(int index, T src);
+
+**Summary**
+
+All active lanes in the current wavefront conceptually place their ``src`` payloads into an
+internal cross-lane buffer. Each lane then reads a 32-bit value from that buffer at the byte
+offset given by ``index`` and returns it as type ``T``. The exchange uses LDS hardware paths
+but does not access user-visible LDS or imply any synchronization.
+
+This builtin is **polymorphic**: the type of ``src`` determines the return type.
+
+Availability
+~~~~~~~~~~~~
+
+- Targets: AMD GCN3 (gfx8) and newer.
+
+Parameters
+~~~~~~~~~~
+
+- ``index`` (``int``): Byte offset used to select the source lane.
+  Hardware only consumes bits ``[7:2]``. To read the 32-bit value from lane *i*, pass
+  ``i * 4`` as the index. Indices that select lanes outside the current wave size or lanes
+  that are inactive at the call site yield an unspecified value (commonly zero on current
+  hardware).
+
+- ``src`` (``T``): The value contributed by the current lane. Supported types ``T`` include
+  scalar integer and floating-point types, pointers, vectors, aggregates
+  (structs/unions/complex), and trivially copyable C++ classes. The operation is purely
+  bitwise and trivially copies memory representations; it does not invoke or respect C++
+  copy/move constructors or assignment operators. For types whose total size is not a
+  multiple of 32 bits, the value is zero-extended to the next multiple of 32 bits before
+  permutation. The value is then split into 32-bit words, each independently permuted via
+  ``ds_bpermute_b32`` using the same index, and the result is reassembled and truncated
+  back to the original bit width before being returned as type ``T``.
+
+Semantics and Guarantees
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Active lane participation**: Only lanes active in the EXEC mask at the call site
+  contribute a payload. Reading from an inactive source lane produces an unspecified value.
+
+* **Index per lane**: ``index`` may vary across lanes. Only bits ``[7:2]`` are used for lane
+  selection. Bits outside this range are ignored by hardware.
+
+* **No synchronization**: The builtin does not synchronize lanes or waves and does not
+  order memory operations. It does not read or write user-visible LDS.
+
+* **Wave size**: Valid source lanes are ``0 .. warpSize-1`` (use ``warpSize`` or an
+  equivalent query to determine whether the wave size is 32 or 64). Selecting lanes
+  outside that range yields an unspecified value.
+
+Examples
+~~~~~~~~
+
+Reverse within a wavefront (handles wave32 or wave64):
+
+..
code-block:: c++ + + #include <hip/hip_runtime.h> + + __global__ void wavefront_reverse(float* data, int n) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + if (tid >= n) return; + + int lane = threadIdx.x % warpSize; // works for 32 or 64 + int peer = (warpSize - 1) - lane; // reversed lane + int offset = peer * 4; // byte address + + float my_val = data[tid]; + float reversed = __builtin_amdgcn_ds_bpermute(offset, my_val); + data[tid] = reversed; + } diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.td b/clang/include/clang/Basic/BuiltinsAMDGPU.td index c1ca7d4fd8e77..00db039fcf391 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.td +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.td @@ -137,7 +137,7 @@ def __builtin_amdgcn_s_decperflevel : AMDGPUBuiltin<"void(_Constant int)">; def __builtin_amdgcn_s_setprio : AMDGPUBuiltin<"void(_Constant short)">; def __builtin_amdgcn_ds_swizzle : AMDGPUBuiltin<"int(int, _Constant int)", [Const]>; def __builtin_amdgcn_ds_permute : AMDGPUBuiltin<"int(int, int)", [Const]>; -def __builtin_amdgcn_ds_bpermute : AMDGPUBuiltin<"int(int, int)", [Const]>; +def __builtin_amdgcn_ds_bpermute : AMDGPUBuiltin<"void(...)", [Const, CustomTypeChecking]>; def __builtin_amdgcn_readfirstlane : AMDGPUBuiltin<"int(int)", [Const]>; def __builtin_amdgcn_readlane : AMDGPUBuiltin<"int(int, int)", [Const]>; def __builtin_amdgcn_fmed3f : AMDGPUBuiltin<"float(float, float, float)", [Const]>; diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td index 887d1b5f2bbfd..266cd80d12413 100644 --- a/clang/include/clang/Basic/DiagnosticSemaKinds.td +++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td @@ -13931,6 +13931,10 @@ def note_acc_reduction_combiner_forming // AMDGCN builtins diagnostics def err_amdgcn_load_lds_size_invalid_value : Error<"invalid size value">; def note_amdgcn_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">; +def warn_amdgcn_builtin_arg_truncation + : Warning<"the %0-bit argument will be truncated to %1 bits in this call " + "to %2">, + InGroup<DiagGroup<"amdgcn-builtin-arg-trunc">>; def err_amdgcn_coop_atomic_invalid_as : Error<"cooperative atomic requires a global or generic pointer">; @@ -13953,4 +13957,5 @@ def err_cuda_device_kernel_launch_not_supported def err_cuda_device_kernel_launch_require_rdc : Error<"kernel launch from __device__ or __global__ function requires " "relocatable device code (i.e. requires -fgpu-rdc)">; + } // end of sema component. 
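For a 64-bit payload, the word-splitting lowering described in the docs above is roughly
equivalent to the following hand-written HIP sketch. This is an illustration only:
bpermute_double is a hypothetical helper name, not something added by this patch, and the
sketch assumes HIP device compilation for gfx8 or newer.

  #include <hip/hip_runtime.h>

  // Permute a double across lanes by splitting it into two 32-bit words,
  // permuting each word with the same byte index, and reassembling the result.
  __device__ double bpermute_double(int index, double src) {
    unsigned long long bits = __builtin_bit_cast(unsigned long long, src);
    unsigned int lo = (unsigned int)bits;          // word 0 (low 32 bits)
    unsigned int hi = (unsigned int)(bits >> 32);  // word 1 (high 32 bits)
    lo = __builtin_amdgcn_ds_bpermute(index, lo);
    hi = __builtin_amdgcn_ds_bpermute(index, hi);
    bits = ((unsigned long long)hi << 32) | (unsigned long long)lo;
    return __builtin_bit_cast(double, bits);
  }

This mirrors the IR checked by the test_source_double test below: two trunc/bpermute pairs,
a zext/shl/or reassembly, and a final bitcast back to double.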
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp index 4e1b22f0a2241..b605985f85b4e 100644 --- a/clang/lib/CodeGen/CGBuiltin.cpp +++ b/clang/lib/CodeGen/CGBuiltin.cpp @@ -108,7 +108,7 @@ static Value *EmitTargetArchBuiltinExpr(CodeGenFunction *CGF, return CGF->EmitPPCBuiltinExpr(BuiltinID, E); case llvm::Triple::r600: case llvm::Triple::amdgcn: - return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E); + return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E, ReturnValue); case llvm::Triple::systemz: return CGF->EmitSystemZBuiltinExpr(BuiltinID, E); case llvm::Triple::nvptx: @@ -127,7 +127,7 @@ static Value *EmitTargetArchBuiltinExpr(CodeGenFunction *CGF, case llvm::Triple::spirv32: case llvm::Triple::spirv64: if (CGF->getTarget().getTriple().getOS() == llvm::Triple::OSType::AMDHSA) - return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E); + return CGF->EmitAMDGPUBuiltinExpr(BuiltinID, E, ReturnValue); [[fallthrough]]; case llvm::Triple::spirv: return CGF->EmitSPIRVBuiltinExpr(BuiltinID, E); @@ -6564,7 +6564,8 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID, // ReturnValue to be non-null, so that the target-specific emission code can // always just emit into it. TypeEvaluationKind EvalKind = getEvaluationKind(E->getType()); - if (EvalKind == TEK_Aggregate && ReturnValue.isNull()) { + if ((EvalKind == TEK_Aggregate || EvalKind == TEK_Complex) && + ReturnValue.isNull()) { Address DestPtr = CreateMemTemp(E->getType(), "agg.tmp"); ReturnValue = ReturnValueSlot(DestPtr, false); } @@ -6580,7 +6581,11 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID, return RValue::getAggregate(ReturnValue.getAddress(), ReturnValue.isVolatile()); case TEK_Complex: - llvm_unreachable("No current target builtin returns complex"); + // Build an LValue for the provided return slot and load the complex + // result. + LValue LV = MakeAddrLValue(ReturnValue.getAddress(), E->getType()); + ComplexPairTy C = EmitLoadOfComplex(LV, E->getExprLoc()); + return RValue::getComplex(C); } llvm_unreachable("Bad evaluation kind in EmitBuiltinExpr"); } diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h index f769fee227878..15e01f660edff 100644 --- a/clang/lib/CodeGen/CodeGenFunction.h +++ b/clang/lib/CodeGen/CodeGenFunction.h @@ -4897,7 +4897,8 @@ class CodeGenFunction : public CodeGenTypeCache { llvm::Value *BuildVector(ArrayRef<llvm::Value *> Ops); llvm::Value *EmitX86BuiltinExpr(unsigned BuiltinID, const CallExpr *E); llvm::Value *EmitPPCBuiltinExpr(unsigned BuiltinID, const CallExpr *E); - llvm::Value *EmitAMDGPUBuiltinExpr(unsigned BuiltinID, const CallExpr *E); + llvm::Value *EmitAMDGPUBuiltinExpr(unsigned BuiltinID, const CallExpr *E, + ReturnValueSlot ReturnValue); llvm::Value *EmitHLSLBuiltinExpr(unsigned BuiltinID, const CallExpr *E, ReturnValueSlot ReturnValue); diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index bff1ed3d2ec19..1eec607ff1bd2 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -165,6 +165,107 @@ Value *EmitAMDGPUGridSize(CodeGenFunction &CGF, unsigned Index) { llvm::MDNode::get(CGF.getLLVMContext(), {})); return LD; } + +// Emits LLVM IR to lower __builtin_amdgcn_ds_bpermute over arbitrary +// trivially-copyable types. The payload is converted to an integer, extended +// to a multiple of 32 bits, and each 32-bit word is permuted via the hardware +// ds_bpermute_b32 intrinsic. 
For aggregate types, a single store+load-as-int +// is used, which SROA can optimize for small types. +llvm::Value *emitAMDGCNDsBpermute(CodeGenFunction &CGF, + const CallExpr *Call, + ReturnValueSlot Dest) { + CGBuilderTy &Builder = CGF.Builder; + CodeGenModule &CGM = CGF.CGM; + const llvm::DataLayout &DL = CGM.getDataLayout(); + + llvm::Type *I32Ty = Builder.getInt32Ty(); + llvm::Function *BpermFn = + CGM.getIntrinsic(llvm::Intrinsic::amdgcn_ds_bpermute); + + // Coerce index argument to i32. + llvm::Value *Index = CGF.EmitScalarExpr(Call->getArg(0)); + if (Index->getType()->isPointerTy()) + Index = Builder.CreatePtrToInt( + Index, + Builder.getIntNTy(DL.getPointerSizeInBits( + Index->getType()->getPointerAddressSpace()))); + Index = Builder.CreateIntCast(Index, I32Ty, /*isSigned=*/false); + + QualType SrcQT = Call->getArg(1)->getType(); + llvm::Type *RetTy = CGF.ConvertType(Call->getType()); + bool IsAggregate = SrcQT->isAggregateType() || SrcQT->isAnyComplexType(); + + // Convert the source value to an integer of the same bit width. + // For scalars/vectors: bitcast or ptrtoint to iN. + // For aggregates: store to temp, load as iN. + llvm::Value *SrcInt; + unsigned SrcBits; + + if (!IsAggregate) { + llvm::Value *SrcVal = CGF.EmitScalarExpr(Call->getArg(1)); + SrcBits = DL.getTypeSizeInBits(SrcVal->getType()).getFixedValue(); + llvm::Type *SrcIntTy = Builder.getIntNTy(SrcBits); + + SrcInt = Builder.CreateBitOrPointerCast(SrcVal, SrcIntTy); + } else { + llvm::Type *SrcTy = CGF.ConvertType(SrcQT); + uint64_t SizeBytes = DL.getTypeAllocSize(SrcTy); + SrcBits = SizeBytes * 8; + llvm::Type *SrcIntTy = Builder.getIntNTy(SrcBits); + + // Store aggregate to temp, load back as integer. + Address SrcAddr = CGF.CreateMemTemp(SrcQT, "bperm.src"); + CGF.EmitAnyExprToMem(Call->getArg(1), SrcAddr, SrcQT.getQualifiers(), + /*IsInit=*/true); + Address AsIntAddr = SrcAddr.withElementType(SrcIntTy); + SrcInt = Builder.CreateLoad(AsIntAddr); + } + + // Zero-extend to the next multiple of 32 bits. + unsigned PaddedBits = llvm::alignTo(SrcBits, 32u); + SrcInt = Builder.CreateZExtOrTrunc(SrcInt, Builder.getIntNTy(PaddedBits)); + + // Permute each 32-bit word. + unsigned NumWords = PaddedBits / 32; + llvm::Value *ResInt = nullptr; + + for (unsigned I = 0; I < NumWords; ++I) { + // Extract word: (SrcInt >> (I * 32)) & 0xFFFFFFFF + llvm::Value *Word = SrcInt; + if (I > 0) + Word = Builder.CreateLShr(Word, + llvm::ConstantInt::get(Word->getType(), I * 32)); + Word = Builder.CreateTrunc(Word, I32Ty); + + // Call ds_bpermute on this word. + llvm::Value *Perm = Builder.CreateCall(BpermFn, {Index, Word}); + + // Place result word into ResInt at the correct position. + llvm::Value *Extended = + Builder.CreateZExt(Perm, Builder.getIntNTy(PaddedBits)); + if (I > 0) + Extended = Builder.CreateShl(Extended, + llvm::ConstantInt::get(Extended->getType(), I * 32)); + ResInt = (I == 0) ? Extended : Builder.CreateOr(ResInt, Extended); + } + + // Truncate back to original bit width. + ResInt = Builder.CreateZExtOrTrunc(ResInt, Builder.getIntNTy(SrcBits)); + + // Convert back to the original type. + if (IsAggregate) { + // Store integer to temp, load back as aggregate. + llvm::Type *ResTy = CGF.ConvertType(SrcQT); + Address DestAddr = Dest.getAddress(); + Address AsIntAddr = DestAddr.withElementType(Builder.getIntNTy(SrcBits)); + Builder.CreateStore(ResInt, AsIntAddr); + return Builder.getTrue(); + } + + // Scalar/vector result. 
+  return Builder.CreateBitOrPointerCast(ResInt, RetTy);
+}
+
 } // namespace
 
 // Generates the IR for __builtin_read_exec_*.
@@ -416,7 +517,8 @@ static Intrinsic::ID getIntrinsicIDforWaveReduction(unsigned BuiltinID) {
 }
 
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
-                                              const CallExpr *E) {
+                                              const CallExpr *E,
+                                              ReturnValueSlot ReturnValue) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
   llvm::SyncScope::ID SSID;
   switch (BuiltinID) {
@@ -493,6 +595,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_ds_swizzle:
     return emitBuiltinWithOneOverloadedType<2>(*this, E,
                                                Intrinsic::amdgcn_ds_swizzle);
+
+  case AMDGPU::BI__builtin_amdgcn_ds_bpermute:
+    return emitAMDGCNDsBpermute(*this, E, ReturnValue);
+
   case AMDGPU::BI__builtin_amdgcn_mov_dpp8:
   case AMDGPU::BI__builtin_amdgcn_mov_dpp:
   case AMDGPU::BI__builtin_amdgcn_update_dpp: {
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index c9833f5083d07..ac01477f20e7d 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -20,6 +20,48 @@
 #include "llvm/Support/AtomicOrdering.h"
 #include <cstdint>
 
+namespace {
+
+using llvm::StringRef;
+using namespace clang;
+
+/// Validates and coerces the arguments to __builtin_amdgcn_ds_bpermute.
+/// Ensures the first argument (index) is int32 (emitting a truncation
+/// warning as needed), and sets the return type to match the second
+/// argument (source).
+bool checkDsBpermuteFunctionCall(Sema &SemaRef, CallExpr *TheCall) {
+  if (SemaRef.checkArgCount(TheCall, 2))
+    return true;
+
+  ASTContext &AstContext = SemaRef.getASTContext();
+
+  const FunctionDecl *FuncDecl = TheCall->getDirectCallee();
+  StringRef BuiltinName = FuncDecl ?
FuncDecl->getName() + : StringRef("__builtin_amdgcn_ds_bpermute"); + + Expr *IndexArg = TheCall->getArg(0); + QualType Int32Ty = AstContext.IntTy; + + if (AstContext.getTypeSize(IndexArg->getType()) > 32 && + !IndexArg->getType()->isRecordType()) + SemaRef.Diag(IndexArg->getBeginLoc(), + diag::warn_amdgcn_builtin_arg_truncation) + << AstContext.getTypeSize(IndexArg->getType()) << 32 << BuiltinName + << IndexArg->getSourceRange(); + + ExprResult ConvResult = SemaRef.PerformImplicitConversion( + IndexArg, Int32Ty, AssignmentAction::Converting); + if (ConvResult.isInvalid()) + return true; + + TheCall->setArg(0, ConvResult.get()); + TheCall->setType(TheCall->getArg(1)->getType()); + + return false; +} + +} // anonymous namespace + namespace clang { SemaAMDGPU::SemaAMDGPU(Sema &S) : SemaBase(S) {} @@ -331,6 +373,8 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID, } return false; } + case AMDGPU::BI__builtin_amdgcn_ds_bpermute: + return checkDsBpermuteFunctionCall(SemaRef, TheCall); default: return false; } diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp index 89171246d0bcb..0cffcd50fc130 100644 --- a/clang/lib/Sema/SemaChecking.cpp +++ b/clang/lib/Sema/SemaChecking.cpp @@ -2130,7 +2130,8 @@ bool Sema::CheckTSBuiltinFunctionCall(const TargetInfo &TI, unsigned BuiltinID, case llvm::Triple::spirv64: if (TI.getTriple().getOS() != llvm::Triple::OSType::AMDHSA) return SPIRV().CheckSPIRVBuiltinFunctionCall(TI, BuiltinID, TheCall); - return false; + else + return AMDGPU().CheckAMDGCNBuiltinFunctionCall(BuiltinID, TheCall); case llvm::Triple::systemz: return SystemZ().CheckSystemZBuiltinFunctionCall(BuiltinID, TheCall); case llvm::Triple::x86: diff --git a/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip new file mode 100644 index 0000000000000..cdc6b840cc7ad --- /dev/null +++ b/clang/test/CodeGenHIP/builtin-amdgcn-ds-bpermute.hip @@ -0,0 +1,344 @@ +// REQUIRES: amdgpu-registered-target +// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1200 -aux-triple x86_64-pc-linux-gnu \ +// RUN: -fcuda-is-device -emit-llvm -o - %s | FileCheck %s + +typedef short short2 __attribute__((vector_size(4))); +typedef double double2 __attribute__((ext_vector_type(2))); + +struct Inner { short a; char b; }; +struct Outer { int x; struct Inner y; char z; }; +struct __attribute__((packed)) Three { char a, b, c; }; + +union U { int i; char c; }; + +#define __device__ __attribute__((device)) + +// CHECK-LABEL: define{{.*}}@test_index_i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 %1) +extern "C" __device__ int test_index_i32(int a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_index_long +// CHECK: [[TRUNC:%.*]] = trunc i64 %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[TRUNC]], i32 %1) +extern "C" __device__ int test_index_long(long a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_index_short +// CHECK: [[EXT:%.*]] = sext i16 %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1) +extern "C" __device__ int test_index_short(short a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_index_float +// CHECK: [[CONV:%.*]] = fptosi float %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CONV]], i32 %1) +extern "C" __device__ int test_index_float(float a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// 
CHECK-LABEL: define{{.*}}@test_index_struct +// CHECK: [[CALL:%.*]] = call noundef i32 @_ZNK11ConvertiblecviEv( +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[CALL]], i32 %{{[0-9]+}}) +struct Convertible { + int value; + __device__ operator int() const { return value; } +}; + +extern "C" __device__ int test_index_struct(Convertible a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +template<typename T> +__device__ int test_template(T a, int c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_template_short +// CHECK: [[EXT:%.*]] = sext i16 %0 to i32 +// CHECK: call i32 @llvm.amdgcn.ds.bpermute(i32 [[EXT]], i32 %1) +extern "C" __device__ int test_template_short(short a, int c) { + return test_template<short>(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_float +// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float +// CHECK: ret float [[RESULT]] +extern "C" __device__ float test_source_float(int a, float c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_v2i16 +// CHECK: [[BITCAST:%.*]] = bitcast <2 x i16> %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to <2 x i16> +// CHECK: ret <2 x i16> [[RESULT]] +extern "C" __device__ short2 test_source_v2i16(int a, short2 c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_float16 +// CHECK: [[BITCAST:%.*]] = bitcast half %1 to i16 +// CHECK: [[ZEXT:%.*]] = zext i16 [[BITCAST]] to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]]) +// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16 +// CHECK: [[RESULT:%.*]] = bitcast i16 [[TRUNC]] to half +// CHECK: ret half [[RESULT]] +extern "C" __device__ _Float16 test_source_float16(int a, _Float16 c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_short +// CHECK: [[ZEXT:%.*]] = zext i16 %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]]) +// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16 +// CHECK: ret i16 [[TRUNC]] +extern "C" __device__ short test_source_short(int a, short c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_ushort +// CHECK: [[ZEXT:%.*]] = zext i16 %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[ZEXT]]) +// CHECK: [[TRUNC:%.*]] = trunc i32 [[CALL]] to i16 +// CHECK: ret i16 [[TRUNC]] +extern "C" __device__ unsigned short test_source_ushort(int a, unsigned short c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_long +// CHECK: [[LO:%.*]] = trunc i64 %1 to i32 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = zext i32 [[RLO]] to i64 +// CHECK: [[SHR:%.*]] = lshr i64 %1, 32 +// CHECK: [[HI:%.*]] = trunc i64 [[SHR]] to i32 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = zext i32 [[RHI]] to i64 +// CHECK: [[V1S:%.*]] = shl i64 [[V1]], 32 +// CHECK: [[RES:%.*]] = or i64 [[V0]], [[V1S]] +// CHECK: ret i64 [[RES]] +extern "C" __device__ long test_source_long(int a, long c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: 
define{{.*}}@test_source_ulong +// CHECK: [[LO:%.*]] = trunc i64 %1 to i32 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = zext i32 [[RLO]] to i64 +// CHECK: [[SHR:%.*]] = lshr i64 %1, 32 +// CHECK: [[HI:%.*]] = trunc i64 [[SHR]] to i32 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = zext i32 [[RHI]] to i64 +// CHECK: [[V1S:%.*]] = shl i64 [[V1]], 32 +// CHECK: [[RES:%.*]] = or i64 [[V0]], [[V1S]] +// CHECK: ret i64 [[RES]] +extern "C" __device__ unsigned long test_source_ulong(int a, unsigned long c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_double +// CHECK: [[BC:%.*]] = bitcast double %1 to i64 +// CHECK: [[LO:%.*]] = trunc i64 [[BC]] to i32 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = zext i32 [[RLO]] to i64 +// CHECK: [[SHR:%.*]] = lshr i64 [[BC]], 32 +// CHECK: [[HI:%.*]] = trunc i64 [[SHR]] to i32 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = zext i32 [[RHI]] to i64 +// CHECK: [[V1S:%.*]] = shl i64 [[V1]], 32 +// CHECK: [[RES64:%.*]] = or i64 [[V0]], [[V1S]] +// CHECK: [[RES:%.*]] = bitcast i64 [[RES64]] to double +// CHECK: ret double [[RES]] +extern "C" __device__ double test_source_double(int a, double c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr %1 to i64 +// CHECK: [[LO:%.*]] = trunc i64 [[P2I]] to i32 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = zext i32 [[RLO]] to i64 +// CHECK: [[SHR:%.*]] = lshr i64 [[P2I]], 32 +// CHECK: [[HI:%.*]] = trunc i64 [[SHR]] to i32 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = zext i32 [[RHI]] to i64 +// CHECK: [[V1S:%.*]] = shl i64 [[V1]], 32 +// CHECK: [[I64RES:%.*]] = or i64 [[V0]], [[V1S]] +// CHECK: [[PRES:%.*]] = inttoptr i64 [[I64RES]] to ptr +// CHECK: ret ptr [[PRES]] +extern "C" __device__ void* test_source_ptr(int a, void* c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// Address space 1 (global) pointer: 64-bit, split into 2 words +// CHECK-LABEL: define{{.*}}@test_source_global_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr addrspace(1) %1 to i64 +// CHECK: [[LO:%.*]] = trunc i64 [[P2I]] to i32 +// CHECK: [[RLO:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[LO]]) +// CHECK: [[V0:%.*]] = zext i32 [[RLO]] to i64 +// CHECK: [[SHR:%.*]] = lshr i64 [[P2I]], 32 +// CHECK: [[HI:%.*]] = trunc i64 [[SHR]] to i32 +// CHECK: [[RHI:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[HI]]) +// CHECK: [[V1:%.*]] = zext i32 [[RHI]] to i64 +// CHECK: [[V1S:%.*]] = shl i64 [[V1]], 32 +// CHECK: [[I64RES:%.*]] = or i64 [[V0]], [[V1S]] +// CHECK: [[PRES:%.*]] = inttoptr i64 [[I64RES]] to ptr addrspace(1) +// CHECK: ret ptr addrspace(1) [[PRES]] +extern "C" __device__ int __attribute__((address_space(1)))* test_source_global_ptr(int a, int __attribute__((address_space(1)))* c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// Address space 3 (LDS) pointer: 32-bit, single bpermute +// CHECK-LABEL: define{{.*}}@test_source_local_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr addrspace(3) %1 to i32 +// CHECK: [[PERM:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[P2I]]) +// CHECK: [[PRES:%.*]] = inttoptr i32 [[PERM]] to ptr addrspace(3) +// CHECK: ret 
ptr addrspace(3) [[PRES]] +extern "C" __device__ int __attribute__((address_space(3)))* test_source_local_ptr(int a, int __attribute__((address_space(3)))* c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_double2 +// CHECK: [[BC:%.*]] = bitcast <2 x double> %1 to i128 +// CHECK: [[W0:%.*]] = trunc i128 [[BC]] to i32 +// CHECK: [[R0:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[W0]]) +// CHECK: [[Z0:%.*]] = zext i32 [[R0]] to i128 +// CHECK: [[SHR1:%.*]] = lshr i128 [[BC]], 32 +// CHECK: [[W1:%.*]] = trunc i128 [[SHR1]] to i32 +// CHECK: [[R1:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[W1]]) +// CHECK: [[Z1:%.*]] = zext i32 [[R1]] to i128 +// CHECK: [[S1:%.*]] = shl i128 [[Z1]], 32 +// CHECK: [[OR1:%.*]] = or i128 [[Z0]], [[S1]] +// CHECK: [[SHR2:%.*]] = lshr i128 [[BC]], 64 +// CHECK: [[W2:%.*]] = trunc i128 [[SHR2]] to i32 +// CHECK: [[R2:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[W2]]) +// CHECK: [[Z2:%.*]] = zext i32 [[R2]] to i128 +// CHECK: [[S2:%.*]] = shl i128 [[Z2]], 64 +// CHECK: [[OR2:%.*]] = or i128 [[OR1]], [[S2]] +// CHECK: [[SHR3:%.*]] = lshr i128 [[BC]], 96 +// CHECK: [[W3:%.*]] = trunc i128 [[SHR3]] to i32 +// CHECK: [[R3:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[W3]]) +// CHECK: [[Z3:%.*]] = zext i32 [[R3]] to i128 +// CHECK: [[S3:%.*]] = shl i128 [[Z3]], 96 +// CHECK: [[OR3:%.*]] = or i128 [[OR2]], [[S3]] +// CHECK: [[RES:%.*]] = bitcast i128 [[OR3]] to <2 x double> +// CHECK: ret <2 x double> [[RES]] +extern "C" __device__ double2 test_source_double2(int a, double2 c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_template_float_src +// CHECK: [[BITCAST:%.*]] = bitcast float %1 to i32 +// CHECK: [[CALL:%.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %0, i32 [[BITCAST]]) +// CHECK: [[RESULT:%.*]] = bitcast i32 [[CALL]] to float +// CHECK: ret float [[RESULT]] +template<typename T> +__device__ T test_template_src(int a, T c) { + return __builtin_amdgcn_ds_bpermute(a, c); +} + +extern "C" __device__ float test_template_float_src(int a, float c) { + return test_template_src<float>(a, c); +} + +// CHECK-LABEL: define{{.*}}@test_source_nested( +// Materialize src aggregate, copy to bperm.src, load as i96 +// CHECK: call void @llvm.memcpy.p0.p0.i64(ptr align 4 %bperm.src.ascast, ptr align 4 %src1, i64 12, i1 false) +// CHECK: %[[SRCINT:.*]] = load i96, ptr %bperm.src.ascast, align 4 +// Word 0 +// CHECK: %[[W0:.*]] = trunc i96 %[[SRCINT]] to i32 +// CHECK: %[[R0:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W0]]) +// CHECK: %[[Z0:.*]] = zext i32 %[[R0]] to i96 +// Word 1 +// CHECK: %[[SHR1:.*]] = lshr i96 %[[SRCINT]], 32 +// CHECK: %[[W1:.*]] = trunc i96 %[[SHR1]] to i32 +// CHECK: %[[R1:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W1]]) +// CHECK: %[[Z1:.*]] = zext i32 %[[R1]] to i96 +// CHECK: %[[S1:.*]] = shl i96 %[[Z1]], 32 +// CHECK: %[[OR1:.*]] = or i96 %[[Z0]], %[[S1]] +// Word 2 +// CHECK: %[[SHR2:.*]] = lshr i96 %[[SRCINT]], 64 +// CHECK: %[[W2:.*]] = trunc i96 %[[SHR2]] to i32 +// CHECK: %[[R2:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W2]]) +// CHECK: %[[Z2:.*]] = zext i32 %[[R2]] to i96 +// CHECK: %[[S2:.*]] = shl i96 %[[Z2]], 64 +// CHECK: %[[OR2:.*]] = or i96 %[[OR1]], %[[S2]] +// Store result and return +// CHECK: store i96 %[[OR2]], ptr addrspace(5) %retval, align 4 +// CHECK: %[[RES:.*]] = load %struct.Outer, ptr addrspace(5) %retval, align 4 +// CHECK: ret 
%struct.Outer %[[RES]] +extern "C" __device__ Outer test_source_nested(int idx, Outer src) { + return __builtin_amdgcn_ds_bpermute(idx, src); +} + +// CHECK-LABEL: define{{.*}}@test_source_packed3( +// 3-byte packed struct: load as i24, zext to i32, permute, trunc back to i24 +// CHECK: call void @llvm.memcpy.p0.p0.i64(ptr align 1 %bperm.src.ascast, ptr align 1 %src1, i64 3, i1 false) +// CHECK: %[[SRCINT:.*]] = load i24, ptr %bperm.src.ascast, align 1 +// CHECK: %[[ZEXT:.*]] = zext i24 %[[SRCINT]] to i32 +// CHECK: %[[PERM:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[ZEXT]]) +// CHECK: %[[TRUNC:.*]] = trunc i32 %[[PERM]] to i24 +// CHECK: store i24 %[[TRUNC]], ptr addrspace(5) %retval, align 1 +extern "C" __device__ Three test_source_packed3(int idx, Three src) { + return __builtin_amdgcn_ds_bpermute(idx, src); +} + +// CHECK-LABEL: define{{.*}}@test_source_union( +// Copy to bperm.src, load as i32, permute, store back +// CHECK: call void @llvm.memcpy.p0.p0.i64(ptr align 4 %bperm.src.ascast, ptr align 4 %src1, i64 4, i1 false) +// CHECK: %[[SRCINT:.*]] = load i32, ptr %bperm.src.ascast, align 4 +// CHECK: %[[PERM:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[SRCINT]]) +// CHECK: store i32 %[[PERM]], ptr addrspace(5) %retval, align 4 +// CHECK: %[[COERCE_OUT:.*]] = getelementptr inbounds{{.*}} %union.U, ptr addrspace(5) %retval, i32 0, i32 0 +// CHECK: %[[RES:.*]] = load i32, ptr addrspace(5) %[[COERCE_OUT]], align 4 +// CHECK: ret i32 %[[RES]] +extern "C" __device__ U test_source_union(int idx, U src) { + return __builtin_amdgcn_ds_bpermute(idx, src); +} + +// CHECK-LABEL: define{{.*}}{ double, double } @test_source_cdouble(i32 {{[^,]*}}, double noundef %src.coerce0, double noundef %src.coerce1) +// Emit complex to bperm.src, load as i128, permute 4 words, store back +// CHECK: store double %src1.real, ptr %bperm.src.ascast.realp, align 8 +// CHECK: store double %src1.imag, ptr %bperm.src.ascast.imagp, align 8 +// CHECK: %[[SRCINT:.*]] = load i128, ptr %bperm.src.ascast, align 8 +// Word 0 +// CHECK: %[[W0:.*]] = trunc i128 %[[SRCINT]] to i32 +// CHECK: %[[R0:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W0]]) +// CHECK: %[[Z0:.*]] = zext i32 %[[R0]] to i128 +// Word 1 +// CHECK: %[[SHR1:.*]] = lshr i128 %[[SRCINT]], 32 +// CHECK: %[[W1:.*]] = trunc i128 %[[SHR1]] to i32 +// CHECK: %[[R1:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W1]]) +// CHECK: %[[Z1:.*]] = zext i32 %[[R1]] to i128 +// CHECK: %[[S1:.*]] = shl i128 %[[Z1]], 32 +// CHECK: %[[OR1:.*]] = or i128 %[[Z0]], %[[S1]] +// Word 2 +// CHECK: %[[SHR2:.*]] = lshr i128 %[[SRCINT]], 64 +// CHECK: %[[W2:.*]] = trunc i128 %[[SHR2]] to i32 +// CHECK: %[[R2:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W2]]) +// CHECK: %[[Z2:.*]] = zext i32 %[[R2]] to i128 +// CHECK: %[[S2:.*]] = shl i128 %[[Z2]], 64 +// CHECK: %[[OR2:.*]] = or i128 %[[OR1]], %[[S2]] +// Word 3 +// CHECK: %[[SHR3:.*]] = lshr i128 %[[SRCINT]], 96 +// CHECK: %[[W3:.*]] = trunc i128 %[[SHR3]] to i32 +// CHECK: %[[R3:.*]] = call i32 @llvm.amdgcn.ds.bpermute(i32 %{{[0-9]+}}, i32 %[[W3]]) +// CHECK: %[[Z3:.*]] = zext i32 %[[R3]] to i128 +// CHECK: %[[S3:.*]] = shl i128 %[[Z3]], 96 +// CHECK: %[[OR3:.*]] = or i128 %[[OR2]], %[[S3]] +// Store result integer to agg.tmp, reconstruct complex, return +// CHECK: store i128 %[[OR3]], ptr %agg.tmp.ascast, align 8 +// CHECK: %[[AGG_REAL:.*]] = load double, ptr %agg.tmp.ascast.realp, align 8 +// CHECK: %[[AGG_IMAG:.*]] = load double, ptr 
%agg.tmp.ascast.imagp, align 8 +// CHECK: store double %[[AGG_REAL]], ptr addrspace(5) %retval.realp, align 8 +// CHECK: store double %[[AGG_IMAG]], ptr addrspace(5) %retval.imagp, align 8 +// CHECK: %[[RETVAL:.*]] = load { double, double }, ptr addrspace(5) %retval, align 8 +// CHECK: ret { double, double } %[[RETVAL]] +extern "C" __device__ _Complex double test_source_cdouble(int idx, _Complex double src) { + return __builtin_amdgcn_ds_bpermute(idx, src); +} diff --git a/clang/test/CodeGenOpenCL/builtin-amdgcn-ds-bpermute.cl b/clang/test/CodeGenOpenCL/builtin-amdgcn-ds-bpermute.cl new file mode 100644 index 0000000000000..4444bfb06d025 --- /dev/null +++ b/clang/test/CodeGenOpenCL/builtin-amdgcn-ds-bpermute.cl @@ -0,0 +1,72 @@ +// REQUIRES: amdgpu-registered-target +// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-unknown-unknown -target-cpu gfx1200 \ +// RUN: -emit-llvm -o - %s | FileCheck %s + +#pragma OPENCL EXTENSION cl_khr_fp64 : enable + +// CHECK-LABEL: @test_int +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 %b) +void test_int(global int* out, int a, int b) { + *out = __builtin_amdgcn_ds_bpermute(a, b); +} + +// CHECK-LABEL: @test_float +// CHECK: [[BC:%.*]] = bitcast float %b to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[BC]]) +void test_float(global float* out, int a, float b) { + *out = __builtin_amdgcn_ds_bpermute(a, b); +} + +// CHECK-LABEL: @test_long +// CHECK: [[LO:%.*]] = trunc i64 %b to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[LO]]) +// CHECK: [[SHR:%.*]] = lshr i64 %b, 32 +// CHECK: [[HI:%.*]] = trunc {{.*}}i64 [[SHR]] to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[HI]]) +void test_long(global long* out, int a, long b) { + *out = __builtin_amdgcn_ds_bpermute(a, b); +} + +// CHECK-LABEL: @test_double +// CHECK: [[BC:%.*]] = bitcast double %b to i64 +// CHECK: [[LO:%.*]] = trunc i64 [[BC]] to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[LO]]) +// CHECK: [[SHR:%.*]] = lshr i64 [[BC]], 32 +// CHECK: [[HI:%.*]] = trunc {{.*}}i64 [[SHR]] to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[HI]]) +void test_double(global double* out, int a, double b) { + *out = __builtin_amdgcn_ds_bpermute(a, b); +} + +// Global pointer: 64-bit (address space 1), split into 2 words +// CHECK-LABEL: @test_global_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr addrspace(1) %b to i64 +// CHECK: [[LO:%.*]] = trunc i64 [[P2I]] to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[LO]]) +// CHECK: [[SHR:%.*]] = lshr i64 [[P2I]], 32 +// CHECK: [[HI:%.*]] = trunc {{.*}}i64 [[SHR]] to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[HI]]) +void test_global_ptr(global long* out, int a, global int* b) { + global int* res = __builtin_amdgcn_ds_bpermute(a, b); + *out = (long)res; +} + +// Local pointer: 32-bit (address space 3), single bpermute +// CHECK-LABEL: @test_local_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr addrspace(3) %b to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[P2I]]) +// CHECK-NOT: lshr +void test_local_ptr(global int* out, int a, local int* b) { + local int* res = __builtin_amdgcn_ds_bpermute(a, b); + *out = (int)(long)res; +} + +// Private pointer: 32-bit (address space 5), single bpermute +// CHECK-LABEL: @test_private_ptr +// CHECK: [[P2I:%.*]] = ptrtoint ptr addrspace(5) %b to i32 +// CHECK: {{.*}}call i32 @llvm.amdgcn.ds.bpermute(i32 %a, i32 [[P2I]]) +// CHECK-NOT: lshr +void 
test_private_ptr(global int* out, int a, private int* b) {
+  private int* res = __builtin_amdgcn_ds_bpermute(a, b);
+  *out = (int)(long)res;
+}
diff --git a/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
new file mode 100644
index 0000000000000..9823bc1dc6719
--- /dev/null
+++ b/clang/test/SemaHIP/builtin-amdgcn-ds-bpermute.hip
@@ -0,0 +1,37 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -fsyntax-only -verify -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -Wno-unused-value %s
+// RUN: %clang_cc1 -fsyntax-only -verify -triple x86_64-pc-linux-gnu -aux-triple amdgcn -Wno-unused-value %s
+#define __device__ __attribute__((device))
+
+typedef short short2 __attribute__((ext_vector_type(2)));
+typedef int int2 __attribute__((ext_vector_type(2)));
+
+struct A { int a; };
+
+__device__ void test_invalid_num_args(int a, int b, int c) {
+  __builtin_amdgcn_ds_bpermute(a, b, c); // expected-error {{too many arguments to function call, expected 2, have 3}}
+}
+
+__device__ void test_invalid_index(short2 a, int b) {
+  __builtin_amdgcn_ds_bpermute(a, b); // expected-error {{converting 'short2' (vector of 2 'short' values) to incompatible type 'int'}}
+}
+
+__device__ void test_warn_long_index(long a, int b) {
+  __builtin_amdgcn_ds_bpermute(a, b); // expected-warning {{the 64-bit argument will be truncated to 32 bits in this call to __builtin_amdgcn_ds_bpermute}}
+}
+
+__device__ void test_vector_src(int a, int2 b) {
+  __builtin_amdgcn_ds_bpermute(a, b);
+}
+
+__device__ void test_long_src(int a, long b) {
+  __builtin_amdgcn_ds_bpermute(a, b);
+}
+
+__device__ void test_pointer_src(int a, void* b) {
+  __builtin_amdgcn_ds_bpermute(a, b);
+}
+
+__device__ void test_struct_src(int a, A b) {
+  __builtin_amdgcn_ds_bpermute(a, b);
+}
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 66591519de73e..94af13d2ffc48 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2779,7 +2779,6 @@ def int_amdgcn_ds_permute :
 
 // llvm.amdgcn.ds.bpermute <index> <src>
 def int_amdgcn_ds_bpermute :
-  ClangBuiltin<"__builtin_amdgcn_ds_bpermute">,
   Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
             [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
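A minimal usage sketch of the extended builtin with an aggregate payload. The Pair struct,
kernel name, and launch setup are illustrative assumptions, not part of the patch; the only
requirement is that the payload type is trivially copyable.

  #include <hip/hip_runtime.h>

  struct Pair { float key; int value; };  // 8 bytes: permuted as two 32-bit words

  // Each lane pulls the Pair held by the lane selected in src_lane[tid];
  // src_lane values are expected to be in [0, warpSize).
  __global__ void gather_pairs(Pair* data, const int* src_lane, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Pair mine = data[tid];
    int offset = src_lane[tid] * 4;  // byte-based index: lane * 4
    // With this patch the return type matches the source type (Pair here).
    // The exchange is purely bitwise; no constructors are invoked.
    Pair theirs = __builtin_amdgcn_ds_bpermute(offset, mine);
    data[tid] = theirs;
  }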
