https://github.com/yxsamliu created
https://github.com/llvm/llvm-project/pull/201607
This relands the compiler portion of #177665 ("[PGO][AMDGPU] Add basic HIP
offload PGO support"), which was reverted in #201416, and adds the driver
wiring needed for the new separate device-profile runtime library.
The first commit restores the reverted compiler state verbatim: the AMDGPU
instrumentation lowering in LLVM (InstrProfiling.cpp and
PGOInstrumentation.cpp), the HIP device-variable and shadow registration in
Clang codegen (CGCUDANV.cpp), the supporting InstrProf.h and RuntimeLibcalls.td
changes, and their tests. This part depends on the runtime ABI provided by the
compiler-rt change, which is relanded separately.
The second commit adds the driver wiring to link the ROCm device-profile
collection runtime. When building HIP with device PGO (-fprofile-generate or
-fprofile-instr-generate) the host program must drain device-side profile
counters. That collection runtime now lives in the separate
clang_rt.profile_rocm library, so the driver must link it into the host image
in addition to the base clang_rt.profile. This is done in addOffloadRTLibs,
which already knows the active offload kinds and links libamdhip64, for both
the Linux/GNU and MSVC host toolchains. It is gated on the HIP offload kind,
needsProfileRT, and the library being present in the resource directory.
Non-HIP profile links and HIP links without profiling are unaffected.
This change depends on the compiler-rt change that adds clang_rt.profile_rocm.
>From ac95c63cf43e531a61c446b94d50531da22de876 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Thu, 4 Jun 2026 10:28:01 -0400
Subject: [PATCH 1/2] Reland HIP offload PGO compiler support (LLVM/Clang part
of #177665)
Relands the compiler portion of #177665 ("[PGO][AMDGPU] Add basic HIP offload
PGO support"), which was reverted in #201416: the AMDGPU instrumentation
lowering in LLVM (InstrProfiling.cpp / PGOInstrumentation.cpp), the HIP
device-variable / shadow registration in Clang codegen (CGCUDANV.cpp), the
supporting InstrProf.h / RuntimeLibcalls.td changes, and their tests.
This commit restores the reverted state verbatim. It depends on the runtime ABI
provided by the compiler-rt part (relanded separately); the next commit adds the
driver wiring to link the ROCm device-profile library for HIP
-fprofile-generate.
---
clang/lib/CodeGen/CGCUDANV.cpp | 152 ++++++++++++++++++
.../test/CodeGenHIP/offload-pgo-sections.hip | 50 ++++++
llvm/include/llvm/IR/RuntimeLibcalls.td | 9 +-
llvm/include/llvm/ProfileData/InstrProf.h | 5 +
.../Instrumentation/InstrProfiling.cpp | 109 ++++++++++---
.../Instrumentation/PGOInstrumentation.cpp | 3 +-
.../InstrProfiling/amdgpu-instrumentation.ll | 32 ++++
.../InstrProfiling/amdgpu-profc-arrays.ll | 26 +++
.../InstrProfiling/gpu-weak.ll | 36 +++++
.../amdgpu-disable-value-profiling.ll | 22 +++
10 files changed, 415 insertions(+), 29 deletions(-)
create mode 100644 clang/test/CodeGenHIP/offload-pgo-sections.hip
create mode 100644
llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
create mode 100644
llvm/test/Instrumentation/InstrProfiling/amdgpu-profc-arrays.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
create mode 100644
llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
diff --git a/clang/lib/CodeGen/CGCUDANV.cpp b/clang/lib/CodeGen/CGCUDANV.cpp
index 259b6c040706b..65f398af7902b 100644
--- a/clang/lib/CodeGen/CGCUDANV.cpp
+++ b/clang/lib/CodeGen/CGCUDANV.cpp
@@ -28,6 +28,7 @@
#include "llvm/IR/ReplaceConstant.h"
#include "llvm/Support/Format.h"
#include "llvm/Support/VirtualFileSystem.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
using namespace clang;
using namespace CodeGen;
@@ -72,6 +73,11 @@ class CGNVCUDARuntime : public CGCUDARuntime {
/// ModuleCtorFunction() and used to create corresponding cleanup calls in
/// ModuleDtorFunction()
llvm::GlobalVariable *GpuBinaryHandle = nullptr;
+ /// Host-side shadow for the per-TU __llvm_profile_sections_<CUID> global,
+ /// emitted only for HIP host compiles when PGO is on. Registered via
+ /// __hipRegisterVar (non-RDC) or an offloading entry (RDC) so the runtime
+ /// can locate the device-side table by name.
+ llvm::GlobalVariable *OffloadProfShadow = nullptr;
/// Whether we generate relocatable device code.
bool RelocatableDeviceCode;
/// Mangle context for device.
@@ -176,6 +182,13 @@ class CGNVCUDARuntime : public CGCUDARuntime {
void transformManagedVars();
/// Create offloading entries to register globals in RDC mode.
void createOffloadingEntries();
+ /// For HIP+PGO, emit the per-TU __llvm_profile_sections_<CUID> global.
+ /// On the device side it is the populated 7-pointer section-bounds table.
+ /// On the host side it is a placeholder void* shadow stored in
+ /// OffloadProfShadow, registered later by makeRegisterGlobalsFn (non-RDC)
+ /// or createOffloadingEntries (RDC) so the runtime can locate the
+ /// device-side table by name.
+ void emitOffloadProfilingSections();
public:
CGNVCUDARuntime(CodeGenModule &CGM);
@@ -735,6 +748,32 @@ llvm::Function *CGNVCUDARuntime::makeRegisterGlobalsFn() {
}
}
+ // Register the per-TU offload-profiling shadow so the host runtime can
+ // locate the matching device-side __llvm_profile_sections_<CUID>. We
+ // emit both __hipRegisterVar (so the HIP runtime can map the host
+ // shadow to the device symbol) and
+ // __llvm_profile_offload_register_shadow_variable (so the profile
+ // runtime adds the shadow to its drain list).
+ if (OffloadProfShadow) {
+ llvm::Constant *Name =
+ makeConstantString(std::string(OffloadProfShadow->getName()));
+ llvm::Value *RegisterVarArgs[] = {
+ &GpuBinaryHandlePtr,
+ OffloadProfShadow,
+ Name,
+ Name,
+ llvm::ConstantInt::get(IntTy, /*Extern=*/0),
+ llvm::ConstantInt::get(VarSizeTy,
CGM.getDataLayout().getPointerSize()),
+ llvm::ConstantInt::get(IntTy, /*Constant=*/0),
+ llvm::ConstantInt::get(IntTy, 0)};
+ Builder.CreateCall(RegisterVar, RegisterVarArgs);
+
+ llvm::FunctionCallee RegisterShadow = CGM.CreateRuntimeFunction(
+ llvm::FunctionType::get(VoidTy, {PtrTy}, false),
+ "__llvm_profile_offload_register_shadow_variable");
+ Builder.CreateCall(RegisterShadow, {OffloadProfShadow});
+ }
+
Builder.CreateRetVoid();
return RegisterKernelsFunc;
}
@@ -1256,11 +1295,124 @@ void CGNVCUDARuntime::createOffloadingEntries() {
I.Flags.getSurfTexType());
}
}
+
+ // Register the per-TU offload-profiling shadow. The offloading entry
+ // makes the linker-wrapper emit the host __hipRegisterVar call in the
+ // combined ctor. Separately emit a per-TU ctor that registers the
+ // shadow with the profile runtime's drain list.
+ if (OffloadProfShadow) {
+ llvm::offloading::emitOffloadingEntry(
+ M, Kind, OffloadProfShadow, OffloadProfShadow->getName(),
+ CGM.getDataLayout().getPointerSize(),
+ llvm::offloading::OffloadGlobalEntry, /*Data=*/0);
+
+ llvm::LLVMContext &Ctx = M.getContext();
+ auto *PtrTy = llvm::PointerType::getUnqual(Ctx);
+ llvm::FunctionCallee RegisterShadow = CGM.CreateRuntimeFunction(
+ llvm::FunctionType::get(VoidTy, {PtrTy}, false),
+ "__llvm_profile_offload_register_shadow_variable");
+ auto *CtorFn = llvm::Function::Create(
+ llvm::FunctionType::get(VoidTy, false),
+ llvm::GlobalValue::InternalLinkage,
+ "__llvm_profile_register_shadow." + CGM.getContext().getCUIDHash(),
&M);
+ auto *Entry = llvm::BasicBlock::Create(Ctx, "entry", CtorFn);
+ llvm::IRBuilder<> B(Entry);
+ B.CreateCall(RegisterShadow, {OffloadProfShadow});
+ B.CreateRetVoid();
+ llvm::appendToGlobalCtors(M, CtorFn, /*Priority=*/65535);
+ }
+}
+
+// For HIP host+device compiles with PGO enabled, emit the per-TU global
+// __llvm_profile_sections_<CUID>. Device side: a 7-pointer struct holding
+// section start/stop bounds for the names/counters/data sections plus the
+// raw-version variable. Host side: an opaque void* shadow whose only
+// purpose is to give the host-runtime a registered symbol name to look up
+// via hipGetSymbolAddress; the actual device-side data lives in the
+// matching device-side global.
+void CGNVCUDARuntime::emitOffloadProfilingSections() {
+ if (!CGM.getLangOpts().HIP)
+ return;
+ if (!CGM.getCodeGenOpts().hasProfileInstr())
+ return;
+
+ StringRef CUIDHash = CGM.getContext().getCUIDHash();
+ if (CUIDHash.empty())
+ return;
+
+ llvm::Module &M = CGM.getModule();
+ llvm::LLVMContext &Ctx = M.getContext();
+ std::string Name = ("__llvm_profile_sections_" + CUIDHash).str();
+
+ // If the global already exists (e.g. another TU was merged in), don't
+ // duplicate it.
+ if (M.getNamedValue(Name))
+ return;
+
+ if (CGM.getLangOpts().CUDAIsDevice) {
+ // Device side: emit the populated struct. Section start/stop symbols
+ // are linker-defined (ELF auto-generates __start_/__stop_ for any
+ // section whose name is a valid C identifier; AMDGPU is ELF).
+ unsigned GlobalAS = M.getDataLayout().getDefaultGlobalsAddressSpace();
+ auto *PtrTy = llvm::PointerType::get(Ctx, GlobalAS);
+ auto getOrDeclare = [&](StringRef SymName) {
+ if (auto *GV = M.getNamedGlobal(SymName))
+ return GV;
+ auto *GV = new llvm::GlobalVariable(
+ M, llvm::Type::getInt8Ty(Ctx), /*isConstant=*/false,
+ llvm::GlobalValue::ExternalLinkage, /*Initializer=*/nullptr, SymName,
+ /*InsertBefore=*/nullptr, llvm::GlobalValue::NotThreadLocal,
+ GlobalAS);
+ GV->setVisibility(llvm::GlobalValue::HiddenVisibility);
+ return GV;
+ };
+ auto *VersionGV = M.getNamedGlobal("__llvm_profile_raw_version");
+ if (!VersionGV) {
+ VersionGV = new llvm::GlobalVariable(
+ M, llvm::Type::getInt64Ty(Ctx), /*isConstant=*/true,
+ llvm::GlobalValue::ExternalLinkage, /*Initializer=*/nullptr,
+ "__llvm_profile_raw_version",
+ /*InsertBefore=*/nullptr, llvm::GlobalValue::NotThreadLocal,
+ GlobalAS);
+ }
+
+ auto *StructTy = llvm::StructType::get(
+ Ctx, {PtrTy, PtrTy, PtrTy, PtrTy, PtrTy, PtrTy, PtrTy});
+ llvm::Constant *Fields[] = {
+ getOrDeclare("__start___llvm_prf_names"),
+ getOrDeclare("__stop___llvm_prf_names"),
+ getOrDeclare("__start___llvm_prf_cnts"),
+ getOrDeclare("__stop___llvm_prf_cnts"),
+ getOrDeclare("__start___llvm_prf_data"),
+ getOrDeclare("__stop___llvm_prf_data"),
+ VersionGV,
+ };
+ auto *Init = llvm::ConstantStruct::get(StructTy, Fields);
+ auto *GV = new llvm::GlobalVariable(
+ M, StructTy, /*isConstant=*/true, llvm::GlobalValue::ExternalLinkage,
+ Init, Name, /*InsertBefore=*/nullptr,
llvm::GlobalValue::NotThreadLocal,
+ GlobalAS);
+ GV->setVisibility(llvm::GlobalValue::ProtectedVisibility);
+ CGM.addCompilerUsedGlobal(GV);
+ return;
+ }
+
+ // Host side: emit an opaque void* shadow. Layout doesn't matter — the
+ // runtime locates it by name via hipGetSymbolAddress and treats it as
+ // the address of the device-side struct. Registration with the HIP
+ // runtime is added by makeRegisterGlobalsFn (non-RDC) or
+ // createOffloadingEntries (RDC).
+ auto *PtrTy = llvm::PointerType::getUnqual(Ctx);
+ OffloadProfShadow = new llvm::GlobalVariable(
+ M, PtrTy, /*isConstant=*/false, llvm::GlobalValue::ExternalLinkage,
+ llvm::ConstantPointerNull::get(PtrTy), Name);
+ CGM.addCompilerUsedGlobal(OffloadProfShadow);
}
// Returns module constructor to be added.
llvm::Function *CGNVCUDARuntime::finalizeModule() {
transformManagedVars();
+ emitOffloadProfilingSections();
if (CGM.getLangOpts().CUDAIsDevice) {
// Mark ODR-used device variables as compiler used to prevent it from being
// eliminated by optimization. This is necessary for device variables
diff --git a/clang/test/CodeGenHIP/offload-pgo-sections.hip
b/clang/test/CodeGenHIP/offload-pgo-sections.hip
new file mode 100644
index 0000000000000..17c6fe7b9e609
--- /dev/null
+++ b/clang/test/CodeGenHIP/offload-pgo-sections.hip
@@ -0,0 +1,50 @@
+// REQUIRES: amdgpu-registered-target
+// REQUIRES: x86-registered-target
+
+// Verify CGCUDANV emits the per-TU __llvm_profile_sections_<CUID> global
+// for HIP+PGO compilations. Device subcompile: populated 7-pointer struct
+// in addrspace(1). Host compile: void* shadow registered with the HIP
+// runtime and with the profile runtime's drain list.
+
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -cuid=abc \
+// RUN: -fprofile-instrument=clang -emit-llvm -o - -x hip %s \
+// RUN: | FileCheck -check-prefix=DEV %s
+
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -cuid=abc \
+// RUN: -fprofile-instrument=clang -emit-llvm -o - -x hip %s \
+// RUN: | FileCheck -check-prefix=HOST %s
+
+// Guard: no PGO -> no emission.
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -cuid=abc \
+// RUN: -emit-llvm -o - -x hip %s \
+// RUN: | FileCheck -check-prefix=NONE %s
+
+// Guard: no CUID -> no emission.
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device \
+// RUN: -fprofile-instrument=clang -emit-llvm -o - -x hip %s \
+// RUN: | FileCheck -check-prefix=NONE %s
+
+#define __device__ __attribute__((device))
+#define __global__ __attribute__((global))
+
+__device__ int helper(int x) { return x + 1; }
+__global__ void kernel(int *p) { *p = helper(*p); }
+
+// DEV-DAG: @__start___llvm_prf_names = external hidden addrspace(1) global i8
+// DEV-DAG: @__stop___llvm_prf_names = external hidden addrspace(1) global i8
+// DEV-DAG: @__start___llvm_prf_cnts = external hidden addrspace(1) global i8
+// DEV-DAG: @__stop___llvm_prf_cnts = external hidden addrspace(1) global i8
+// DEV-DAG: @__start___llvm_prf_data = external hidden addrspace(1) global i8
+// DEV-DAG: @__stop___llvm_prf_data = external hidden addrspace(1) global i8
+// DEV-DAG: @__llvm_profile_raw_version = external addrspace(1) constant i64
+// DEV: @__llvm_profile_sections_[[CUID:[0-9a-f]+]] = protected addrspace(1)
constant
{{.*}}@__start___llvm_prf_names{{.*}}@__stop___llvm_prf_names{{.*}}@__start___llvm_prf_cnts{{.*}}@__stop___llvm_prf_cnts{{.*}}@__start___llvm_prf_data{{.*}}@__stop___llvm_prf_data{{.*}}@__llvm_profile_raw_version
+// DEV: @llvm.compiler.used = {{.*}}@__llvm_profile_sections_[[CUID]]
+
+// HOST: @__llvm_profile_sections_[[CUID:[0-9a-f]+]] = global ptr null
+// HOST: @llvm.compiler.used = {{.*}}@__llvm_profile_sections_[[CUID]]
+// HOST: define internal void @__hip_register_globals
+// HOST: call void @__hipRegisterVar({{.*}}@__llvm_profile_sections_[[CUID]],
+// HOST: call void @__llvm_profile_offload_register_shadow_variable(ptr
@__llvm_profile_sections_[[CUID]])
+
+// NONE-NOT: __llvm_profile_sections_
+// NONE-NOT: __llvm_profile_offload_register_shadow_variable
diff --git a/llvm/include/llvm/IR/RuntimeLibcalls.td
b/llvm/include/llvm/IR/RuntimeLibcalls.td
index cbdc48a9a717f..37bad559f49e7 100644
--- a/llvm/include/llvm/IR/RuntimeLibcalls.td
+++ b/llvm/include/llvm/IR/RuntimeLibcalls.td
@@ -500,6 +500,9 @@ def DEOPTIMIZE : RuntimeLibcall;
// Return address
def RETURN_ADDRESS : RuntimeLibcall;
+// GPU profiling
+def PROFILE_INSTRUMENT_GPU : RuntimeLibcall;
+
// Clear cache
def CLEAR_CACHE : RuntimeLibcall;
def RISCV_FLUSH_ICACHE : RuntimeLibcall;
@@ -2241,8 +2244,10 @@ def WindowsARM64ECSystemLibrary
def isAMDGPU : RuntimeLibcallPredicate<"TT.isAMDGPU()">;
-// No calls.
-def AMDGPUSystemLibrary : SystemRuntimeLibrary<isAMDGPU, (add)>;
+def __llvm_profile_instrument_gpu : RuntimeLibcallImpl<PROFILE_INSTRUMENT_GPU>;
+
+def AMDGPUSystemLibrary
+ : SystemRuntimeLibrary<isAMDGPU, (add __llvm_profile_instrument_gpu)>;
//===----------------------------------------------------------------------===//
// ARM Runtime Libcalls
diff --git a/llvm/include/llvm/ProfileData/InstrProf.h
b/llvm/include/llvm/ProfileData/InstrProf.h
index dffc58281c2d9..b7f917ec39b70 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -122,6 +122,11 @@ inline StringRef getInstrProfValueProfMemOpFuncName() {
/// Return the prefix of the name of the variables to function as a filter.
inline StringRef getInstrProfVarPrefix() { return "__prof"; }
+/// Return the name of the GPU wave-cooperative counter increment helper.
+inline StringRef getInstrProfInstrumentGPUFuncName() {
+ return INSTR_PROF_INSTRUMENT_GPU_FUNC_STR;
+}
+
/// Return the name prefix of variables containing instrumented function names.
inline StringRef getInstrProfNameVarPrefix() { return "__profn_"; }
diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index dabd495cddd49..8e4ba41919768 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -23,6 +23,7 @@
#include "llvm/Analysis/CFG.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
+#include "llvm/Frontend/Offloading/Utility.h"
#include "llvm/IR/Attributes.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
@@ -33,14 +34,17 @@
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
+#include "llvm/IR/GlobalAlias.h"
#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/MDBuilder.h"
#include "llvm/IR/Module.h"
+#include "llvm/IR/RuntimeLibcalls.h"
#include "llvm/IR/Type.h"
#include "llvm/Pass.h"
#include "llvm/ProfileData/InstrProf.h"
@@ -287,6 +291,8 @@ class InstrLowerer final {
GlobalVariable *NamesVar = nullptr;
size_t NamesSize = 0;
+ StructType *ProfileDataTy = nullptr;
+
// vector of counter load/store pairs to be register promoted.
std::vector<LoadStorePair> PromotionCandidates;
@@ -407,6 +413,9 @@ class InstrLowerer final {
/// Create a static initializer for our data, on platforms that need it,
/// and for any profile output file that was specified.
void emitInitialization();
+
+ /// Return the __llvm_profile_data struct type.
+ StructType *getProfileDataTy();
};
///
@@ -1190,19 +1199,22 @@ void InstrLowerer::lowerTimestamp(
void InstrLowerer::lowerIncrement(InstrProfIncrementInst *Inc) {
auto *Addr = getCounterAddress(Inc);
-
IRBuilder<> Builder(Inc);
if (isGPUProfTarget(M)) {
- auto *I64Ty = Builder.getInt64Ty();
+ auto *Int64Ty = Builder.getInt64Ty();
auto *PtrTy = Builder.getPtrTy();
auto *CalleeTy = FunctionType::get(Type::getVoidTy(M.getContext()),
- {PtrTy, PtrTy, I64Ty}, false);
- auto Callee =
- M.getOrInsertFunction("__llvm_profile_instrument_gpu", CalleeTy);
+ {PtrTy, PtrTy, Int64Ty}, false);
+ FunctionCallee Callee =
+ M.getOrInsertFunction(RTLIB::RuntimeLibcallsInfo::getLibcallImplName(
+ RTLIB::impl___llvm_profile_instrument_gpu),
+ CalleeTy);
Value *CastAddr = Builder.CreatePointerBitCastOrAddrSpaceCast(Addr, PtrTy);
Value *Uniform =
ConstantPointerNull::get(PointerType::getUnqual(M.getContext()));
- Builder.CreateCall(Callee, {CastAddr, Uniform, Inc->getStep()});
+ Value *StepI64 =
+ Builder.CreateZExtOrTrunc(Inc->getStep(), Int64Ty, "step.i64");
+ Builder.CreateCall(Callee, {CastAddr, Uniform, StepI64});
} else if (Options.Atomic || AtomicCounterUpdateAll ||
(Inc->getIndex()->isNullValue() && AtomicFirstCounter)) {
Builder.CreateAtomicRMW(AtomicRMWInst::Add, Addr, Inc->getStep(),
@@ -1400,6 +1412,12 @@ static inline Constant *getFuncAddrForProfData(Function
*Fn) {
if (shouldUsePublicSymbol(Fn))
return Fn;
+ // For GPU targets, weak functions cannot use private aliases because
+ // LTO may pick a different TU's copy, leaving the alias undefined
+ if (isGPUProfTarget(*Fn->getParent()) &&
+ GlobalValue::isWeakForLinker(Fn->getLinkage()))
+ return Fn;
+
// When possible use a private alias to avoid symbolic relocations.
auto *GA = GlobalAlias::create(GlobalValue::LinkageTypes::PrivateLinkage,
Fn->getName() + ".local", Fn);
@@ -1623,11 +1641,15 @@ GlobalVariable
*InstrLowerer::setupProfileSection(InstrProfInstBase *Inc,
}
Ptr->setVisibility(Visibility);
- // Put the counters and bitmaps in their own sections so linkers can
- // remove unneeded sections.
Ptr->setSection(getInstrProfSectionName(IPSK, TT.getObjectFormat()));
Ptr->setLinkage(Linkage);
- maybeSetComdat(Ptr, Fn, VarName);
+ if (isGPUProfTarget(M) && !Ptr->hasComdat()) {
+ Ptr->setComdat(M.getOrInsertComdat(VarName));
+ Ptr->setLinkage(GlobalValue::LinkOnceODRLinkage);
+ Ptr->setVisibility(GlobalValue::ProtectedVisibility);
+ } else {
+ maybeSetComdat(Ptr, Fn, VarName);
+ }
return Ptr;
}
@@ -1799,7 +1821,8 @@ void
InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
}
uint64_t NumCounters = Inc->getNumCounters()->getZExtValue();
- auto *CounterPtr = PD.RegionCounters;
+
+ Constant *CounterPtr = PD.RegionCounters;
uint64_t NumBitmapBytes = PD.NumBitmapBytes;
@@ -1807,11 +1830,7 @@ void
InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
auto *IntPtrTy = M.getDataLayout().getIntPtrType(M.getContext());
auto *Int16Ty = Type::getInt16Ty(Ctx);
auto *Int16ArrayTy = ArrayType::get(Int16Ty, IPVK_Last + 1);
- Type *DataTypes[] = {
-#define INSTR_PROF_DATA(Type, LLVMType, Name, Init) LLVMType,
-#include "llvm/ProfileData/InstrProfData.inc"
- };
- auto *DataTy = StructType::get(Ctx, ArrayRef(DataTypes));
+ auto *DataTy = getProfileDataTy();
Constant *FunctionAddr = getFuncAddrForProfData(Fn);
@@ -1819,6 +1838,15 @@ void
InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
for (uint32_t Kind = IPVK_First; Kind <= IPVK_Last; ++Kind)
Int16ArrayVals[Kind] = ConstantInt::get(Int16Ty, PD.NumValueSites[Kind]);
+ if (isGPUProfTarget(M)) {
+ // For GPU targets, weak functions need weak linkage for their profile data
+ // aliases to allow linker deduplication across TUs
+ if (GlobalValue::isWeakForLinker(Fn->getLinkage()))
+ Linkage = Fn->getLinkage();
+ else
+ Linkage = GlobalValue::ExternalLinkage;
+ Visibility = GlobalValue::ProtectedVisibility;
+ }
// If the data variable is not referenced by code (if we don't emit
// @llvm.instrprof.value.profile, NS will be 0), and the counter keeps the
// data variable live under linker GC, the data variable can be private. This
@@ -1830,19 +1858,22 @@ void
InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
// If profd is in a deduplicate comdat, NS==0 with a hash suffix guarantees
// that other copies must have the same CFG and cannot have value profiling.
// If no hash suffix, other profd copies may be referenced by code.
- if (NS == 0 && !(DataReferencedByCode && NeedComdat && !Renamed) &&
+ if (!isGPUProfTarget(M) && NS == 0 &&
+ !(DataReferencedByCode && NeedComdat && !Renamed) &&
(TT.isOSBinFormatELF() ||
(!DataReferencedByCode && TT.isOSBinFormatCOFF()))) {
Linkage = GlobalValue::PrivateLinkage;
Visibility = GlobalValue::DefaultVisibility;
}
- // AMDGPU objects are always ET_DYN, so non-local symbols with default
- // visibility are preemptible. The CounterPtr label difference emits a REL32
- // relocation that lld rejects against preemptible targets.
- if (TT.isAMDGPU() && !GlobalValue::isLocalLinkage(Linkage))
+ // GPU-target ELF objects are always ET_DYN, so non-local symbols with
+ // default visibility are preemptible. The CounterPtr label difference
+ // emits a REL32 relocation that lld rejects against preemptible targets.
+ if (TT.isGPU() && TT.isOSBinFormatELF() &&
+ !GlobalValue::isLocalLinkage(Linkage))
Visibility = GlobalValue::ProtectedVisibility;
auto *Data =
new GlobalVariable(M, DataTy, false, Linkage, nullptr, DataVarName);
+
Constant *RelativeCounterPtr;
GlobalVariable *BitmapPtr = PD.RegionBitmaps;
Constant *RelativeBitmapPtr = ConstantInt::get(IntPtrTy, 0);
@@ -1883,7 +1914,12 @@ void
InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
Data->setSection(
getInstrProfSectionName(DataSectionKind, TT.getObjectFormat()));
Data->setAlignment(Align(INSTR_PROF_DATA_ALIGNMENT));
- maybeSetComdat(Data, Fn, CntsVarName);
+ if (isGPUProfTarget(M) && !Data->hasComdat()) {
+ Data->setComdat(M.getOrInsertComdat(CntsVarName));
+ Data->setLinkage(GlobalValue::LinkOnceODRLinkage);
+ } else {
+ maybeSetComdat(Data, Fn, CntsVarName);
+ }
PD.DataVar = Data;
@@ -1961,16 +1997,18 @@ void InstrLowerer::emitNameData() {
auto &Ctx = M.getContext();
auto *NamesVal =
ConstantDataArray::getString(Ctx, StringRef(CompressedNameStr), false);
- NamesVar = new GlobalVariable(M, NamesVal->getType(), true,
- GlobalValue::PrivateLinkage, NamesVal,
- getInstrProfNamesVarName());
+ std::string NamesVarName = std::string(getInstrProfNamesVarName());
+ NamesVar =
+ new GlobalVariable(M, NamesVal->getType(), true,
+ GlobalValue::PrivateLinkage, NamesVal, NamesVarName);
NamesSize = CompressedNameStr.size();
setGlobalVariableLargeSection(TT, *NamesVar);
- NamesVar->setSection(
+ std::string NamesSectionName =
ProfileCorrelate == InstrProfCorrelator::BINARY
? getInstrProfSectionName(IPSK_covname, TT.getObjectFormat())
- : getInstrProfSectionName(IPSK_name, TT.getObjectFormat()));
+ : getInstrProfSectionName(IPSK_name, TT.getObjectFormat());
+ NamesVar->setSection(NamesSectionName);
// On COFF, it's important to reduce the alignment down to 1 to prevent the
// linker from inserting padding before the start of the names section or
// between names entries.
@@ -2179,3 +2217,22 @@ void createProfileSamplingVar(Module &M) {
appendToCompilerUsed(M, SamplingVar);
}
} // namespace llvm
+
+// For GPU targets: Allocate contiguous arrays for all profile data.
+// This solves the linker reordering problem by using ONE symbol per section
+// type, so there's nothing for the linker to reorder.
+StructType *InstrLowerer::getProfileDataTy() {
+ if (ProfileDataTy)
+ return ProfileDataTy;
+
+ auto &Ctx = M.getContext();
+ auto *IntPtrTy = M.getDataLayout().getIntPtrType(M.getContext());
+ auto *Int16Ty = Type::getInt16Ty(Ctx);
+ auto *Int16ArrayTy = ArrayType::get(Int16Ty, IPVK_Last + 1);
+ Type *DataTypes[] = {
+#define INSTR_PROF_DATA(Type, LLVMType, Name, Init) LLVMType,
+#include "llvm/ProfileData/InstrProfData.inc"
+ };
+ ProfileDataTy = StructType::get(Ctx, ArrayRef(DataTypes));
+ return ProfileDataTy;
+}
diff --git a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
index db032d6fcad45..b6d07aa821e7f 100644
--- a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
+++ b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
@@ -380,7 +380,8 @@ class FunctionInstrumenter final {
// another counter range within the context.
bool isValueProfilingDisabled() const {
return DisableValueProfiling ||
- InstrumentationType == PGOInstrumentationType::CTXPROF;
+ InstrumentationType == PGOInstrumentationType::CTXPROF ||
+ M.getTargetTriple().isGPU();
}
bool shouldInstrumentEntryBB() const {
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
b/llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
new file mode 100644
index 0000000000000..efe53ab1ebdfb
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
@@ -0,0 +1,32 @@
+;; Test basic AMDGPU PGO instrumentation lowering.
+;; Verifies that each instrumentation point lowers directly to a call to
+;; __llvm_profile_instrument_gpu with a null uniform-counter argument.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s
+
+@__hip_cuid_test01 = addrspace(1) global i8 0
+@__profn_test_kernel = private constant [11 x i8] c"test_kernel"
+
+define amdgpu_kernel void @test_kernel(ptr addrspace(1) %out, i32 %n) {
+entry:
+ call void @llvm.instrprof.increment(ptr @__profn_test_kernel, i64 111, i32
4, i32 0)
+ %cmp = icmp sgt i32 %n, 0
+ br i1 %cmp, label %if.then, label %if.end
+
+if.then:
+ call void @llvm.instrprof.increment(ptr @__profn_test_kernel, i64 111, i32
4, i32 1)
+ store i32 1, ptr addrspace(1) %out
+ br label %if.end
+
+if.end:
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+; CHECK-LABEL: define {{.*}} @test_kernel
+; CHECK-NOT: @__llvm_profile_sampling_gpu
+; CHECK: call void @__llvm_profile_instrument_gpu(
+; CHECK-SAME: ptr addrspacecast (ptr addrspace(1) @__profc_test_kernel to
ptr), ptr null, i64 1)
+; CHECK: call void @__llvm_profile_instrument_gpu(
+; CHECK-SAME: ptr addrspacecast (ptr addrspace(1) getelementptr inbounds ([4 x
i64], ptr addrspace(1) @__profc_test_kernel, i32 0, i32 1) to ptr), ptr null,
i64 1)
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-profc-arrays.ll
b/llvm/test/Instrumentation/InstrProfiling/amdgpu-profc-arrays.ll
new file mode 100644
index 0000000000000..eab78fb3591b1
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-profc-arrays.ll
@@ -0,0 +1,26 @@
+;; Per-kernel __profc_* arrays land in section __llvm_prf_cnts with one slot
+;; per counter, and counter increments lower to __llvm_profile_instrument_gpu
+;; calls whose pointer argument is a GEP into the per-kernel array.
+
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=instrprof < %s | FileCheck %s
+
+@__profn_kernel1 = private constant [7 x i8] c"kernel1"
+@__profn_kernel2 = private constant [7 x i8] c"kernel2"
+
+; CHECK: @__profc_kernel1 = linkonce_odr protected addrspace(1) global [2 x
i64] zeroinitializer, section "__llvm_prf_cnts"
+; CHECK: @__profc_kernel2 = linkonce_odr protected addrspace(1) global [1 x
i64] zeroinitializer, section "__llvm_prf_cnts"
+
+define amdgpu_kernel void @kernel1() {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel1, i64 12345, i32 2,
i32 0)
+ call void @llvm.instrprof.increment(ptr @__profn_kernel1, i64 12345, i32 2,
i32 1)
+ ret void
+}
+
+define amdgpu_kernel void @kernel2() {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel2, i64 67890, i32 1,
i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+; CHECK: call void @__llvm_profile_instrument_gpu(ptr addrspacecast (ptr
addrspace(1) getelementptr inbounds ([2 x i64], ptr addrspace(1)
@__profc_kernel1, i32 0, i32 1) to ptr), ptr null, i64 1)
diff --git a/llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
b/llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
new file mode 100644
index 0000000000000..ce16f1ee3215f
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
@@ -0,0 +1,36 @@
+; RUN: opt < %s -passes=instrprof -S | FileCheck %s
+
+; Test that weak functions on GPU targets get weak linkage for their
+; __profd_ aliases to allow linker deduplication across TUs.
+; Non-weak functions get external linkage (default for aliases).
+
+target triple = "amdgcn-amd-amdhsa"
+
+@__hip_cuid_abc123 = addrspace(1) global i8 0
+
+; AMDGPU GPU profiling lowers to per-function comdat globals (not aliases).
+; CHECK: @__profd_weak_func = linkonce_odr protected addrspace(1) global
+@__profn_weak_func = private constant [9 x i8] c"weak_func"
+
+define weak void @weak_func() {
+ call void @llvm.instrprof.increment(ptr @__profn_weak_func, i64 0, i32 1,
i32 0)
+ ret void
+}
+
+; CHECK: @__profd_weak_odr_func = linkonce_odr protected addrspace(1) global
+@__profn_weak_odr_func = private constant [13 x i8] c"weak_odr_func"
+
+define weak_odr void @weak_odr_func() {
+ call void @llvm.instrprof.increment(ptr @__profn_weak_odr_func, i64 0, i32
1, i32 0)
+ ret void
+}
+
+; CHECK: @__profd_normal_func = linkonce_odr protected addrspace(1) global
+@__profn_normal_func = private constant [11 x i8] c"normal_func"
+
+define void @normal_func() {
+ call void @llvm.instrprof.increment(ptr @__profn_normal_func, i64 0, i32 1,
i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
diff --git a/llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
b/llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
new file mode 100644
index 0000000000000..21b1d68004b13
--- /dev/null
+++ b/llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
@@ -0,0 +1,22 @@
+;; Test that value profiling (indirect call profiling) is disabled for GPU
targets.
+;; The device-side profiling runtime does not implement
+;; __llvm_profile_instrument_target, so indirect call profiling must not be
emitted.
+
+; RUN: opt < %s -passes=pgo-instr-gen -S | FileCheck %s
+
+target triple = "amdgcn-amd-amdhsa"
+
+@fptr = addrspace(1) global ptr null, align 8
+
+;; Verify that regular block instrumentation IS emitted
+; CHECK: call void @llvm.instrprof.increment
+
+;; Verify that value profiling for indirect calls is NOT emitted
+; CHECK-NOT: call void @llvm.instrprof.value.profile
+
+define amdgpu_kernel void @test_indirect_call() {
+entry:
+ %fp = load ptr, ptr addrspace(1) @fptr, align 8
+ call void %fp()
+ ret void
+}
>From 35d29a47051a1b195cded325fbe6ddefe926350e Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Thu, 4 Jun 2026 10:34:22 -0400
Subject: [PATCH 2/2] [PGO][HIP] Link clang_rt.profile_rocm on the host for HIP
device PGO
When building HIP with device PGO (-fprofile-generate /
-fprofile-instr-generate)
the host program must drain device-side profile counters. That collection
runtime
now lives in the separate clang_rt.profile_rocm library (see the compiler-rt
change), so the driver must link it into the host image in addition to the base
clang_rt.profile.
Add clang_rt.profile_rocm to the host link in addOffloadRTLibs (which already
knows the active offload kinds and links libamdhip64) for both the Linux/GNU and
MSVC host toolchains, gated on the HIP offload kind, needsProfileRT, and the
library being present in the resource directory. Non-HIP profile links and HIP
links without profiling are unaffected.
---
clang/lib/Driver/ToolChains/Linux.cpp | 7 ++++
clang/lib/Driver/ToolChains/MSVC.cpp | 8 +++++
.../test/Driver/hip-profile-rocm-runtime.hip | 32 +++++++++++++++++++
3 files changed, 47 insertions(+)
create mode 100644 clang/test/Driver/hip-profile-rocm-runtime.hip
diff --git a/clang/lib/Driver/ToolChains/Linux.cpp
b/clang/lib/Driver/ToolChains/Linux.cpp
index 5f04afe34c554..227e5e2811dac 100644
--- a/clang/lib/Driver/ToolChains/Linux.cpp
+++ b/clang/lib/Driver/ToolChains/Linux.cpp
@@ -902,6 +902,13 @@ void Linux::addOffloadRTLibs(unsigned ActiveKinds, const
ArgList &Args,
if (ActiveKinds & Action::OFK_HIP)
CmdArgs.push_back(
Args.MakeArgString(StringRef("-L") + RocmInstallation->getLibPath()));
+
+ // For HIP with device PGO, link the host-side ROCm device-profile collection
+ // runtime (clang_rt.profile_rocm) when it is available. It registers a
+ // collector with the base profile runtime to drain device counters.
+ if ((ActiveKinds & Action::OFK_HIP) && needsProfileRT(Args) &&
+ getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static)))
+ CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
}
void Linux::AddIAMCUIncludeArgs(const ArgList &DriverArgs,
diff --git a/clang/lib/Driver/ToolChains/MSVC.cpp
b/clang/lib/Driver/ToolChains/MSVC.cpp
index 6bc58699fb007..7829831f7d103 100644
--- a/clang/lib/Driver/ToolChains/MSVC.cpp
+++ b/clang/lib/Driver/ToolChains/MSVC.cpp
@@ -592,6 +592,14 @@ void MSVCToolChain::addOffloadRTLibs(unsigned ActiveKinds,
const ArgList &Args,
CmdArgs.append({Args.MakeArgString(StringRef("-libpath:") +
RocmInstallation->getLibPath()),
"amdhip64.lib"});
+
+ // For HIP with device PGO, link the host-side ROCm device-profile
+ // collection runtime (clang_rt.profile_rocm) when it is available. It
+ // registers a collector with the base profile runtime to drain device
+ // counters.
+ if (needsProfileRT(Args) &&
+ getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static)))
+ CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
}
}
diff --git a/clang/test/Driver/hip-profile-rocm-runtime.hip
b/clang/test/Driver/hip-profile-rocm-runtime.hip
new file mode 100644
index 0000000000000..0fe3ecd5ab660
--- /dev/null
+++ b/clang/test/Driver/hip-profile-rocm-runtime.hip
@@ -0,0 +1,32 @@
+// REQUIRES: x86-registered-target, amdgpu-registered-target
+// UNSUPPORTED: system-windows
+
+// Build a fake resource dir containing both the base profile runtime and the
+// ROCm device-profile runtime so the driver's existence check passes.
+// RUN: rm -rf %t && mkdir -p %t/lib/x86_64-unknown-linux
+// RUN: touch %t/lib/x86_64-unknown-linux/libclang_rt.profile.a
+// RUN: touch %t/lib/x86_64-unknown-linux/libclang_rt.profile_rocm.a
+// RUN: touch %t.o
+
+// HIP host link with PGO links clang_rt.profile_rocm.
+// RUN: %clang -### --hip-link --target=x86_64-unknown-linux \
+// RUN: -fprofile-instr-generate -resource-dir=%t \
+// RUN: --rocm-path=%S/Inputs/rocm %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=HIP-PGO %s
+// The ROCm device-profile runtime is linked before the base profile runtime so
+// its reference to the collector hook resolves against the base archive.
+// HIP-PGO-DAG: "{{.*}}libclang_rt.profile_rocm.a"
+// HIP-PGO-DAG: "{{.*}}libclang_rt.profile.a"
+
+// Without profiling, the ROCm device-profile runtime is not linked.
+// RUN: %clang -### --hip-link --target=x86_64-unknown-linux \
+// RUN: -resource-dir=%t --rocm-path=%S/Inputs/rocm %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=HIP-NOPGO %s
+// HIP-NOPGO-NOT: libclang_rt.profile_rocm.a
+
+// A non-HIP host link with PGO does not link the ROCm device-profile runtime.
+// RUN: %clang -### --target=x86_64-unknown-linux \
+// RUN: -fprofile-instr-generate -resource-dir=%t %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=HOST-PGO %s
+// HOST-PGO: "{{.*}}libclang_rt.profile.a"
+// HOST-PGO-NOT: libclang_rt.profile_rocm.a
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits