[clang] [compiler-rt] [PGO][HIP] HSA-introspection device profile drain + GPU PGO tests (PR #203056)

via cfe-commits Sun, 14 Jun 2026 12:28:23 -0700

llvmorg-github-actions[bot] wrote:


@llvm/pr-subscribers-pgo

Author: Larry Meadows (lfmeadow)

<details>
<summary>Changes</summary>

## Summary

Follow-up to #202095 (now landed). #202095's host-shadow device-profile drain
can only collect device counters for kernels that registered a host-side shadow
via `__hipRegisterVar`. Device-linked programs (e.g. RCCL), whose instrumented
code objects are linked directly into the device image with no host shadow, are
never drained.

This adds a **supplemental, Linux-only HSA-introspection drain** that runs after
the host-shadow drain: it walks each GPU agent, enumerates only the code objects
actually resident there, reads each one's `__llvm_profile_sections` table on the
device, and routes them through the existing `processDeviceOffloadPrf()` path so
the emitted `.profraw` layout is identical. A content-dedup set keyed on the
`(data, counters, names)` device-pointer triple ensures a section already
drained by the host-shadow pass is not drained twice, so the two passes compose
without double-counting.
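
The dedup step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch's code: `DrainedBounds`, `DrainDedup`, `record` and `claim` are hypothetical names, and `std::set` is used only for brevity (the profile runtime itself may use a simpler fixed-size structure).

```cpp
#include <cstdint>
#include <set>
#include <tuple>

// Hypothetical key: the three device pointers that identify one drained
// profile section (data, counters, names).
struct DrainedBounds {
  std::uintptr_t Data;
  std::uintptr_t Counters;
  std::uintptr_t Names;

  bool operator<(const DrainedBounds &O) const {
    return std::tie(Data, Counters, Names) <
           std::tie(O.Data, O.Counters, O.Names);
  }
};

// Hypothetical dedup set shared by the host-shadow pass and the HSA pass.
class DrainDedup {
public:
  // Host-shadow pass: remember a section after a successful drain.
  void record(const void *Data, const void *Counters, const void *Names) {
    Seen.insert(makeKey(Data, Counters, Names));
  }

  // HSA pass: returns true (and records the triple) only if this section has
  // not been drained yet; a repeated triple returns false.
  bool claim(const void *Data, const void *Counters, const void *Names) {
    return Seen.insert(makeKey(Data, Counters, Names)).second;
  }

private:
  static DrainedBounds makeKey(const void *D, const void *C, const void *N) {
    return {reinterpret_cast<std::uintptr_t>(D),
            reinterpret_cast<std::uintptr_t>(C),
            reinterpret_cast<std::uintptr_t>(N)};
  }

  std::set<DrainedBounds> Seen;
};
```

In this sketch the host-shadow pass calls `record` after each successful drain, and the HSA pass drains a section only when `claim` returns true.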

It is purely additive: it does not modify #202095's host-shadow drain or its
launch-tracking. Highlights:

- `compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp`: HSA agent/segment/
  symbol walk + dedup; record drained bounds after each host-shadow drain; lazy
  HSA init (no library constructor, for fork-safety).
- Because the HSA walk only touches resident code objects, it lets us avoid the
  host-shadow drain's collect-all fallback on Linux. When **no** kernel launch
  was tracked (the program never launches, collects before its first launch, or
  launches only via an untracked API), the host-shadow pass is skipped and the
  HSA drain covers it safely, instead of faulting/hanging while reading a
  non-resident device on a multi-GPU host. This also closes the
  silent-data-loss gap for untracked launch APIs (`hipExtLaunchKernel`,
  cooperative/graph launches).
- `clang/lib/Driver/ToolChains/Clang.cpp` / `HIPAMD.cpp`: link the device
  profile runtime on both the new-offload-driver (`LinkerWrapper::ConstructJob`)
  and traditional (`lld`) link paths, guarded by `needsProfileRT` + VFS
  existence.
- New GPU/AMDGPU HIP device-PGO tests, a dependency-free `run_gpu_tests.py`
  "lit-lite" runner (no `llvm-lit`/in-tree `FileCheck` required), and a
  `device-pgo/` standalone build helper.
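
One way to read the pass selection described in the highlights is the small decision function below. It is a sketch only: `DrainPlan` and `chooseDrainPlan` are hypothetical names, and the non-Linux branch (keeping the pre-existing collect-all fallback) is an assumption inferred from the summary's "on Linux" qualifier, not something the patch text states.

```cpp
// Hypothetical summary of which drain passes run at collection time.
enum class DrainPlan {
  HostShadowThenHsa,    // Linux, a launch was tracked: both passes, deduped
  HsaOnly,              // Linux, no tracked launch: host-shadow pass skipped
  HostShadowOnly,       // assumed non-Linux behavior with a tracked launch
  HostShadowCollectAll  // assumed non-Linux behavior: existing fallback
};

// Assumption: IsLinux reflects the build target, and AnyLaunchTracked is true
// once the runtime has observed at least one tracked kernel launch.
constexpr DrainPlan chooseDrainPlan(bool IsLinux, bool AnyLaunchTracked) {
  if (IsLinux)
    return AnyLaunchTracked ? DrainPlan::HostShadowThenHsa
                            : DrainPlan::HsaOnly;
  return AnyLaunchTracked ? DrainPlan::HostShadowOnly
                          : DrainPlan::HostShadowCollectAll;
}

// The "no kernel launch tracked" case on Linux relies on the HSA walk alone.
static_assert(chooseDrainPlan(true, false) == DrainPlan::HsaOnly,
              "untracked launches on Linux are covered by the HSA drain");
```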

## Why a separate test harness

There are no AMD GPUs in upstream CI, so these `.hip` tests don't run in-tree;
`run_gpu_tests.py` lets a downstream GPU CI (e.g. ROCm/TheRock) execute them
against an installed toolchain. It parses the `REQUIRES`/`UNSUPPORTED`/`RUN`
slice of lit markup, applies a fixed substitution set, detects `multi-device`
from the runtime-visible GPU count, and provides `FileCheck`/`not` shims when
the real binaries aren't in the artifact.
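
To make the "slice of lit markup" concrete, here is a rough sketch of that parsing step. It is written in C++ to match the rest of the patch; the actual runner is a Python script whose code is not shown here, so `LitDirectives`, `parseLitMarkup` and `shouldRun` are hypothetical names. The gating is deliberately simplified: real lit accepts boolean expressions in `REQUIRES`/`UNSUPPORTED`, while this sketch handles only plain comma-separated feature names.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct LitDirectives {
  std::vector<std::string> Requires, Unsupported, Run;
};

static std::string trim(const std::string &S) {
  const char *WS = " \t\r\n";
  std::size_t B = S.find_first_not_of(WS);
  if (B == std::string::npos)
    return "";
  return S.substr(B, S.find_last_not_of(WS) - B + 1);
}

static void splitCommaList(const std::string &S, std::vector<std::string> &Out) {
  std::istringstream In(S);
  std::string Item;
  while (std::getline(In, Item, ','))
    if (!(Item = trim(Item)).empty())
      Out.push_back(Item);
}

// If Line contains "<Key>:", store the trimmed remainder in Rest.
static bool matchDirective(const std::string &Line, const std::string &Key,
                           std::string &Rest) {
  std::size_t P = Line.find(Key + ":");
  if (P == std::string::npos)
    return false;
  Rest = trim(Line.substr(P + Key.size() + 1));
  return true;
}

LitDirectives parseLitMarkup(const std::string &Text) {
  LitDirectives D;
  std::istringstream In(Text);
  std::string Line, Rest;
  bool Continuing = false; // previous RUN line ended with a backslash
  while (std::getline(In, Line)) {
    if (matchDirective(Line, "REQUIRES", Rest)) {
      splitCommaList(Rest, D.Requires);
    } else if (matchDirective(Line, "UNSUPPORTED", Rest)) {
      splitCommaList(Rest, D.Unsupported);
    } else if (matchDirective(Line, "RUN", Rest)) {
      bool Backslash = !Rest.empty() && Rest.back() == '\\';
      if (Backslash) {
        Rest.pop_back();
        Rest = trim(Rest);
      }
      if (Continuing && !D.Run.empty())
        D.Run.back() += " " + Rest; // join continuation onto the open command
      else
        D.Run.push_back(Rest);
      Continuing = Backslash;
    }
  }
  return D;
}

// Simplified gating: all REQUIRES present, no UNSUPPORTED feature present.
bool shouldRun(const LitDirectives &D, const std::set<std::string> &Features) {
  for (const std::string &R : D.Requires)
    if (!Features.count(R))
      return false;
  for (const std::string &U : D.Unsupported)
    if (Features.count(U))
      return false;
  return true;
}
```

Under this sketch, a test marked `REQUIRES: multi-device` is skipped unless the caller adds `multi-device` to the feature set, for example after counting more than one visible GPU.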

## Test plan

- 4x gfx90a (`gfx90a:sramecc+:xnack-`), ROCm 7.1.
- `python3 compiler-rt/test/profile/run_gpu_tests.py --toolchain-bin <abs>/bin --hip-lib-path /opt/rocm/lib compiler-rt/test/profile/GPU compiler-rt/test/profile/AMDGPU`
- **12 passed, 0 failed, 0 unsupported.** Covers: basic/coverage/pgo-use,
  multiple-kernels, device-branching, multi-gpu and non-default-device drain,
  early-collect / no-kernel edges, RDC vs non-RDC `__llvm_profile_sections`,
  dedup (host-shadow drains the used device, HSA finds it and dedups), and
  fork-safety (the RCCL parent-no-HIP / kernel-in-forked-child pattern).
- Build is warning-clean and `git clang-format` clean.



---

Patch is 99.22 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/203056.diff


26 Files Affected:

- (modified) clang/lib/Driver/ToolChains/Clang.cpp (+15) 
- (modified) clang/lib/Driver/ToolChains/HIPAMD.cpp (+20) 
- (modified) clang/lib/Driver/ToolChains/Linux.cpp (+18-3) 
- (modified) clang/lib/Driver/ToolChains/MSVC.cpp (+19-12) 
- (modified) clang/test/Driver/hip-profile-rocm-runtime.hip (+16-2) 
- (modified) compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp (+630-52) 
- (added) compiler-rt/test/profile/AMDGPU/device-basic.hip (+67) 
- (added) compiler-rt/test/profile/AMDGPU/device-early-collect.hip (+68) 
- (added) compiler-rt/test/profile/AMDGPU/device-no-kernel.hip (+44) 
- (added) compiler-rt/test/profile/AMDGPU/device-symbols.hip (+42) 
- (added) compiler-rt/test/profile/AMDGPU/lit.local.cfg.py (+4) 
- (added) compiler-rt/test/profile/GPU/instrprof-hip-basic.hip (+51) 
- (added) compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip (+63) 
- (added) compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip (+56)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip (+51)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip (+67)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip (+61)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip (+57)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip (+63)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip (+58)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip (+60)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip (+63) 
- (added) compiler-rt/test/profile/device-pgo/README.md (+125) 
- (added) compiler-rt/test/profile/device-pgo/build.sh (+56) 
- (added) compiler-rt/test/profile/device-pgo/toolchain-cache.cmake (+55) 
- (added) compiler-rt/test/profile/run_gpu_tests.py (+408) 


``````````diff
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index c2ac478d84929..3b8bc46820af6 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9658,6 +9658,21 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
           (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX()))
         LinkerArgs.emplace_back("-lompdevice");
 
+      // With PGO/coverage instrumentation, GPU device code references the
+      // device profile runtime (__llvm_profile_instrument_gpu and the
+      // __llvm_profile_sections bounds table emitted by
+      // InstrProfilingPlatformGPU). The offload device link does not otherwise
+      // pull it in, so forward the static device profile runtime to the GPU
+      // device linker. The archive is arch-suffixed, so pass its full path
+      // rather than a -l name.
+      if (ToolChain::needsProfileRT(Args) &&
+          (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) {
+        std::string ProfileRT =
+            TC->getCompilerRT(Args, "profile", ToolChain::FT_Static);
+        if (TC->getVFS().exists(ProfileRT))
+          LinkerArgs.emplace_back(Args.MakeArgString(ProfileRT));
+      }
+
       // For SPIR-V, pass some extra flags to `spirv-link`, the out-of-tree
       // SPIR-V linker. `spirv-link` isn't called in LTO mode so restrict these
       // flags to normal compilation.
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 01cb23d0aa230..1bd4e073b4e27 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,6 +19,7 @@
 #include "clang/Options/Options.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
 #include "llvm/TargetParser/TargetParser.h"
 
 using namespace clang::driver;
@@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
 
   LldArgs.push_back("--no-whole-archive");
 
+  // With PGO/coverage instrumentation, instrumented device code references the
+  // device profile runtime (__llvm_profile_instrument_gpu and the
+  // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
+  // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
+  // but HIP using the traditional offload path (e.g. on Windows, which does not
+  // route device linking through clang-linker-wrapper) reaches the device link
+  // here instead. Forward the static device profile runtime to this lld device
+  // link so the runtime is pulled in regardless of offload-driver/host OS. The
+  // archive is arch-suffixed, so pass its full path rather than a -l name.
+  if (ToolChain::needsProfileRT(Args)) {
+    std::string ProfileRT =
+        TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
+    // Use the ToolChain VFS (matches the new-offload-driver path in
+    // Clang.cpp) so overlay/virtual filesystems used by the driver are
+    // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
+    if (TC.getVFS().exists(ProfileRT))
+      LldArgs.push_back(Args.MakeArgString(ProfileRT));
+  }
+
   const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
   C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
                                          Lld, LldArgs, Inputs, Output));
diff --git a/clang/lib/Driver/ToolChains/Linux.cpp b/clang/lib/Driver/ToolChains/Linux.cpp
index 512788d235fec..00ae53af4865f 100644
--- a/clang/lib/Driver/ToolChains/Linux.cpp
+++ b/clang/lib/Driver/ToolChains/Linux.cpp
@@ -906,13 +906,28 @@ void Linux::addOffloadRTLibs(unsigned ActiveKinds, const ArgList &Args,
         Args.MakeArgString(StringRef("-L") + RocmInstallation->getLibPath()));
 
   // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
-  // self-contained superset of clang_rt.profile, emitted first so the base
-  // archive stays inert.
-  if ((ActiveKinds & Action::OFK_HIP) && needsProfileRT(Args) &&
+  // self-contained superset of clang_rt.profile, emitted first (before the
+  // base archive added by addProfileRTLibs) so the base archive stays inert.
+  //
+  // This is intentionally not gated on Action::OFK_HIP. HIP host objects are
+  // routinely linked into a shared library or executable from pre-compiled
+  // .o files (e.g. RCCL's librccl.so), a link command that carries no HIP
+  // offload action yet still needs the device-counter drain. Gating on
+  // OFK_HIP would silently drop the drain for those object-only links and
+  // the resulting .profraw would contain host counters only. profile_rocm is
+  // self-contained and both its hipModuleLoad interceptor and its
+  // device-collection drain self-skip at runtime when the process has no
+  // resident device code, so linking it into a non-HIP instrumented binary is
+  // harmless. It is only present on ROCm-equipped toolchains in the first
+  // place (the getVFS().exists check below).
+  if (needsProfileRT(Args) &&
       getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
     CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
     // Force-retain the constructor-only hipModuleLoad* interceptor object; its
     // constructor self-skips when the program does not use hipModuleLoad.
+    // Pulling this object in also pulls the device-counter drain
+    // (__llvm_profile_hip_collect_device_data) from the same translation unit,
+    // which InstrProfilingFile.c invokes through a weak reference at exit.
     CmdArgs.push_back("-u");
     CmdArgs.push_back("__llvm_profile_offload_register_dynamic_module");
   }
diff --git a/clang/lib/Driver/ToolChains/MSVC.cpp b/clang/lib/Driver/ToolChains/MSVC.cpp
index 0796bdff96d46..9a7df6af7727c 100644
--- a/clang/lib/Driver/ToolChains/MSVC.cpp
+++ b/clang/lib/Driver/ToolChains/MSVC.cpp
@@ -598,19 +598,26 @@ void MSVCToolChain::addOffloadRTLibs(unsigned ActiveKinds, const ArgList &Args,
     CmdArgs.append({Args.MakeArgString(StringRef("-libpath:") +
                                        RocmInstallation->getLibPath()),
                     "amdhip64.lib"});
+  }
 
-    // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
-    // self-contained superset of clang_rt.profile, emitted first so the base
-    // archive stays inert (avoiding a /MD-vs-/MT CRT mix in the host image).
-    if (needsProfileRT(Args) &&
-        getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
-      CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
-      // Force the linker to retain the constructor-only hipModuleLoad*
-      // interceptor object from clang_rt.profile_rocm (see Linux.cpp). The
-      // constructor self-skips for programs that do not use hipModuleLoad.
-      CmdArgs.push_back(
-          "-include:__llvm_profile_offload_register_dynamic_module");
-    }
+  // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
+  // self-contained superset of clang_rt.profile, emitted first so the base
+  // archive stays inert (avoiding a /MD-vs-/MT CRT mix in the host image).
+  //
+  // Not gated on Action::OFK_HIP: HIP host objects are routinely linked into a
+  // DLL or executable from pre-compiled .obj files, a link that carries no HIP
+  // offload action yet still needs the device-counter drain (see Linux.cpp for
+  // the full rationale). profile_rocm self-skips at runtime when the process
+  // has no resident device code, and is only present on ROCm-equipped
+  // toolchains (the getVFS().exists check below).
+  if (needsProfileRT(Args) &&
+      getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
+    CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
+    // Force the linker to retain the constructor-only hipModuleLoad*
+    // interceptor object from clang_rt.profile_rocm (see Linux.cpp). The
+    // constructor self-skips for programs that do not use hipModuleLoad.
+    CmdArgs.push_back(
+        "-include:__llvm_profile_offload_register_dynamic_module");
   }
 }
 
diff --git a/clang/test/Driver/hip-profile-rocm-runtime.hip b/clang/test/Driver/hip-profile-rocm-runtime.hip
index fc82db4fc13c0..9346f05dedf42 100644
--- a/clang/test/Driver/hip-profile-rocm-runtime.hip
+++ b/clang/test/Driver/hip-profile-rocm-runtime.hip
@@ -25,9 +25,23 @@
 // RUN:   | FileCheck -check-prefix=HIP-NOPGO %s
 // HIP-NOPGO-NOT: libclang_rt.profile_rocm.a
 
-// A non-HIP host link with PGO does not link the ROCm device-profile runtime.
+// An object-only host link with PGO (no HIP offload action) still links the
+// ROCm device-profile runtime when it is available in the toolchain. HIP host
+// code is frequently linked into a library/executable from pre-compiled
+// objects, a link that carries no OFK_HIP yet still needs the device drain.
 // RUN: %clang -### --target=x86_64-unknown-linux \
 // RUN:   -fprofile-instr-generate -resource-dir=%t %t.o 2>&1 \
 // RUN:   | FileCheck -check-prefix=HOST-PGO %s
+// HOST-PGO: "{{.*}}libclang_rt.profile_rocm.a"
+// HOST-PGO: "-u" "__llvm_profile_offload_register_dynamic_module"
 // HOST-PGO: "{{.*}}libclang_rt.profile.a"
-// HOST-PGO-NOT: libclang_rt.profile_rocm.a
+
+// On a lean toolchain that ships only the base profile runtime (no
+// profile_rocm), nothing extra is linked and the link still succeeds.
+// RUN: rm -rf %t2 && mkdir -p %t2/lib/x86_64-unknown-linux
+// RUN: touch %t2/lib/x86_64-unknown-linux/libclang_rt.profile.a
+// RUN: %clang -### --target=x86_64-unknown-linux \
+// RUN:   -fprofile-instr-generate -resource-dir=%t2 %t.o 2>&1 \
+// RUN:   | FileCheck -check-prefix=NO-ROCM-RT %s
+// NO-ROCM-RT: "{{.*}}libclang_rt.profile.a"
+// NO-ROCM-RT-NOT: libclang_rt.profile_rocm.a
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index d0d9b1ea8f61d..b1db1d8a74041 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup;
 static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
                                    const OffloadSectionShadowGroup *Sections);
 
+#if defined(__linux__) && !defined(_WIN32)
+// Record a drained section-bounds tuple so the supplemental HSA-introspection
+// pass (Linux only) skips any code object the host-shadow path already
+// drained. Defined alongside the HSA drain below; forward-declared here so
+// processDeviceOffloadPrf can register every successful host-shadow drain.
+static void profRecordDrainedBounds(const void *Data, const void *Counters,
+                                    const void *Names);
+#endif
+
 static int isVerboseMode() {
   static int IsVerbose = -1;
   if (IsVerbose == -1)
@@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
 
   if (ret != 0) {
     PROF_ERR("%s\n", "failed to write device profile using shared API");
-  } else if (isVerboseMode()) {
-    PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+  } else {
+#if defined(__linux__) && !defined(_WIN32)
+    // Dedup against the supplemental HSA pass: this section is now drained, so
+    // the HSA walk must not drain the same device code object again.
+    profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin);
+#endif
+    if (isVerboseMode())
+      PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
   }
 
   return ret;
@@ -1148,72 +1163,635 @@ static int isHipAvailable(void) {
   return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
 }
 
-/* -------------------------------------------------------------------------- */
-/*  Collect device-side profile data                                          */
-/* -------------------------------------------------------------------------- */
+/* ========================================================================== */
+/*  Supplemental HSA-introspection drain (Linux only)                         */
+/*                                                                            */
+/*  The host-shadow drain above only sees device code objects registered      */
+/*  host-side (__hipRegisterVar shadows) or loaded through an intercepted */
+/*  hipModuleLoad* call. Device code linked by the offload device linker with */
+/*  no host-side shadow -- e.g. RCCL, whose many device functions are glued */
+/*  into a single kernel with no source module -- is invisible to it. This */
+/*  pass walks every GPU agent's loaded executables via HSA, finds each */
+/*  __llvm_profile_sections table directly on the device, and drains the ones */
+/*  the host-shadow pass did not already handle (deduped by the device */
+/*  section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
+/*  copy/relocate/write so the on-disk profraw layout is identical.           */
+/* ========================================================================== */
+#if defined(__linux__) && !defined(_WIN32)
 
-extern "C" int __llvm_profile_hip_collect_device_data(void) {
-  if (NumShadowVariables == 0 && NumDynamicModules == 0)
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+  uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
+  ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
+  ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
+  ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
+  ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+  prof_hsa_agent_t agent;
+  prof_hsa_executable_t executable;
+  prof_hsa_loader_storage_type_t code_object_storage_type;
+  const void *code_object_storage_base;
+  size_t code_object_storage_size;
+  size_t code_object_storage_offset;
+  const void *segment_base;
+  size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+                                                   prof_hsa_agent_info_t,
+                                                   void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+    prof_hsa_executable_t, prof_hsa_agent_t,
+    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+                          prof_hsa_executable_symbol_t, void *),
+    void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+                                                                     uint16_t,
+                                                                     size_t,
+                                                                     void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+    prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+  void *query_host_address;
+  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+  __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+  return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+  int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+  if (State)
+    return State > 0 ? 0 : -1;
+
+  if (!__interception::DynamicLoaderAvailable()) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "Dynamic library loading not available - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+  if (!Hsa)
+    Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+  if (!Hsa) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  hsa_init_ty pHsaInit =
+      (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+  hsa_system_get_major_extension_table_ty pGetExtTable =
+      (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+          Hsa, "hsa_system_get_major_extension_table");
+  pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+      Hsa, "hsa_iterate_agents");
+  pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+      Hsa, "hsa_agent_get_info");
+  pHsaExecIterAgentSyms =
+      (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_iterate_agent_symbols");
+  pHsaSymGetInfo =
+      (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_symbol_get_info");
+
+  if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+      !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+    PROF_WARN("%s",
+              "required HSA symbols missing - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+   * rather than from the library constructor, so merely loading the
+   * instrumented library does not initialize HSA in the process -- which would
+   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+   * the parent (see the constructor note at the end of the HSA block). In the
+   * common case the drain runs from the profile write path while HSA is still
+   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+   * this simply re-initializes HSA (the process is exiting anyway). */
+  prof_hsa_status_t St = pHsaInit();
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    if (isVerboseMode())
+      PROF_NOTE("hsa_init failed (0x%x) - HSA device ...
[truncated]

``````````
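
The `HsaRuntimeState` flag in the truncated `InstrProfilingPlatformROCm.cpp` hunk above follows a publish-with-release, observe-with-acquire pattern. The standalone sketch below shows the same idea with `std::atomic` rather than the `__atomic_*` builtins. `LazyRuntime`, `ProbeFn` and `fortyTwo` are hypothetical names, with `fortyTwo` standing in for a resolved entry point. The sketch does not serialize two threads that both observe state 0; a real runtime would need a lock or call-once for that.

```cpp
#include <atomic>

using ProbeFn = int (*)();

// Hypothetical stand-in for a runtime entry point resolved via a symbol lookup.
inline int fortyTwo() { return 42; }

struct LazyRuntime {
  // 0 = not yet attempted, 1 = ready, -1 = unavailable.
  std::atomic<int> State{0};
  // Plain (non-atomic) pointer; made visible by the release store to State.
  ProbeFn Probe = nullptr;

  // Resolved == nullptr simulates a failed symbol lookup.
  // Returns 0 when the runtime is usable and -1 otherwise.
  int ensureLoaded(ProbeFn Resolved) {
    int S = State.load(std::memory_order_acquire);
    if (S != 0)
      return S > 0 ? 0 : -1; // outcome is sticky after the first attempt

    if (!Resolved) {
      State.store(-1, std::memory_order_release);
      return -1;
    }

    Probe = Resolved;                          // write the data first
    State.store(1, std::memory_order_release); // then publish readiness
    return 0;
  }
};
```

A reader that loads `State == 1` with acquire ordering is guaranteed to also see the `Probe` pointer written before the release store.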

</details>


https://github.com/llvm/llvm-project/pull/203056
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
