llvmorg-github-actions[bot] wrote:
<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-pgo

Author: Larry Meadows (lfmeadow)

<details>
<summary>Changes</summary>

## Summary

Follow-up to #202095 (now landed).

#202095's host-shadow device-profile drain can only collect device counters for kernels that registered a host-side shadow via `__hipRegisterVar`. Device-linked programs (e.g. RCCL), whose instrumented code objects are linked directly into the device image with no host shadow, are never drained.

This adds a **supplemental, Linux-only HSA-introspection drain** that runs after the host-shadow drain: it walks each GPU agent, enumerates only the code objects actually resident there, reads each one's `__llvm_profile_sections` table on the device, and routes them through the existing `processDeviceOffloadPrf()` path so the emitted `.profraw` layout is identical. A content-dedup set keyed on the `(data, counters, names)` device-pointer triple ensures a section already drained by the host-shadow pass is not drained twice, so the two passes compose without double-counting. It is purely additive: it does not modify #202095's host-shadow drain or its launch-tracking.

Highlights:

- `compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp`: HSA agent/segment/symbol walk + dedup; record drained bounds after each host-shadow drain; lazy HSA init (no library constructor, for fork-safety).
- Because the HSA walk only touches resident code objects, it lets us avoid the host-shadow drain's collect-all fallback on Linux. When **no** kernel launch was tracked (program never launches, collects before its first launch, or launches only via an untracked API), the host-shadow pass is skipped and the HSA drain covers it safely, instead of faulting or hanging while reading a non-resident device on a multi-GPU host. This also closes the silent-data-loss gap for untracked launch APIs (`hipExtLaunchKernel`, cooperative/graph launches).
- `clang/lib/Driver/ToolChains/Clang.cpp` / `HIPAMD.cpp`: link the device profile runtime on both the new-offload-driver (`LinkerWrapper::ConstructJob`) and traditional (`lld`) link paths, guarded by `needsProfileRT` + VFS existence.
- New GPU/AMDGPU HIP device-PGO tests, a dependency-free `run_gpu_tests.py` "lit-lite" runner (no `llvm-lit`/in-tree `FileCheck` required), and a `device-pgo/` standalone build helper.

## Why a separate test harness

There are no AMD GPUs in upstream CI, so these `.hip` tests don't run in-tree; `run_gpu_tests.py` lets a downstream GPU CI (e.g. ROCm/TheRock) execute them against an installed toolchain. It parses the `REQUIRES`/`UNSUPPORTED`/`RUN` slice of lit markup, applies a fixed substitution set, detects `multi-device` from the runtime-visible GPU count, and provides `FileCheck`/`not` shims when the real binaries aren't in the artifact.

## Test plan

- 4x gfx90a (`gfx90a:sramecc+:xnack-`), ROCm 7.1.
- `python3 compiler-rt/test/profile/run_gpu_tests.py --toolchain-bin <abs>/bin --hip-lib-path /opt/rocm/lib compiler-rt/test/profile/GPU compiler-rt/test/profile/AMDGPU`
- **12 passed, 0 failed, 0 unsupported.** Covers: basic/coverage/pgo-use, multiple-kernels, device-branching, multi-gpu and non-default-device drain, early-collect / no-kernel edges, RDC vs non-RDC `__llvm_profile_sections`, dedup (host-shadow drains the used device, HSA finds it and dedups), and fork-safety (the RCCL parent-no-HIP / kernel-in-forked-child pattern).
- Build is warning-clean and `git clang-format` clean.

---

Patch is 99.22 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/203056.diff

26 Files Affected:

- (modified) clang/lib/Driver/ToolChains/Clang.cpp (+15)
- (modified) clang/lib/Driver/ToolChains/HIPAMD.cpp (+20)
- (modified) clang/lib/Driver/ToolChains/Linux.cpp (+18-3)
- (modified) clang/lib/Driver/ToolChains/MSVC.cpp (+19-12)
- (modified) clang/test/Driver/hip-profile-rocm-runtime.hip (+16-2)
- (modified) compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp (+630-52)
- (added) compiler-rt/test/profile/AMDGPU/device-basic.hip (+67)
- (added) compiler-rt/test/profile/AMDGPU/device-early-collect.hip (+68)
- (added) compiler-rt/test/profile/AMDGPU/device-no-kernel.hip (+44)
- (added) compiler-rt/test/profile/AMDGPU/device-symbols.hip (+42)
- (added) compiler-rt/test/profile/AMDGPU/lit.local.cfg.py (+4)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-basic.hip (+51)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip (+63)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip (+56)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip (+51)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip (+67)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip (+61)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip (+57)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip (+63)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip (+58)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip (+60)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip (+63)
- (added) compiler-rt/test/profile/device-pgo/README.md (+125)
- (added) compiler-rt/test/profile/device-pgo/build.sh (+56)
- (added) compiler-rt/test/profile/device-pgo/toolchain-cache.cmake (+55)
- (added) compiler-rt/test/profile/run_gpu_tests.py (+408)

``````````diff
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index c2ac478d84929..3b8bc46820af6 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9658,6 +9658,21 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
           (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX()))
         LinkerArgs.emplace_back("-lompdevice");

+      // With PGO/coverage instrumentation, GPU device code references the
+      // device profile runtime (__llvm_profile_instrument_gpu and the
+      // __llvm_profile_sections bounds table emitted by
+      // InstrProfilingPlatformGPU). The offload device link does not otherwise
+      // pull it in, so forward the static device profile runtime to the GPU
+      // device linker. The archive is arch-suffixed, so pass its full path
+      // rather than a -l name.
+      if (ToolChain::needsProfileRT(Args) &&
+          (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) {
+        std::string ProfileRT =
+            TC->getCompilerRT(Args, "profile", ToolChain::FT_Static);
+        if (TC->getVFS().exists(ProfileRT))
+          LinkerArgs.emplace_back(Args.MakeArgString(ProfileRT));
+      }
+
       // For SPIR-V, pass some extra flags to `spirv-link`, the out-of-tree
       // SPIR-V linker. `spirv-link` isn't called in LTO mode so restrict these
       // flags to normal compilation.
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 01cb23d0aa230..1bd4e073b4e27 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,6 +19,7 @@
 #include "clang/Options/Options.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
 #include "llvm/TargetParser/TargetParser.h"

 using namespace clang::driver;
@@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,

   LldArgs.push_back("--no-whole-archive");

+  // With PGO/coverage instrumentation, instrumented device code references the
+  // device profile runtime (__llvm_profile_instrument_gpu and the
+  // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
+  // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
+  // but HIP using the traditional offload path (e.g. on Windows, which does not
+  // route device linking through clang-linker-wrapper) reaches the device link
+  // here instead. Forward the static device profile runtime to this lld device
+  // link so the runtime is pulled in regardless of offload-driver/host OS. The
+  // archive is arch-suffixed, so pass its full path rather than a -l name.
+  if (ToolChain::needsProfileRT(Args)) {
+    std::string ProfileRT =
+        TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
+    // Use the ToolChain VFS (matches the new-offload-driver path in
+    // Clang.cpp) so overlay/virtual filesystems used by the driver are
+    // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
+    if (TC.getVFS().exists(ProfileRT))
+      LldArgs.push_back(Args.MakeArgString(ProfileRT));
+  }
+
   const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
   C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
                                          Lld, LldArgs, Inputs, Output));
diff --git a/clang/lib/Driver/ToolChains/Linux.cpp b/clang/lib/Driver/ToolChains/Linux.cpp
index 512788d235fec..00ae53af4865f 100644
--- a/clang/lib/Driver/ToolChains/Linux.cpp
+++ b/clang/lib/Driver/ToolChains/Linux.cpp
@@ -906,13 +906,28 @@ void Linux::addOffloadRTLibs(unsigned ActiveKinds, const ArgList &Args,
         Args.MakeArgString(StringRef("-L") + RocmInstallation->getLibPath()));

   // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
-  // self-contained superset of clang_rt.profile, emitted first so the base
-  // archive stays inert.
-  if ((ActiveKinds & Action::OFK_HIP) && needsProfileRT(Args) &&
+  // self-contained superset of clang_rt.profile, emitted first (before the
+  // base archive added by addProfileRTLibs) so the base archive stays inert.
+  //
+  // This is intentionally not gated on Action::OFK_HIP. HIP host objects are
+  // routinely linked into a shared library or executable from pre-compiled
+  // .o files (e.g. RCCL's librccl.so), a link command that carries no HIP
+  // offload action yet still needs the device-counter drain. Gating on
+  // OFK_HIP would silently drop the drain for those object-only links and
+  // the resulting .profraw would contain host counters only. profile_rocm is
+  // self-contained and both its hipModuleLoad interceptor and its
+  // device-collection drain self-skip at runtime when the process has no
+  // resident device code, so linking it into a non-HIP instrumented binary is
+  // harmless. It is only present on ROCm-equipped toolchains in the first
+  // place (the getVFS().exists check below).
+  if (needsProfileRT(Args) &&
       getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
     CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
     // Force-retain the constructor-only hipModuleLoad* interceptor object; its
     // constructor self-skips when the program does not use hipModuleLoad.
+    // Pulling this object in also pulls the device-counter drain
+    // (__llvm_profile_hip_collect_device_data) from the same translation unit,
+    // which InstrProfilingFile.c invokes through a weak reference at exit.
     CmdArgs.push_back("-u");
     CmdArgs.push_back("__llvm_profile_offload_register_dynamic_module");
   }
diff --git a/clang/lib/Driver/ToolChains/MSVC.cpp b/clang/lib/Driver/ToolChains/MSVC.cpp
index 0796bdff96d46..9a7df6af7727c 100644
--- a/clang/lib/Driver/ToolChains/MSVC.cpp
+++ b/clang/lib/Driver/ToolChains/MSVC.cpp
@@ -598,19 +598,26 @@ void MSVCToolChain::addOffloadRTLibs(unsigned ActiveKinds, const ArgList &Args,
     CmdArgs.append({Args.MakeArgString(StringRef("-libpath:") +
                                        RocmInstallation->getLibPath()),
                     "amdhip64.lib"});
+  }

-    // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
-    // self-contained superset of clang_rt.profile, emitted first so the base
-    // archive stays inert (avoiding a /MD-vs-/MT CRT mix in the host image).
-    if (needsProfileRT(Args) &&
-        getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
-      CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
-      // Force the linker to retain the constructor-only hipModuleLoad*
-      // interceptor object from clang_rt.profile_rocm (see Linux.cpp). The
-      // constructor self-skips for programs that do not use hipModuleLoad.
-      CmdArgs.push_back(
-          "-include:__llvm_profile_offload_register_dynamic_module");
-    }
+  // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
+  // self-contained superset of clang_rt.profile, emitted first so the base
+  // archive stays inert (avoiding a /MD-vs-/MT CRT mix in the host image).
+  //
+  // Not gated on Action::OFK_HIP: HIP host objects are routinely linked into a
+  // DLL or executable from pre-compiled .obj files, a link that carries no HIP
+  // offload action yet still needs the device-counter drain (see Linux.cpp for
+  // the full rationale). profile_rocm self-skips at runtime when the process
+  // has no resident device code, and is only present on ROCm-equipped
+  // toolchains (the getVFS().exists check below).
+  if (needsProfileRT(Args) &&
+      getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
+    CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
+    // Force the linker to retain the constructor-only hipModuleLoad*
+    // interceptor object from clang_rt.profile_rocm (see Linux.cpp). The
+    // constructor self-skips for programs that do not use hipModuleLoad.
+    CmdArgs.push_back(
+        "-include:__llvm_profile_offload_register_dynamic_module");
   }
 }

diff --git a/clang/test/Driver/hip-profile-rocm-runtime.hip b/clang/test/Driver/hip-profile-rocm-runtime.hip
index fc82db4fc13c0..9346f05dedf42 100644
--- a/clang/test/Driver/hip-profile-rocm-runtime.hip
+++ b/clang/test/Driver/hip-profile-rocm-runtime.hip
@@ -25,9 +25,23 @@
 // RUN: | FileCheck -check-prefix=HIP-NOPGO %s
 // HIP-NOPGO-NOT: libclang_rt.profile_rocm.a

-// A non-HIP host link with PGO does not link the ROCm device-profile runtime.
+// An object-only host link with PGO (no HIP offload action) still links the
+// ROCm device-profile runtime when it is available in the toolchain. HIP host
+// code is frequently linked into a library/executable from pre-compiled
+// objects, a link that carries no OFK_HIP yet still needs the device drain.
 // RUN: %clang -### --target=x86_64-unknown-linux \
 // RUN: -fprofile-instr-generate -resource-dir=%t %t.o 2>&1 \
 // RUN: | FileCheck -check-prefix=HOST-PGO %s
+// HOST-PGO: "{{.*}}libclang_rt.profile_rocm.a"
+// HOST-PGO: "-u" "__llvm_profile_offload_register_dynamic_module"
 // HOST-PGO: "{{.*}}libclang_rt.profile.a"
-// HOST-PGO-NOT: libclang_rt.profile_rocm.a
+
+// On a lean toolchain that ships only the base profile runtime (no
+// profile_rocm), nothing extra is linked and the link still succeeds.
+// RUN: rm -rf %t2 && mkdir -p %t2/lib/x86_64-unknown-linux
+// RUN: touch %t2/lib/x86_64-unknown-linux/libclang_rt.profile.a
+// RUN: %clang -### --target=x86_64-unknown-linux \
+// RUN: -fprofile-instr-generate -resource-dir=%t2 %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=NO-ROCM-RT %s
+// NO-ROCM-RT: "{{.*}}libclang_rt.profile.a"
+// NO-ROCM-RT-NOT: libclang_rt.profile_rocm.a
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index d0d9b1ea8f61d..b1db1d8a74041 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup;
 static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
                                    const OffloadSectionShadowGroup *Sections);

+#if defined(__linux__) && !defined(_WIN32)
+// Record a drained section-bounds tuple so the supplemental HSA-introspection
+// pass (Linux only) skips any code object the host-shadow path already
+// drained. Defined alongside the HSA drain below; forward-declared here so
+// processDeviceOffloadPrf can register every successful host-shadow drain.
+static void profRecordDrainedBounds(const void *Data, const void *Counters,
+                                    const void *Names);
+#endif
+
 static int isVerboseMode() {
   static int IsVerbose = -1;
   if (IsVerbose == -1)
@@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,

   if (ret != 0) {
     PROF_ERR("%s\n", "failed to write device profile using shared API");
-  } else if (isVerboseMode()) {
-    PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+  } else {
+#if defined(__linux__) && !defined(_WIN32)
+    // Dedup against the supplemental HSA pass: this section is now drained, so
+    // the HSA walk must not drain the same device code object again.
+    profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin);
+#endif
+    if (isVerboseMode())
+      PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
   }

   return ret;
@@ -1148,72 +1163,635 @@ static int isHipAvailable(void) {
   return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
 }

-/* -------------------------------------------------------------------------- */
-/* Collect device-side profile data */
-/* -------------------------------------------------------------------------- */
+/* ========================================================================== */
+/* Supplemental HSA-introspection drain (Linux only) */
+/* */
+/* The host-shadow drain above only sees device code objects registered */
+/* host-side (__hipRegisterVar shadows) or loaded through an intercepted */
+/* hipModuleLoad* call. Device code linked by the offload device linker with */
+/* no host-side shadow -- e.g. RCCL, whose many device functions are glued */
+/* into a single kernel with no source module -- is invisible to it. This */
+/* pass walks every GPU agent's loaded executables via HSA, finds each */
+/* __llvm_profile_sections table directly on the device, and drains the ones */
+/* the host-shadow pass did not already handle (deduped by the device */
+/* section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
+/* copy/relocate/write so the on-disk profraw layout is identical. */
+/* ========================================================================== */
+#if defined(__linux__) && !defined(_WIN32)

-extern "C" int __llvm_profile_hip_collect_device_data(void) {
-  if (NumShadowVariables == 0 && NumDynamicModules == 0)
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+  uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE \
+  ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH \
+  ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME \
+  ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS \
+  ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+  prof_hsa_agent_t agent;
+  prof_hsa_executable_t executable;
+  prof_hsa_loader_storage_type_t code_object_storage_type;
+  const void *code_object_storage_base;
+  size_t code_object_storage_size;
+  size_t code_object_storage_offset;
+  const void *segment_base;
+  size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+                                                   prof_hsa_agent_info_t,
+                                                   void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+    prof_hsa_executable_t, prof_hsa_agent_t,
+    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+                          prof_hsa_executable_symbol_t, void *),
+    void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+                                                                     uint16_t,
+                                                                     size_t,
+                                                                     void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+    prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+  void *query_host_address;
+  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+  __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+  return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+  int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+  if (State)
+    return State > 0 ? 0 : -1;
+
+  if (!__interception::DynamicLoaderAvailable()) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "Dynamic library loading not available - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+  if (!Hsa)
+    Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+  if (!Hsa) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  hsa_init_ty pHsaInit =
+      (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+  hsa_system_get_major_extension_table_ty pGetExtTable =
+      (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+          Hsa, "hsa_system_get_major_extension_table");
+  pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+      Hsa, "hsa_iterate_agents");
+  pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+      Hsa, "hsa_agent_get_info");
+  pHsaExecIterAgentSyms =
+      (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_iterate_agent_symbols");
+  pHsaSymGetInfo =
+      (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_symbol_get_info");
+
+  if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+      !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+    PROF_WARN("%s",
+              "required HSA symbols missing - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+   * rather than from the library constructor, so merely loading the
+   * instrumented library does not initialize HSA in the process -- which would
+   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+   * the parent (see the constructor note at the end of the HSA block). In the
+   * common case the drain runs from the profile write path while HSA is still
+   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+   * this simply re-initializes HSA (the process is exiting anyway). */
+  prof_hsa_status_t St = pHsaInit();
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    if (isVerboseMode())
+      PROF_NOTE("hsa_init failed (0x%x) - HSA device
... [truncated]
``````````

</details>

https://github.com/llvm/llvm-project/pull/203056
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
