https://github.com/lfmeadow updated https://github.com/llvm/llvm-project/pull/203056
>From 36178b0e9ddff1038e6f13e80f6f4dcd2ab68226 Mon Sep 17 00:00:00 2001 From: Larry Meadows <[email protected]> Date: Sun, 14 Jun 2026 14:18:38 -0500 Subject: [PATCH] [PGO][HIP] Add supplemental HSA-introspection device drain and GPU PGO tests The host-shadow device-profile drain (InstrProfilingPlatformROCm.cpp) can only collect device counters for kernels that registered a host-side shadow via __hipRegisterVar. Device-linked programs (e.g. RCCL) link the instrumented code object directly into the device image with no host shadow, so their counters are never drained. Add a supplemental, Linux-only drain that introspects the loaded code objects via the HSA runtime: it walks each GPU agent, enumerates only the code objects actually resident there, reads each one's __llvm_profile_sections, and routes them through the existing processDeviceOffloadPrf() path so the emitted profraw layout is identical. A small content-dedup set keyed on the (data, counters, names) device-pointer triple ensures a section already drained by the host-shadow pass is not drained again, so the two passes compose without double-counting. HSA is brought up lazily from the drain (never from a library constructor) to avoid poisoning fork-based callers. Because the HSA walk only ever touches resident code objects, it also makes the host-shadow pass's collect-all fallback unnecessary on Linux: when no kernel launch was tracked (a program that never launches, collects before its first launch, or launches only via an untracked API), the host-shadow pass is skipped and the HSA drain covers it safely instead of faulting/hanging on a non-resident device on a multi-GPU host. Link the device profile runtime on both the new-offload-driver (LinkerWrapper::ConstructJob) and traditional (HIPAMD constructLldCommand) link paths so instrumented device images resolve the runtime symbols. On the host side, link clang_rt.profile_rocm -- the self-contained runtime variant that carries the device-counter drain and the hipModuleLoad interceptor -- for any instrumented host link on a ROCm-equipped toolchain, not only for links with an active HIP offload action (OFK_HIP). HIP host code is frequently linked into a shared library or executable from pre-compiled objects (e.g. RCCL's librccl.so is linked from .o inputs by a plain clang++ -shared); such a link carries no OFK_HIP yet still needs the drain, and gating on it left those .profraw files with host counters only. profile_rocm is emitted ahead of the base clang_rt.profile (which stays inert), guarded by an existence check that leaves lean toolchains unchanged, and both the interceptor and the drain self-skip when the process has no resident device code. Also add a GPU-executed test suite (compiler-rt/test/profile/{GPU,AMDGPU}/*.hip) and a dependency-free "lit-lite" runner (run_gpu_tests.py) so the device drain can be exercised on a real AMD GPU runner: basic/coverage/pgo-use, multi-kernel, device-branching, multi-GPU and non-default-device drain + dedup, early-collect / no-kernel edges, RDC vs non-RDC __llvm_profile_sections, fork-safety (the RCCL parent-no-HIP / kernel-in-forked-child pattern), quantitative device-counter correctness, multi-process offline accumulation, and explicit-collect idempotency. A standalone device-pgo/ build helper reproduces the toolchain locally. The object-only host-link path is exercised by clang/test/Driver/hip-profile-rocm-runtime.hip. Co-authored-by: Cursor <[email protected]> --- clang/lib/Driver/ToolChains/Clang.cpp | 15 + clang/lib/Driver/ToolChains/HIPAMD.cpp | 20 + .../profile/InstrProfilingPlatformROCm.cpp | 682 ++++++++++++++++-- .../test/profile/AMDGPU/device-basic.hip | 67 ++ .../profile/AMDGPU/device-early-collect.hip | 68 ++ .../test/profile/AMDGPU/device-no-kernel.hip | 44 ++ .../test/profile/AMDGPU/device-symbols.hip | 42 ++ .../test/profile/AMDGPU/lit.local.cfg.py | 4 + .../test/profile/GPU/instrprof-hip-basic.hip | 51 ++ .../GPU/instrprof-hip-collect-after.hip | 63 ++ .../GPU/instrprof-hip-counter-correctness.hip | 56 ++ .../profile/GPU/instrprof-hip-coverage.hip | 51 ++ .../GPU/instrprof-hip-device-branching.hip | 67 ++ .../profile/GPU/instrprof-hip-fork-safety.hip | 61 ++ .../profile/GPU/instrprof-hip-multi-gpu.hip | 57 ++ .../GPU/instrprof-hip-multi-process-merge.hip | 63 ++ .../GPU/instrprof-hip-multiple-kernels.hip | 58 ++ .../GPU/instrprof-hip-nondefault-device.hip | 60 ++ .../profile/GPU/instrprof-hip-pgo-use.hip | 63 ++ compiler-rt/test/profile/device-pgo/README.md | 125 ++++ compiler-rt/test/profile/device-pgo/build.sh | 56 ++ .../profile/device-pgo/toolchain-cache.cmake | 55 ++ compiler-rt/test/profile/run_gpu_tests.py | 408 +++++++++++ 23 files changed, 2184 insertions(+), 52 deletions(-) create mode 100644 compiler-rt/test/profile/AMDGPU/device-basic.hip create mode 100644 compiler-rt/test/profile/AMDGPU/device-early-collect.hip create mode 100644 compiler-rt/test/profile/AMDGPU/device-no-kernel.hip create mode 100644 compiler-rt/test/profile/AMDGPU/device-symbols.hip create mode 100644 compiler-rt/test/profile/AMDGPU/lit.local.cfg.py create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-basic.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip create mode 100644 compiler-rt/test/profile/device-pgo/README.md create mode 100755 compiler-rt/test/profile/device-pgo/build.sh create mode 100644 compiler-rt/test/profile/device-pgo/toolchain-cache.cmake create mode 100644 compiler-rt/test/profile/run_gpu_tests.py diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp index c2ac478d84929..3b8bc46820af6 100644 --- a/clang/lib/Driver/ToolChains/Clang.cpp +++ b/clang/lib/Driver/ToolChains/Clang.cpp @@ -9658,6 +9658,21 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA, (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) LinkerArgs.emplace_back("-lompdevice"); + // With PGO/coverage instrumentation, GPU device code references the + // device profile runtime (__llvm_profile_instrument_gpu and the + // __llvm_profile_sections bounds table emitted by + // InstrProfilingPlatformGPU). The offload device link does not otherwise + // pull it in, so forward the static device profile runtime to the GPU + // device linker. The archive is arch-suffixed, so pass its full path + // rather than a -l name. + if (ToolChain::needsProfileRT(Args) && + (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) { + std::string ProfileRT = + TC->getCompilerRT(Args, "profile", ToolChain::FT_Static); + if (TC->getVFS().exists(ProfileRT)) + LinkerArgs.emplace_back(Args.MakeArgString(ProfileRT)); + } + // For SPIR-V, pass some extra flags to `spirv-link`, the out-of-tree // SPIR-V linker. `spirv-link` isn't called in LTO mode so restrict these // flags to normal compilation. diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp index 01cb23d0aa230..1bd4e073b4e27 100644 --- a/clang/lib/Driver/ToolChains/HIPAMD.cpp +++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp @@ -19,6 +19,7 @@ #include "clang/Options/Options.h" #include "llvm/Support/FileSystem.h" #include "llvm/Support/Path.h" +#include "llvm/Support/VirtualFileSystem.h" #include "llvm/TargetParser/TargetParser.h" using namespace clang::driver; @@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA, LldArgs.push_back("--no-whole-archive"); + // With PGO/coverage instrumentation, instrumented device code references the + // device profile runtime (__llvm_profile_instrument_gpu and the + // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU). + // The new-offload-driver path injects this in LinkerWrapper::ConstructJob, + // but HIP using the traditional offload path (e.g. on Windows, which does not + // route device linking through clang-linker-wrapper) reaches the device link + // here instead. Forward the static device profile runtime to this lld device + // link so the runtime is pulled in regardless of offload-driver/host OS. The + // archive is arch-suffixed, so pass its full path rather than a -l name. + if (ToolChain::needsProfileRT(Args)) { + std::string ProfileRT = + TC.getCompilerRT(Args, "profile", ToolChain::FT_Static); + // Use the ToolChain VFS (matches the new-offload-driver path in + // Clang.cpp) so overlay/virtual filesystems used by the driver are + // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime. + if (TC.getVFS().exists(ProfileRT)) + LldArgs.push_back(Args.MakeArgString(ProfileRT)); + } + const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld")); C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(), Lld, LldArgs, Inputs, Output)); diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp index d0d9b1ea8f61d..b1db1d8a74041 100644 --- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp +++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp @@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup; static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target, const OffloadSectionShadowGroup *Sections); +#if defined(__linux__) && !defined(_WIN32) +// Record a drained section-bounds tuple so the supplemental HSA-introspection +// pass (Linux only) skips any code object the host-shadow path already +// drained. Defined alongside the HSA drain below; forward-declared here so +// processDeviceOffloadPrf can register every successful host-shadow drain. +static void profRecordDrainedBounds(const void *Data, const void *Counters, + const void *Names); +#endif + static int isVerboseMode() { static int IsVerbose = -1; if (IsVerbose == -1) @@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target, if (ret != 0) { PROF_ERR("%s\n", "failed to write device profile using shared API"); - } else if (isVerboseMode()) { - PROF_NOTE("%s\n", "Successfully wrote device profile using shared API"); + } else { +#if defined(__linux__) && !defined(_WIN32) + // Dedup against the supplemental HSA pass: this section is now drained, so + // the HSA walk must not drain the same device code object again. + profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin); +#endif + if (isVerboseMode()) + PROF_NOTE("%s\n", "Successfully wrote device profile using shared API"); } return ret; @@ -1148,72 +1163,635 @@ static int isHipAvailable(void) { return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr; } -/* -------------------------------------------------------------------------- */ -/* Collect device-side profile data */ -/* -------------------------------------------------------------------------- */ +/* ========================================================================== */ +/* Supplemental HSA-introspection drain (Linux only) */ +/* */ +/* The host-shadow drain above only sees device code objects registered */ +/* host-side (__hipRegisterVar shadows) or loaded through an intercepted */ +/* hipModuleLoad* call. Device code linked by the offload device linker with */ +/* no host-side shadow -- e.g. RCCL, whose many device functions are glued */ +/* into a single kernel with no source module -- is invisible to it. This */ +/* pass walks every GPU agent's loaded executables via HSA, finds each */ +/* __llvm_profile_sections table directly on the device, and drains the ones */ +/* the host-shadow pass did not already handle (deduped by the device */ +/* section-bounds tuple). It reuses processDeviceOffloadPrf() for the */ +/* copy/relocate/write so the on-disk profraw layout is identical. */ +/* ========================================================================== */ +#if defined(__linux__) && !defined(_WIN32) -extern "C" int __llvm_profile_hip_collect_device_data(void) { - if (NumShadowVariables == 0 && NumDynamicModules == 0) +/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at + * build time, so mirror just the handful of HSA declarations the drain needs. + * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */ +typedef uint32_t prof_hsa_status_t; +#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0) +#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1) + +typedef struct { + uint64_t handle; +} prof_hsa_agent_t; +typedef struct { + uint64_t handle; +} prof_hsa_executable_t; +typedef struct { + uint64_t handle; +} prof_hsa_executable_symbol_t; + +typedef uint32_t prof_hsa_agent_info_t; +#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0) +#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17) + +typedef uint32_t prof_hsa_device_type_t; +#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1) + +typedef uint32_t prof_hsa_symbol_kind_t; +#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0) + +typedef uint32_t prof_hsa_executable_symbol_info_t; +#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE \ + ((prof_hsa_executable_symbol_info_t)0) +#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH \ + ((prof_hsa_executable_symbol_info_t)1) +#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME \ + ((prof_hsa_executable_symbol_info_t)2) +#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS \ + ((prof_hsa_executable_symbol_info_t)21) + +#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201) + +typedef uint32_t prof_hsa_loader_storage_type_t; + +typedef struct { + prof_hsa_agent_t agent; + prof_hsa_executable_t executable; + prof_hsa_loader_storage_type_t code_object_storage_type; + const void *code_object_storage_base; + size_t code_object_storage_size; + size_t code_object_storage_offset; + const void *segment_base; + size_t segment_size; +} prof_hsa_loader_segment_descriptor_t; + +typedef prof_hsa_status_t (*hsa_init_ty)(void); +typedef prof_hsa_status_t (*hsa_iterate_agents_ty)( + prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *); +typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t, + prof_hsa_agent_info_t, + void *); +typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)( + prof_hsa_executable_t, prof_hsa_agent_t, + prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t, + prof_hsa_executable_symbol_t, void *), + void *); +typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)( + prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *); +typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t, + uint16_t, + size_t, + void *); +typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)( + prof_hsa_loader_segment_descriptor_t *, size_t *); + +/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only + * query_segment_descriptors is used; query_host_address keeps the offset. */ +typedef struct { + void *query_host_address; + hsa_loader_query_segment_descriptors_ty query_segment_descriptors; +} prof_hsa_loader_pfn_t; + +static hsa_iterate_agents_ty pHsaIterateAgents = nullptr; +static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr; +static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr; +static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr; +static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr; + +/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/ + * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees + * the fully-written p* function pointers (published before the release store + * of HsaRuntimeState=1 below). */ +static int HsaRuntimeState = 0; + +static int setHsaRuntimeState(int S) { + __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE); + return S > 0 ? 0 : -1; +} + +/* Resolve HSA entry points (and the AMD loader extension) once, and confirm + * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is + * resolved by the shared ensureHipLoaded() above. */ +static int loadHsaRuntimePointers(void) { + int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE); + if (State) + return State > 0 ? 0 : -1; + + if (!__interception::DynamicLoaderAvailable()) { + if (isVerboseMode()) + PROF_NOTE("%s", "Dynamic library loading not available - " + "HSA device profiling disabled\n"); + return setHsaRuntimeState(-1); + } + + void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so"); + if (!Hsa) + Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1"); + if (!Hsa) { + if (isVerboseMode()) + PROF_NOTE("%s", "libhsa-runtime64.so not loadable - " + "HSA device profiling disabled\n"); + return setHsaRuntimeState(-1); + } + + hsa_init_ty pHsaInit = + (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init"); + hsa_system_get_major_extension_table_ty pGetExtTable = + (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol( + Hsa, "hsa_system_get_major_extension_table"); + pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol( + Hsa, "hsa_iterate_agents"); + pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol( + Hsa, "hsa_agent_get_info"); + pHsaExecIterAgentSyms = + (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol( + Hsa, "hsa_executable_iterate_agent_symbols"); + pHsaSymGetInfo = + (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol( + Hsa, "hsa_executable_symbol_get_info"); + + if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo || + !pHsaExecIterAgentSyms || !pHsaSymGetInfo) { + PROF_WARN("%s", + "required HSA symbols missing - HSA device profiling disabled\n"); + return setHsaRuntimeState(-1); + } + + /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain + * rather than from the library constructor, so merely loading the + * instrumented library does not initialize HSA in the process -- which would + * break fork-based callers that deliberately keep HIP/HSA uninitialized in + * the parent (see the constructor note at the end of the HSA block). In the + * common case the drain runs from the profile write path while HSA is still + * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed, + * this simply re-initializes HSA (the process is exiting anyway). */ + prof_hsa_status_t St = pHsaInit(); + if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) { + if (isVerboseMode()) + PROF_NOTE("hsa_init failed (0x%x) - HSA device profiling disabled\n", St); + return setHsaRuntimeState(-1); + } + + prof_hsa_loader_pfn_t LoaderApi; + __builtin_memset(&LoaderApi, 0, sizeof(LoaderApi)); + St = pGetExtTable(PROF_HSA_EXTENSION_AMD_LOADER, 1, sizeof(LoaderApi), + &LoaderApi); + if (St != PROF_HSA_STATUS_SUCCESS || !LoaderApi.query_segment_descriptors) { + PROF_WARN("AMD loader extension unavailable (0x%x) - " + "HSA device profiling disabled\n", + St); + return setHsaRuntimeState(-1); + } + pQuerySegDescs = LoaderApi.query_segment_descriptors; + + /* The device-to-host copies go through the shared HIP loader. */ + ensureHipLoaded(); + if (!pHipMemcpy) { + PROF_WARN("%s", "hipMemcpy unavailable - HSA device profiling disabled\n"); + return setHsaRuntimeState(-1); + } + + if (isVerboseMode()) + PROF_NOTE("%s", "HSA + HIP runtime resolved for device profiling\n"); + return setHsaRuntimeState(1); +} + +/* The canonical device bounds-table symbol from InstrProfilingPlatformGPU.c. */ +static const char ProfileSectionsSymbol[] = "__llvm_profile_sections"; + +/* Dedup of drained section-bounds tuples, shared with the host-shadow path + * (processDeviceOffloadPrf records here on every successful drain). A single + * linked device code object exposes one __llvm_profile_sections, but the same + * bounds may be seen via multiple agents, so each unique counter set is + * drained exactly once across both paths. */ +namespace { +struct ProfBoundsTuple { + const void *data; + const void *cnts; + const void *names; +}; +} // namespace + +#define PROF_MAX_SEEN_BOUNDS 256 +static ProfBoundsTuple SeenBounds[PROF_MAX_SEEN_BOUNDS]; +static int NumSeenBounds = 0; + +/* Pure check: has this bounds tuple already been drained? Does not mutate + * state, so a transient failure does not permanently suppress retries. */ +static int profBoundsAlreadyDrained(const void *D, const void *C, + const void *N) { + for (int i = 0; i < NumSeenBounds; ++i) + if (SeenBounds[i].data == D && SeenBounds[i].cnts == C && + SeenBounds[i].names == N) + return 1; + return 0; +} + +/* Record a drained bounds tuple. Idempotent. Called after a successful drain + * (either path) so a failed attempt stays retryable. */ +static void profRecordDrainedBounds(const void *D, const void *C, + const void *N) { + if (profBoundsAlreadyDrained(D, C, N)) + return; + if (NumSeenBounds < PROF_MAX_SEEN_BOUNDS) { + SeenBounds[NumSeenBounds].data = D; + SeenBounds[NumSeenBounds].cnts = C; + SeenBounds[NumSeenBounds].names = N; + NumSeenBounds++; + } +} + +#define PROF_MAX_GPU_AGENTS 64 + +namespace { +struct GpuAgent { + prof_hsa_agent_t agent; + char arch[64]; +}; + +struct WalkState { + GpuAgent agents[PROF_MAX_GPU_AGENTS]; + int num_agents; + int total_found; + int total_drained; +}; + +/* Per (agent, executable) symbol-iteration state. */ +struct SymbolState { + const char *arch; + int found; + int drained; +}; +} // namespace + +/* HSA per-symbol callback: when it finds a __llvm_profile_sections variable, + * drain it via processDeviceOffloadPrf() unless the host-shadow path (or an + * earlier agent) already handled the same bounds. */ +static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t, + prof_hsa_executable_symbol_t Sym, + void *Data) { + SymbolState *S = (SymbolState *)Data; + + prof_hsa_symbol_kind_t Kind; + if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &Kind) != + PROF_HSA_STATUS_SUCCESS || + Kind != PROF_HSA_SYMBOL_KIND_VARIABLE) + return PROF_HSA_STATUS_SUCCESS; + + uint32_t NameLen = 0; + if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH, + &NameLen) != PROF_HSA_STATUS_SUCCESS || + NameLen != sizeof(ProfileSectionsSymbol) - 1) + return PROF_HSA_STATUS_SUCCESS; + + char NameBuf[64]; + if (NameLen + 1 > sizeof(NameBuf)) + return PROF_HSA_STATUS_SUCCESS; + if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME, NameBuf) != + PROF_HSA_STATUS_SUCCESS) + return PROF_HSA_STATUS_SUCCESS; + NameBuf[NameLen] = '\0'; + + if (strcmp(NameBuf, ProfileSectionsSymbol) != 0) + return PROF_HSA_STATUS_SUCCESS; + + uint64_t Addr = 0; + if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS, + &Addr) != PROF_HSA_STATUS_SUCCESS || + Addr == 0) { + if (isVerboseMode()) + PROF_NOTE("%s", "failed to read __llvm_profile_sections address\n"); + return PROF_HSA_STATUS_SUCCESS; + } + + S->found++; + + // Read the bounds table first to dedup (and detect empty sections) before + // the full copy/relocate done by processDeviceOffloadPrf. + __llvm_profile_gpu_sections Sec; + if (memcpyDeviceToHost(&Sec, (void *)(uintptr_t)Addr, sizeof(Sec)) != 0) { + PROF_WARN("%s", "failed to copy device bounds table\n"); + return PROF_HSA_STATUS_SUCCESS; + } + if (profBoundsAlreadyDrained(Sec.DataStart, Sec.CountersStart, + Sec.NamesStart)) { + if (isVerboseMode()) + PROF_NOTE("%s", "device bounds already drained, skipping\n"); + return PROF_HSA_STATUS_SUCCESS; + } + + size_t DataBytes = (const char *)Sec.DataStop - (const char *)Sec.DataStart; + size_t CntsBytes = + (const char *)Sec.CountersStop - (const char *)Sec.CountersStart; + if (DataBytes == 0 || CntsBytes == 0) { + // Empty code object: nothing to write. Mark seen so we don't revisit it. + profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart); + return PROF_HSA_STATUS_SUCCESS; + } + + // Generate a collision-free target. Multiple distinct device code objects on + // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file. + static int DrainIndex = 0; + char Target[96]; + if (DrainIndex == 0) + snprintf(Target, sizeof(Target), "%s", S->arch); + else + snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex); + + // processDeviceOffloadPrf returns 0 on a successful write, -1 on error. + // Record the bounds (and advance the target index) only on success so a + // transient error stays retryable on a later agent or collect call. + int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr); + if (Rc == 0) { + S->drained++; + DrainIndex++; + profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart); + } + + return PROF_HSA_STATUS_SUCCESS; +} + +static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) { + prof_hsa_device_type_t DevType; + if (pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_DEVICE, &DevType) != + PROF_HSA_STATUS_SUCCESS || + DevType != PROF_HSA_DEVICE_TYPE_GPU) + return PROF_HSA_STATUS_SUCCESS; + + WalkState *W = (WalkState *)Data; + if (W->num_agents >= PROF_MAX_GPU_AGENTS) + return PROF_HSA_STATUS_SUCCESS; + + GpuAgent &GA = W->agents[W->num_agents++]; + GA.agent = Agent; + char Name[64]; + __builtin_memset(Name, 0, sizeof(Name)); + pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_NAME, Name); + size_t N = strnlen(Name, sizeof(GA.arch) - 1); + __builtin_memcpy(GA.arch, Name, N); + GA.arch[N] = '\0'; + if (!GA.arch[0]) + strncpy(GA.arch, "amdgpu", sizeof(GA.arch) - 1); + + if (isVerboseMode()) + PROF_NOTE("GPU agent %d: %s\n", W->num_agents - 1, GA.arch); + return PROF_HSA_STATUS_SUCCESS; +} + +/* Reentrancy guard and "drained data at least once" latch. The collect hook + * may run more than once (an explicit early __llvm_profile_write_file plus the + * exit write); a successful walk latches HsaDrainCompleted so we never re-emit + * duplicate .profraw files, while transient no-op outcomes ("runtime not yet + * loadable", "no GPU agents", "no loaded segments", "nothing instrumented") + * stay retryable so a later call can still pick up code objects loaded later. + * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library + * destructor) from corrupting the global SeenBounds table. Both flags use + * acquire/release atomics. */ +static int HsaDrainInProgress = 0; +static int HsaDrainCompleted = 0; + +static int drainDevicesViaHsa(void) { + if (__atomic_load_n(&HsaDrainCompleted, __ATOMIC_ACQUIRE)) return 0; - if (!isHipAvailable()) + int Expected = 0; + if (!__atomic_compare_exchange_n(&HsaDrainInProgress, &Expected, 1, + /*weak=*/0, __ATOMIC_ACQ_REL, + __ATOMIC_ACQUIRE)) return 0; - int Ret = 0; + struct InProgressGuard { + ~InProgressGuard() { + __atomic_store_n(&HsaDrainInProgress, 0, __ATOMIC_RELEASE); + } + } _Guard; - /* Shadow variables (static-linked kernels): drain from every device. */ - if (NumShadowVariables > 0) { - int OrigDevice = -1; - hipGetDevice(&OrigDevice); + if (loadHsaRuntimePointers() != 0) + return 0; /* Runtime unavailable: stay retryable. */ - for (int Dev = 0; Dev < NumDevices; ++Dev) { - if (!shouldCollectDevice(Dev)) { - if (isVerboseMode()) - PROF_NOTE("Skipping unused device %d\n", Dev); - continue; + WalkState W; + __builtin_memset(&W, 0, sizeof(W)); + prof_hsa_status_t St = pHsaIterateAgents(collectAgent, &W); + if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) { + PROF_WARN("hsa_iterate_agents failed (0x%x)\n", St); + return -1; + } + if (W.num_agents == 0) { + if (isVerboseMode()) + PROF_NOTE("%s", "no GPU agents present; nothing to drain (will retry)\n"); + return 0; + } + + /* query_segment_descriptors ships in every loader-extension version and is + * more permissive than iterate_executables on ROCm. It yields the loaded + * (agent, executable) pairs directly. */ + size_t NumSegs = 0; + St = pQuerySegDescs(nullptr, &NumSegs); + if (St != PROF_HSA_STATUS_SUCCESS) { + PROF_WARN("query_segment_descriptors(count) failed (0x%x)\n", St); + return -1; + } + if (NumSegs == 0) { + if (isVerboseMode()) + PROF_NOTE("%s", "no loaded segments; nothing to drain (will retry)\n"); + return 0; + } + + prof_hsa_loader_segment_descriptor_t *Segs = + (prof_hsa_loader_segment_descriptor_t *)calloc(NumSegs, sizeof(*Segs)); + if (!Segs) { + PROF_ERR("%s\n", "failed to allocate segment descriptor array"); + return -1; + } + UniqueFree SegsOwner(Segs); + + St = pQuerySegDescs(Segs, &NumSegs); + if (St != PROF_HSA_STATUS_SUCCESS) { + PROF_WARN("query_segment_descriptors(fetch) failed (0x%x)\n", St); + return -1; + } + + if (isVerboseMode()) + PROF_NOTE("query_segment_descriptors: %zu segments\n", NumSegs); + + /* Walk unique (agent, executable) pairs. */ + enum { kMaxPairs = 512 }; + uint64_t SeenAgents[kMaxPairs]; + uint64_t SeenExecs[kMaxPairs]; + int NumPairs = 0; + int IterFailures = 0; + + for (size_t i = 0; i < NumSegs; ++i) { + if (Segs[i].executable.handle == 0 || Segs[i].agent.handle == 0) + continue; + + int Seen = 0; + for (int j = 0; j < NumPairs; ++j) + if (SeenAgents[j] == Segs[i].agent.handle && + SeenExecs[j] == Segs[i].executable.handle) { + Seen = 1; + break; } - if (hipSetDevice(Dev) != 0) { - if (isVerboseMode()) - PROF_NOTE("Failed to set device %d, skipping\n", Dev); - continue; + if (Seen) + continue; + if (NumPairs < kMaxPairs) { + SeenAgents[NumPairs] = Segs[i].agent.handle; + SeenExecs[NumPairs] = Segs[i].executable.handle; + NumPairs++; + } + + const char *Arch = nullptr; + for (int k = 0; k < W.num_agents; ++k) + if (W.agents[k].agent.handle == Segs[i].agent.handle) { + Arch = W.agents[k].arch; + break; } - const char *ArchName = getDeviceArchName(Dev); - if (isVerboseMode()) - PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev, - ArchName); - for (int i = 0; i < NumShadowVariables; ++i) { - /* RDC-mode multi-shadow drains need a distinct profraw per TU; - * single-TU programs keep the bare arch target. */ - const char *Target = ArchName; - char TargetWithIdx[64]; - if (NumShadowVariables > 1) { - snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName, i); - Target = TargetWithIdx; + if (!Arch) + continue; /* not a GPU agent we collected */ + + SymbolState S; + __builtin_memset(&S, 0, sizeof(S)); + S.arch = Arch; + if (isVerboseMode()) + PROF_NOTE("walking executable 0x%llx on %s\n", + (unsigned long long)Segs[i].executable.handle, Arch); + prof_hsa_status_t IterSt = + pHsaExecIterAgentSyms(Segs[i].executable, Segs[i].agent, onSymbol, &S); + if (IterSt != PROF_HSA_STATUS_SUCCESS && + IterSt != PROF_HSA_STATUS_INFO_BREAK) { + PROF_WARN("hsa_executable_iterate_agent_symbols on executable 0x%llx " + "failed (0x%x)\n", + (unsigned long long)Segs[i].executable.handle, IterSt); + IterFailures++; + } + W.total_found += S.found; + W.total_drained += S.drained; + } + + if (isVerboseMode()) + PROF_NOTE("HSA walk complete: agents=%d pairs=%d found=%d drained=%d " + "iter-failures=%d\n", + W.num_agents, NumPairs, W.total_found, W.total_drained, + IterFailures); + + /* Latch only when we actually drained data. Deliberately do NOT latch the + * "walked everything but found nothing new" case: an early collect call can + * run before any kernel launch, and latching it would suppress the real + * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */ + if (W.total_drained > 0) + __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE); + return (IterFailures > 0) ? -1 : 0; +} + +/* NOTE: deliberately no library constructor that calls hsa_init() here. + * Bringing HSA up merely because the instrumented library was loaded poisons + * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep + * HIP/HSA uninitialized in the parent and only touch HIP inside forked + * children. A parent that has already hsa_init()'d makes those children crash + * inside HSA (HSA state is not valid across fork()). HSA is instead brought up + * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init + * rationale there. */ + +#endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */ + +/* -------------------------------------------------------------------------- */ +/* Collect device-side profile data */ +/* -------------------------------------------------------------------------- */ + +extern "C" int __llvm_profile_hip_collect_device_data(void) { + int Ret = 0; + + /* Host-shadow drain: static-linked kernels (host __hipRegisterVar shadows) + * and intercepted dynamic modules. Only meaningful when something registered + * host-side; skipped entirely for pure device-linked programs (RCCL), which + * the supplemental HSA pass below handles. */ + if ((NumShadowVariables != 0 || NumDynamicModules != 0) && isHipAvailable()) { + /* Shadow variables (static-linked kernels): drain from every device. */ + if (NumShadowVariables > 0) { + int OrigDevice = -1; + hipGetDevice(&OrigDevice); + + for (int Dev = 0; Dev < NumDevices; ++Dev) { + if (!shouldCollectDevice(Dev)) { + if (isVerboseMode()) + PROF_NOTE("Skipping unused device %d\n", Dev); + continue; + } +#if defined(__linux__) && !defined(_WIN32) + /* When no kernel launch was tracked at all, shouldCollectDevice() + * falls back to collect-all, which can fault/hang reading a + * non-resident device's sections on a multi-GPU host (e.g. a program + * that never launches, collects before its first launch, or launches + * only via an untracked API). On Linux the supplemental HSA drain + * below covers those cases safely -- it walks only code objects + * actually resident on each agent -- so skip the host-shadow pass + * entirely rather than take the unsafe fallback. */ + if (!__atomic_load_n(&AnyDeviceUsed, __ATOMIC_ACQUIRE)) { + if (isVerboseMode()) + PROF_NOTE("No tracked launch; deferring device %d to HSA drain\n", + Dev); + continue; + } +#endif + if (hipSetDevice(Dev) != 0) { + if (isVerboseMode()) + PROF_NOTE("Failed to set device %d, skipping\n", Dev); + continue; + } + const char *ArchName = getDeviceArchName(Dev); + if (isVerboseMode()) + PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev, + ArchName); + for (int i = 0; i < NumShadowVariables; ++i) { + /* RDC-mode multi-shadow drains need a distinct profraw per TU; + * single-TU programs keep the bare arch target. */ + const char *Target = ArchName; + char TargetWithIdx[64]; + if (NumShadowVariables > 1) { + snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName, + i); + Target = TargetWithIdx; + } + if (processShadowVariable(i, Target) != 0) + Ret = -1; } - if (processShadowVariable(i, Target) != 0) - Ret = -1; } - } - if (OrigDevice >= 0) - hipSetDevice(OrigDevice); - } + if (OrigDevice >= 0) + hipSetDevice(OrigDevice); + } - /* Warn about unprocessed TUs; skip cleared slots (already drained). */ - lockDynamicModules(); - for (int i = 0; i < NumDynamicModules; ++i) { - OffloadDynamicModuleInfo *MI = &DynamicModules[i]; - if (!MI->ModulePtr) - continue; - for (int t = 0; t < MI->NumTUs; ++t) { - if (!MI->TUs[t].Processed) { - PROF_WARN("dynamic module %p TU %d was not processed before exit\n", - MI->ModulePtr, t); - Ret = -1; + /* Warn about unprocessed TUs; skip cleared slots (already drained). */ + lockDynamicModules(); + for (int i = 0; i < NumDynamicModules; ++i) { + OffloadDynamicModuleInfo *MI = &DynamicModules[i]; + if (!MI->ModulePtr) + continue; + for (int t = 0; t < MI->NumTUs; ++t) { + if (!MI->TUs[t].Processed) { + PROF_WARN("dynamic module %p TU %d was not processed before exit\n", + MI->ModulePtr, t); + Ret = -1; + } } } + unlockDynamicModules(); } - unlockDynamicModules(); + +#if defined(__linux__) && !defined(_WIN32) + /* Supplemental HSA-introspection drain: catches device code objects with no + * host-side shadow (e.g. RCCL device-linked kernels). Runs after the + * host-shadow drain so already-drained sections are deduped out, and runs + * even when there are no host shadows at all (the common RCCL case). */ + if (drainDevicesViaHsa() != 0) + Ret = -1; +#endif if (Ret != 0) PROF_WARN("%s\n", "failed to collect device profile data"); diff --git a/compiler-rt/test/profile/AMDGPU/device-basic.hip b/compiler-rt/test/profile/AMDGPU/device-basic.hip new file mode 100644 index 0000000000000..4fcf044802240 --- /dev/null +++ b/compiler-rt/test/profile/AMDGPU/device-basic.hip @@ -0,0 +1,67 @@ +// Basic HIP device PGO drain end-to-end: a host + device .profraw are written +// at exit (the device one arch-prefixed), they merge, the merged profile +// contains the device kernel's counters, and llvm-cov reports device-side +// coverage. Covers both non-RDC and RDC device compiles. +// +// REQUIRES: hip, amdgpu + +// RUN: rm -rf %t.dir && mkdir -p %t.dir + +// --- non-RDC --- +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: %t.dir/a.out +// A device profraw (arch-prefixed) must have been drained alongside the host one. +// RUN: ls %t.dir/gfx*.profraw +// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata +// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \ +// RUN: | FileCheck --check-prefix=FUNCS %s +// Confirm the embedded device image is extractable (failure here is the real +// cause of any downstream llvm-cov failure, so let it propagate). +// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null +// RUN: llvm-cov report %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \ +// RUN: -instr-profile=%t.dir/a.profdata 2>&1 | FileCheck --check-prefix=COV %s + +// --- RDC --- +// RUN: rm -f %t.dir/*.profraw +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: %t.dir/b.out +// RUN: ls %t.dir/gfx*.profraw +// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/b.profdata +// RUN: llvm-profdata show --all-functions %t.dir/b.profdata \ +// RUN: | FileCheck --check-prefix=FUNCS %s + +#include <hip/hip_runtime.h> + +__global__ void addk(int *p) { + if (*p > 0) + *p += 1; + else + *p -= 1; +} + +int main() { + int *d = nullptr; + if (hipMalloc(&d, sizeof(int)) != hipSuccess) + return 2; + int h = 5; + (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice); + addk<<<1, 1>>>(d); + (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + return h > 0 ? 0 : 1; +} + +// The merged profile contains both the host main and the device kernel, +// proving the device counters were drained and merged. +// FUNCS-DAG: addk +// FUNCS-DAG: main + +// COV: TOTAL diff --git a/compiler-rt/test/profile/AMDGPU/device-early-collect.hip b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip new file mode 100644 index 0000000000000..3e2c6e84e26c2 --- /dev/null +++ b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip @@ -0,0 +1,68 @@ +// M1 regression: calling __llvm_profile_hip_collect_device_data() before any +// kernel has been launched must not poison the later atexit drain. The early +// call sees "no instrumented code object loaded yet" (a transient no-op) and +// must not latch the drain as completed; otherwise the post-launch atexit +// pass produces no device .profraw and we silently lose device counters. +// +// REQUIRES: hip, amdgpu +// Guards the Linux introspection drain's DrainCompleted latch; the Windows +// host-shadow drain has no such latch (it tracks per-TU Processed flags). +// UNSUPPORTED: windows + +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: %t.dir/a.out +// Both the host profraw and at least one device profraw (gfx-prefixed) must +// have been produced, despite the early collection attempt. +// RUN: ls %t.dir/host.*.profraw +// RUN: ls %t.dir/gfx*.profraw +// And the merged profile must contain the device kernel that was launched +// *after* the early collect. +// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata +// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \ +// RUN: | FileCheck %s + +#include <hip/hip_runtime.h> + +// Declared by libclang_rt.profile-<host arch>.a; we call it directly to +// simulate any caller that drains device counters at an arbitrary point in +// the program lifetime (e.g. a per-iteration profile dump). +extern "C" int __llvm_profile_hip_collect_device_data(void); + +__global__ void post_collect_kernel(int *p) { + if (*p > 0) + *p += 1; + else + *p -= 1; +} + +int main() { + // (1) Early collection -- runs before any kernel launch. The drainer + // finds either no GPU agents, no loaded segments, or no instrumented + // bounds table, and returns 0 without latching DrainCompleted. + (void)__llvm_profile_hip_collect_device_data(); + + // (2) Now launch a kernel. HIP loads the device code object that carries + // the __llvm_profile_sections bounds table, executes our kernel, and + // populates the device-side counters. + int *d = nullptr; + if (hipMalloc(&d, sizeof(int)) != hipSuccess) + return 2; + int h = 5; + (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice); + post_collect_kernel<<<1, 1>>>(d); + (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + // (3) Exit normally. The atexit drain runs and -- because step (1) did + // not latch DrainCompleted -- it walks the (now loaded) code object, + // finds __llvm_profile_sections, and emits the device .profraw. + return h > 0 ? 0 : 1; +} + +// CHECK-DAG: post_collect_kernel +// CHECK-DAG: main diff --git a/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip new file mode 100644 index 0000000000000..a154308d725d8 --- /dev/null +++ b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip @@ -0,0 +1,44 @@ +// Independence / robustness: an instrumented HIP program that never launches a +// kernel still writes its host .profraw, and the device drain is a clean no-op +// (no crash, no spurious device .profraw). We assert the no-op condition +// directly via the runtime's verbose log rather than rely on HIP lazy-loading +// to leave the device code object unloaded -- the loader may load it for +// other reasons (e.g. eager registration), and in that case the drain +// legitimately walks it and reports zero instrumented sections / zero +// drained. Either outcome is correct. +// +// REQUIRES: hip, amdgpu +// The terminal conditions checked below ("no GPU agents", "no loaded +// segments", "drained=0") are Linux HSA-drain strings with no Windows analog. +// UNSUPPORTED: windows + +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: %clang -x hip --offload-arch=%amdgpu_arch \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: LLVM_PROFILE_VERBOSE=1 \ +// RUN: %t.dir/a.out 2> %t.dir/verbose.log +// RUN: ls %t.dir/host.*.profraw +// No arch-prefixed device profraw should have been produced. +// RUN: not ls %t.dir/gfx*.profraw +// The drain must have run; one of these three terminal conditions must hold: +// - no GPU agents enumerated (test host has /dev/kfd but no usable agent) +// - no loaded code object segments at exit +// - the walk completed and drained=0 (no instrumented kernel was launched +// so the device code object either wasn't loaded or its bounds were +// empty/already drained) +// RUN: FileCheck --input-file=%t.dir/verbose.log %s +// CHECK: {{no GPU agents present|no loaded segments|drained=0}} + +#include <hip/hip_runtime.h> + +// Defined but never launched. +__global__ void unused(int *p) { *p += 1; } + +int main() { + int n = 0; + (void)hipGetDeviceCount(&n); + return 0; +} diff --git a/compiler-rt/test/profile/AMDGPU/device-symbols.hip b/compiler-rt/test/profile/AMDGPU/device-symbols.hip new file mode 100644 index 0000000000000..f12283b7da636 --- /dev/null +++ b/compiler-rt/test/profile/AMDGPU/device-symbols.hip @@ -0,0 +1,42 @@ +// The decoupled drain reads only the canonical __llvm_profile_sections bounds +// table provided by the device profile runtime (InstrProfilingPlatformGPU.c), +// since clang no longer emits a per-TU struct. Assert that symbol is present +// in the device ELF's dynamic symbol table (protected visibility) for both +// non-RDC and RDC device compiles. This is the contract the drainer depends on. +// +// REQUIRES: hip, amdgpu + +// RUN: rm -rf %t.dir && mkdir -p %t.dir + +// --- non-RDC --- +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// Extraction failure here would make the readelf invocation succeed against +// an empty/missing file; surface it instead of hiding it behind `|| true`. +// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null +// RUN: llvm-readelf --dyn-syms %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \ +// RUN: | FileCheck %s + +// --- RDC --- +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: llvm-objdump --offloading %t.dir/b.out > /dev/null +// RUN: llvm-readelf --dyn-syms %t.dir/b.out.0.hip-amdgcn-amd-amdhsa--*gfx* \ +// RUN: | FileCheck %s + +// CHECK: PROTECTED {{.*}} __llvm_profile_sections + +#include <hip/hip_runtime.h> + +__global__ void k(int *p) { *p += 1; } + +int main() { + int *d = nullptr; + if (hipMalloc(&d, sizeof(int)) != hipSuccess) + return 2; + k<<<1, 1>>>(d); + (void)hipFree(d); + return 0; +} diff --git a/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py new file mode 100644 index 0000000000000..5148dd6b9e2f2 --- /dev/null +++ b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py @@ -0,0 +1,4 @@ +# Device-profile drain tests: require an AMD GPU (and, implicitly, the amdgcn +# device profile runtime in the resource directory and a ROCm/HIP install). +if "amdgpu" not in config.available_features: + config.unsupported = True diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip new file mode 100644 index 0000000000000..8cbe7c970052c --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip @@ -0,0 +1,51 @@ +// Test basic HIP PGO instrumentation and profile collection. +// +// REQUIRES: hip, amdgpu +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %run %t +// RUN: ls %t.dir/prof.profraw +// RUN: llvm-profdata merge -o %t.profdata %t.dir/ +// RUN: llvm-profdata show --all-functions %t.profdata \ +// RUN: | FileCheck %s --check-prefix=PROF +// +// PROF: _Z6squarePiPKii +// PROF: main +// PROF: Functions shown: 2 +// PROF: Total functions: 2 + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void square(int *out, const int *in, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) + out[idx] = in[idx] * in[idx]; +} + +int main() { + constexpr int N = 64; + int h_in[N], h_out[N]; + for (int i = 0; i < N; ++i) h_in[i] = i; + + int *d_in, *d_out; + (void)hipMalloc(&d_in, N * sizeof(int)); + (void)hipMalloc(&d_out, N * sizeof(int)); + (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice); + + square<<<1, N>>>(d_out, d_in, N); + + (void)hipMemcpy(h_out, d_out, N * sizeof(int), hipMemcpyDeviceToHost); + + int ok = 1; + for (int i = 0; i < N; ++i) + if (h_out[i] != i * i) ok = 0; + + printf("%s\n", ok ? "PASS" : "FAIL"); + (void)hipFree(d_in); + (void)hipFree(d_out); + return !ok; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip new file mode 100644 index 0000000000000..5a2393f8dcc47 --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip @@ -0,0 +1,63 @@ +// Explicit-collect idempotency: a program that calls +// __llvm_profile_hip_collect_device_data() itself *after* a launch (e.g. a +// periodic profile dump) and then also exits normally must not double-count the +// device counters. The explicit drain and the atexit drain write the same +// arch-named device profraw, so the merged profile must reflect a single launch +// (function count 64, even-branch 32), not two. Complements device-early-collect +// (which covers a collect *before* the first launch). +// +// REQUIRES: hip, amdgpu +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out +// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata +// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \ +// RUN: | FileCheck %s +// +// A single launch of 64 threads, drained twice (explicit + atexit), must still +// merge to exactly one launch's worth of counts. +// CHECK: _Z8classifyPii: +// CHECK: Function count: 64 +// CHECK: Block counts: [0, 32] + +#include <hip/hip_runtime.h> +#include <cstdio> + +extern "C" int __llvm_profile_hip_collect_device_data(void); + +__global__ void classify(int *out, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) + return; + if (idx % 2 == 0) + out[idx] = 1; + else + out[idx] = 0; +} + +int main() { + constexpr int N = 64; + int *d = nullptr; + if (hipMalloc(&d, N * sizeof(int)) != hipSuccess) + return 2; + classify<<<1, N>>>(d, N); + (void)hipDeviceSynchronize(); + + // Explicit mid-program collect after the launch. The atexit drain runs again + // at exit; together they must not double the device counters. + (void)__llvm_profile_hip_collect_device_data(); + + int h[N]; + (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int evens = 0; + for (int i = 0; i < N; ++i) + evens += h[i]; + printf("%s\n", evens == 32 ? "PASS" : "FAIL"); + return evens == 32 ? 0 : 1; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip new file mode 100644 index 0000000000000..c2bfc9ac9dc66 --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip @@ -0,0 +1,56 @@ +// Quantitative device-counter correctness: the drained device profile must carry +// the *exact* per-region execution counts produced by the kernel, not merely +// "some counts are present". A single launch of 64 threads over classify() must +// record a function entry count of 64 and an even-branch block count of 32 +// (idx % 2 == 0 holds for exactly half of idx in [0, 64)). This pins the drain + +// dedup path against silent under/over-counting (e.g. a dedup bug that dropped +// or doubled a section would change these numbers). +// +// REQUIRES: hip, amdgpu +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out +// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata +// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \ +// RUN: | FileCheck %s +// +// The device kernel ran with exactly 64 threads, all of which entered the +// function; the even branch was taken 32 times and the early-return path 0. +// CHECK: _Z8classifyPii: +// CHECK: Function count: 64 +// CHECK: Block counts: [0, 32] + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void classify(int *out, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) + return; + if (idx % 2 == 0) + out[idx] = 1; + else + out[idx] = 0; +} + +int main() { + constexpr int N = 64; + int *d = nullptr; + if (hipMalloc(&d, N * sizeof(int)) != hipSuccess) + return 2; + classify<<<1, N>>>(d, N); + (void)hipDeviceSynchronize(); + int h[N]; + (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int evens = 0; + for (int i = 0; i < N; ++i) + evens += h[i]; + printf("%s\n", evens == 32 ? "PASS" : "FAIL"); + return evens == 32 ? 0 : 1; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip new file mode 100644 index 0000000000000..a867c30f0edfb --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip @@ -0,0 +1,51 @@ +// Test HIP coverage mapping produces source-level coverage for host code. +// +// REQUIRES: hip, amdgpu +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %run %t +// RUN: llvm-profdata merge -o %t.profdata %t.dir/ +// RUN: llvm-cov report %t -instr-profile=%t.profdata 2>&1 \ +// RUN: | FileCheck %s --check-prefix=REPORT +// +// REPORT: instrprof-hip-coverage.hip +// No coverage column should be fully uncovered. Anchor on a non-digit before +// the "0.00%" so this does not spuriously match e.g. "80.00%". +// REPORT-NOT: {{[^.0-9]0[.]00%}} + +#include <hip/hip_runtime.h> +#include <cstdio> + +__device__ int gpu_abs(int x) { + return x < 0 ? -x : x; +} + +__global__ void abs_kernel(int *data, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) + data[idx] = gpu_abs(data[idx]); +} + +int main() { + constexpr int N = 16; + int h[N]; + for (int i = 0; i < N; ++i) + h[i] = (i % 2 == 0) ? i : -i; + + int *d; + (void)hipMalloc(&d, N * sizeof(int)); + (void)hipMemcpy(d, h, N * sizeof(int), hipMemcpyHostToDevice); + abs_kernel<<<1, N>>>(d, N); + (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int ok = 1; + for (int i = 0; i < N; ++i) + if (h[i] != i) ok = 0; + + printf("%s\n", ok ? "PASS" : "FAIL"); + return !ok; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip new file mode 100644 index 0000000000000..a24b28ec9af0a --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip @@ -0,0 +1,67 @@ +// Test that device-side branching is captured in profile counters. +// Exercises the classify-style pattern where different branches are taken +// by different threads, verifying that counter values reflect actual execution. +// +// REQUIRES: hip, amdgpu +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %run %t +// RUN: llvm-profdata merge -o %t.profdata %t.dir/ +// RUN: llvm-profdata show --all-functions %t.profdata \ +// RUN: | FileCheck %s --check-prefix=PROF +// +// Device functions should appear with non-zero counters. The __device__ +// classify() helper is inlined into the histogram kernel, so it does not get a +// separate profile record; its branching is captured within the kernel's +// counters instead. +// PROF-DAG: _Z9histogramPKiPii +// PROF-DAG: main +// PROF: Total functions: 2 +// PROF: Maximum function count: {{[1-9][0-9]*}} + +#include <hip/hip_runtime.h> +#include <cstdio> + +__device__ int classify(int x) { + if (x > 100) return 2; + else if (x > 0) return 1; + else return 0; +} + +__global__ void histogram(const int *input, int *bins, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) { + int cls = classify(input[idx]); + atomicAdd(&bins[cls], 1); + } +} + +int main() { + constexpr int N = 256; + constexpr int NBINS = 3; + + int h_in[N], h_bins[NBINS] = {}; + for (int i = 0; i < N; ++i) + h_in[i] = (i % 3 == 0) ? -1 : (i % 3 == 1) ? 50 : 200; + + int *d_in, *d_bins; + (void)hipMalloc(&d_in, N * sizeof(int)); + (void)hipMalloc(&d_bins, NBINS * sizeof(int)); + (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice); + (void)hipMemset(d_bins, 0, NBINS * sizeof(int)); + + histogram<<<1, N>>>(d_in, d_bins, N); + + (void)hipMemcpy(h_bins, d_bins, NBINS * sizeof(int), hipMemcpyDeviceToHost); + printf("bins: [%d, %d, %d]\n", h_bins[0], h_bins[1], h_bins[2]); + + int ok = (h_bins[0] > 0 && h_bins[1] > 0 && h_bins[2] > 0); + printf("%s\n", ok ? "PASS" : "FAIL"); + + (void)hipFree(d_in); + (void)hipFree(d_bins); + return !ok; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip new file mode 100644 index 0000000000000..c79cf568f88bc --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip @@ -0,0 +1,61 @@ +// Fork safety: loading the profile-instrumented library must NOT initialize +// HSA in a process that itself never touches HIP and only runs device work in +// forked children. RCCL's unit tests follow exactly this pattern -- the parent +// deliberately keeps HIP/HSA uninitialized and launches kernels only inside +// forked children. If the profile runtime's library constructor eagerly called +// hsa_init(), the child would inherit invalid HSA state across fork() and crash +// inside HSA (e.g. hsa_amd_signal_create -> SharedSignalPool::alloc). The HSA +// drain therefore brings HSA up lazily, never from a constructor. +// +// REQUIRES: hip, amdgpu +// The eager-hsa_init fork hazard and the lazy HSA drain are Linux-only. +// UNSUPPORTED: windows +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: %run %t 2>&1 | FileCheck %s +// +// The forked child must complete its kernel without crashing in HSA. +// CHECK: PASS + +#include <hip/hip_runtime.h> +#include <cstdio> +#include <sys/wait.h> +#include <unistd.h> + +__global__ void increment(int *p) { *p += 1; } + +static int doChildWork() { + int *d = nullptr; + if (hipMalloc(&d, sizeof(int)) != hipSuccess) + return 1; + int h = 41; + (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice); + increment<<<1, 1>>>(d); + if (hipDeviceSynchronize() != hipSuccess) + return 1; + (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + return h == 42 ? 0 : 1; +} + +int main() { + // The parent intentionally performs no HIP/HSA work before forking. + pid_t pid = fork(); + if (pid < 0) { + printf("FAIL (fork failed)\n"); + return 1; + } + if (pid == 0) { + // Child runs the device work; _exit avoids flushing the parent's profile + // handlers from the child (the RCCL test pattern). + _exit(doChildWork()); + } + + int status = 0; + (void)waitpid(pid, &status, 0); + int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0; + printf("%s\n", ok ? "PASS" : "FAIL"); + return ok ? 0 : 1; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip new file mode 100644 index 0000000000000..6a99546d34bdb --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip @@ -0,0 +1,57 @@ +// Test that HIP PGO works on multi-GPU systems. The kernel runs on the default +// device, so the host-shadow drain (guarded by upstream's launch tracking) +// collects only that device and the supplemental HSA agent-walk then finds the +// same code object and dedups it out. The point of the test is that neither +// pass crashes or hangs reading a non-resident device on a host with several +// GPUs (the failure mode that the launch tracking + HSA residency walk fix). +// +// REQUIRES: hip, amdgpu +// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the +// Windows host-shadow drain collects only the current device. +// UNSUPPORTED: windows +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s +// +// The host-shadow pass drains the launched device, the HSA walk finds that same +// code object and dedups it (drained=0), and the program does not crash. +// CHECK: Copied device sections: +// CHECK: device bounds already drained, skipping +// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}} +// CHECK: PASS + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void add_one(int *data, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) + data[idx] += 1; +} + +int main() { + int ndev = 0; + (void)hipGetDeviceCount(&ndev); + + constexpr int N = 32; + int h_data[N]; + for (int i = 0; i < N; ++i) h_data[i] = i; + + int *d_data; + (void)hipMalloc(&d_data, N * sizeof(int)); + (void)hipMemcpy(d_data, h_data, N * sizeof(int), hipMemcpyHostToDevice); + + add_one<<<1, N>>>(d_data, N); + + (void)hipMemcpy(h_data, d_data, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d_data); + + int ok = 1; + for (int i = 0; i < N; ++i) + if (h_data[i] != i + 1) ok = 0; + + printf("%s (devices=%d)\n", ok ? "PASS" : "FAIL", ndev); + return !ok; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip new file mode 100644 index 0000000000000..8cf1258a63535 --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip @@ -0,0 +1,63 @@ +// Multi-process accumulation: device counters from several independent runs of an +// instrumented HIP program must accumulate when their profraws are offline-merged +// (llvm-profdata merge), the common real-world feedback-collection path. Three +// runs of a 64-thread launch must sum to a function count of 192 and an +// even-branch block count of 96 (3 x 64 / 3 x 32). +// +// Note: on-the-fly merge-pooling via LLVM_PROFILE_FILE=...%m does NOT currently +// accumulate *device* counters -- the device profraw is rewritten per process +// rather than merged in place -- so each process must write a distinct file +// (here via %p) and the accumulation is done by llvm-profdata merge. +// +// REQUIRES: hip, amdgpu +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \ +// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \ +// RUN: -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.dir/run1.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out +// RUN: env LLVM_PROFILE_FILE=%t.dir/run2.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out +// RUN: env LLVM_PROFILE_FILE=%t.dir/run3.%%p.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out +// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata +// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \ +// RUN: | FileCheck %s +// +// CHECK: _Z8classifyPii: +// CHECK: Function count: 192 +// CHECK: Block counts: [0, 96] + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void classify(int *out, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) + return; + if (idx % 2 == 0) + out[idx] = 1; + else + out[idx] = 0; +} + +int main() { + constexpr int N = 64; + int *d = nullptr; + if (hipMalloc(&d, N * sizeof(int)) != hipSuccess) + return 2; + classify<<<1, N>>>(d, N); + (void)hipDeviceSynchronize(); + int h[N]; + (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int evens = 0; + for (int i = 0; i < N; ++i) + evens += h[i]; + printf("%s\n", evens == 32 ? "PASS" : "FAIL"); + return evens == 32 ? 0 : 1; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip new file mode 100644 index 0000000000000..0fd6185b82441 --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip @@ -0,0 +1,58 @@ +// Test PGO with multiple kernel launches from a single TU. +// Verifies that counters from all device functions are collected correctly. +// +// REQUIRES: hip, amdgpu +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %run %t +// RUN: llvm-profdata merge -o %t.profdata %t.dir/ +// RUN: llvm-profdata show --all-functions %t.profdata \ +// RUN: | FileCheck %s --check-prefix=PROF +// +// All three kernels plus main should be profiled. +// PROF-DAG: _Z4fillPii +// PROF-DAG: _Z5scalePii +// PROF-DAG: _Z6negatePii +// PROF-DAG: main +// PROF: Total functions: 4 + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void fill(int *data, int val) { + data[threadIdx.x] = val; +} + +__global__ void scale(int *data, int factor) { + data[threadIdx.x] *= factor; +} + +__global__ void negate(int *data, int n) { + int idx = threadIdx.x; + if (idx < n) + data[idx] = -data[idx]; +} + +int main() { + constexpr int N = 16; + int h[N]; + int *d; + (void)hipMalloc(&d, N * sizeof(int)); + + fill<<<1, N>>>(d, 5); + scale<<<1, N>>>(d, 3); + negate<<<1, N>>>(d, N); + + (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int ok = 1; + for (int i = 0; i < N; ++i) + if (h[i] != -15) ok = 0; + + printf("%s\n", ok ? "PASS" : "FAIL"); + return !ok; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip new file mode 100644 index 0000000000000..5d3dea671047b --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip @@ -0,0 +1,60 @@ +// Test PGO when the kernel runs on a non-default device (here the program +// selects device 1). Upstream's launch tracking records that device 1 was used, +// so the host-shadow drain skips the other devices and collects device 1, and +// the supplemental HSA agent-walk then finds that same code object and dedups it +// out. This exercises both that the correct device is drained and that an +// unused device is never read (which would fault/hang on a multi-GPU host). +// +// REQUIRES: hip, amdgpu, multi-device +// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the +// Windows host-shadow drain only collects the current device. +// UNSUPPORTED: windows +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64 +// RUN: env LLVM_PROFILE_FILE=%t.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s +// +// The launched device (1) is drained, the unused default device is skipped, and +// the HSA walk finds the same code object and dedups it (drained=0). +// CHECK: Skipping unused device 0 +// CHECK: Collecting static profile data from device 1 +// CHECK: Copied device sections: +// CHECK: device bounds already drained, skipping +// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}} +// CHECK: PASS + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void fill(int *data, int val, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) + data[idx] = val; +} + +int main() { + int ndev = 0; + (void)hipGetDeviceCount(&ndev); + if (ndev < 2) { + printf("PASS (skipped: only %d device)\n", ndev); + return 0; + } + + (void)hipSetDevice(1); + + constexpr int N = 32; + int h[N] = {}; + int *d; + (void)hipMalloc(&d, N * sizeof(int)); + fill<<<1, N>>>(d, 99, N); + (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int ok = 1; + for (int i = 0; i < N; ++i) + if (h[i] != 99) ok = 0; + + printf("%s\n", ok ? "PASS" : "FAIL"); + return !ok; +} diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip new file mode 100644 index 0000000000000..9a8a8187f8e77 --- /dev/null +++ b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip @@ -0,0 +1,63 @@ +// Test the full PGO cycle: instrument, collect, merge, optimize. +// Verifies that the optimized binary produces correct output and that +// profile data is consumed without errors. +// +// REQUIRES: hip, amdgpu +// +// Step 1: Build instrumented binary. +// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t.instr \ +// RUN: -L%hip_lib_path -lamdhip64 +// +// Step 2: Run to collect profile data. +// RUN: rm -rf %t.dir && mkdir -p %t.dir +// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \ +// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %run %t.instr 2>&1 | FileCheck %s +// +// Step 3: Merge profile data. +// RUN: llvm-profdata merge -o %t.profdata %t.dir/ +// +// Step 4: Build optimized binary with profile data. +// RUN: %clang -x hip -fprofile-instr-use=%t.profdata \ +// RUN: --offload-arch=%amdgpu_arch %s -o %t.opt \ +// RUN: -L%hip_lib_path -lamdhip64 -O2 +// +// Step 5: Run optimized binary. +// RUN: env LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \ +// RUN: HIP_VISIBLE_DEVICES=0 %run %t.opt 2>&1 | FileCheck %s +// +// CHECK: PASS + +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void scale(float *data, float factor, int n) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < n) + data[idx] *= factor; +} + +int main() { + constexpr int N = 128; + float h[N]; + for (int i = 0; i < N; ++i) h[i] = (float)i; + + float *d; + (void)hipMalloc(&d, N * sizeof(float)); + (void)hipMemcpy(d, h, N * sizeof(float), hipMemcpyHostToDevice); + + scale<<<1, N>>>(d, 2.0f, N); + + (void)hipMemcpy(h, d, N * sizeof(float), hipMemcpyDeviceToHost); + (void)hipFree(d); + + int ok = 1; + for (int i = 0; i < N; ++i) { + float expected = (float)(i * 2); + if (h[i] != expected) ok = 0; + } + + printf("%s\n", ok ? "PASS" : "FAIL"); + return !ok; +} diff --git a/compiler-rt/test/profile/device-pgo/README.md b/compiler-rt/test/profile/device-pgo/README.md new file mode 100644 index 0000000000000..4338c637abe49 --- /dev/null +++ b/compiler-rt/test/profile/device-pgo/README.md @@ -0,0 +1,125 @@ +# HIP device PGO / code coverage: standalone build & test recipe + +This directory provides a CMake-based recipe to build and exercise HIP device +profile-guided optimization (PGO) and source-based code coverage **outside +TheRock**, using only an `llvm-project` checkout plus a ROCm runtime. + +It builds, in one configure: + +- the host toolchain (`clang`, `clang++`, `lld`, `llvm-profdata`, `llvm-cov`) + and the lit-lite test utilities (`FileCheck`, `not`); +- the host ROCm drain runtime `clang_rt.profile_rocm` (opt-in, + `COMPILER_RT_BUILD_PROFILE_ROCM=ON`). It runs the upstream host-shadow drain on + all platforms; on **Linux** `InstrProfilingPlatformROCm.cpp` additionally runs a + supplemental HSA-introspection pass (with content-dedup) to collect device code + objects that have no host shadow (e.g. device-linked/RCCL kernels); +- the **amdgcn device** profile runtime `libclang_rt.profile.a` (the baremetal + profile subset that provides `__llvm_profile_instrument_gpu` and the + `__llvm_profile_sections` bounds table), built for the `amdgcn-amd-amdhsa` + runtime target with LLVM libc for amdgcn. + +## Why a separate library + +Upstream relands HIP offload PGO runtime support as the **opt-in** +`clang_rt.profile_rocm` (llvm#201606), a `/MD` superset of `clang_rt.profile`; +the base library stays unchanged. The driver links `clang_rt.profile_rocm` +ahead of `clang_rt.profile` for HIP host links when profiling is requested +(see `clang/lib/Driver/ToolChains/{Linux,MSVC}.cpp`). This recipe just turns +the option on and builds the matching amdgcn device runtime. + +## Prerequisites + +- A ROCm installation (for `libamdhip64` and, on Linux, `libhsa-runtime64`), + e.g. `/opt/rocm`. Export `ROCM_PATH`. +- An AMD GPU visible to the runtime for the *run* step (the build step does + not need a GPU). `amdgpu-arch` should list your device(s). +- Ninja, a host C/C++ compiler, and Python 3. + +## Build + +```bash +export ROCM_PATH=/opt/rocm +./build.sh # builds into <repo>/build/device-pgo +# or: ./build.sh /path/to/builddir +``` + +Key outputs under the build dir: + +``` +bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not} +lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a +lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a +``` + +See `toolchain-cache.cmake` for the exact CMake variables, including the +`LLVM_RUNTIME_TARGETS="default;amdgcn-amd-amdhsa"` split. + +## Run the tests + +The lit-lite runner (`../run_gpu_tests.py`) compiles each `.hip` test with the +just-built toolchain, runs it on the GPU, and pipes output through `FileCheck`. +It auto-detects features (`multi-device` via `amdgpu-arch`) so tests that need +two visible GPUs are skipped on single-GPU hosts. + +```bash +python3 ../run_gpu_tests.py \ + --toolchain-bin "$PWD/<builddir>/bin" \ + --hip-lib-path "$ROCM_PATH/lib" \ + ../GPU ../AMDGPU +``` + +`--toolchain-bin` must be an **absolute** path (the runner executes each RUN +line from a temp directory). With the toolchain's `amdgpu-arch`/`offload-arch` +on hand, `--offload-arch=native` resolves automatically and the `multi-device` +feature is enabled when 2+ GPUs are visible (so multi-GPU tests run on a +multi-GPU host and are skipped otherwise). On a multi-gfx90a host this suite is +15 passed, 0 failed. + +### Coverage notes / known gaps + +- Quantitative device-counter correctness (`instrprof-hip-counter-correctness`), + multi-process offline accumulation (`instrprof-hip-multi-process-merge`) and + explicit-collect idempotency (`instrprof-hip-collect-after`) pin exact device + counter values, so a dedup or drain regression that drops/doubles a section is + caught. +- `LLVM_PROFILE_FILE=...%m` on-the-fly merge-pooling does **not** accumulate + *device* counters today (the device profraw is rewritten per process rather + than merged in place); multi-process accumulation must go through + `llvm-profdata merge` of distinct per-process files. +- There is no in-tree test that drains a code object with **no** host shadow in + isolation (the pure device-linked/RCCL case the HSA pass uniquely handles): it + requires a real device-side library build (the profile runtime linked into the + device image), which is not expressible in the lit-lite harness via the clang + driver. The dedup tests do prove the HSA pass finds and dedups the same code + objects the host-shadow pass drains; validating the no-host-shadow drain needs + an actual RCCL-style binary in downstream CI. + +## Manual workflow (for reference) + +```bash +CLANG=<builddir>/bin/clang++ +# 1. Instrumented build (host + device). +$CLANG -O2 -fprofile-instr-generate -fcoverage-mapping \ + --offload-arch=gfx1100 -xhip app.hip -o app + +# 2. Run. Produces a host .profraw and a device +# <name>.amdgcn-amd-amdhsa.<arch>.profraw drained by clang_rt.profile_rocm. +LLVM_PROFILE_FILE='app-%p.profraw' ./app + +# 3. Merge (device profiles are merged per GPU arch). +<builddir>/bin/llvm-profdata merge -o app.profdata app-*.profraw + +# 4. Coverage report (device). +<builddir>/bin/llvm-cov show ./app -instr-profile=app.profdata +``` + +## Notes / environment-specific knobs + +- `--offload-arch` must match your GPU; the amdgcn device runtime is target + generic but the app's device code is per-arch. The build installs + `offload-arch` (and the `amdgpu-arch` alias) into `<builddir>/bin`, so + `--offload-arch=native` works without a system ROCm `amdgpu-arch`. +- The amdgcn runtime target requires LLVM libc for amdgcn; if your environment + cannot build it, drop `libc` from + `RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES` only if your headers are + otherwise provided. diff --git a/compiler-rt/test/profile/device-pgo/build.sh b/compiler-rt/test/profile/device-pgo/build.sh new file mode 100755 index 0000000000000..edf90f42fb8c1 --- /dev/null +++ b/compiler-rt/test/profile/device-pgo/build.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# Standalone (non-TheRock) build of the toolchain + host/device runtimes used by +# the HIP device-PGO / code-coverage tests. See toolchain-cache.cmake and +# README.md for details. +# +# ./build.sh [BUILD_DIR] +# +# Env knobs: +# LLVM_SRC path to the llvm-project checkout (default: repo root inferred +# from this script's location) +# JOBS parallelism for ninja (default: nproc) +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# .../compiler-rt/test/profile/device-pgo -> repo root is four levels up. +LLVM_SRC="${LLVM_SRC:-$(cd "${SCRIPT_DIR}/../../../.." && pwd)}" +BUILD_DIR="${1:-${LLVM_SRC}/build/device-pgo}" +JOBS="${JOBS:-$(nproc)}" + +echo "llvm-project source : ${LLVM_SRC}" +echo "build directory : ${BUILD_DIR}" +echo "parallel jobs : ${JOBS}" + +cmake -G Ninja \ + -S "${LLVM_SRC}/llvm" \ + -B "${BUILD_DIR}" \ + -C "${SCRIPT_DIR}/toolchain-cache.cmake" + +# The 'clang' target also produces the clang++ symlink. The offload toolchain +# tools (clang-offload-bundler, clang-linker-wrapper, llvm-link, +# llvm-offload-binary) and offload-arch (also installed as amdgpu-arch) are +# needed to compile/link a HIP program and to resolve --offload-arch=native / +# the multi-device test feature. 'runtimes' builds both the host (default) and +# amdgcn device runtime targets. +ninja -C "${BUILD_DIR}" -j "${JOBS}" \ + clang lld \ + clang-offload-bundler clang-linker-wrapper llvm-link llvm-offload-binary \ + offload-arch \ + llvm-profdata llvm-cov FileCheck not \ + runtimes + +cat <<EOF + +Build complete. + +Toolchain bin : ${BUILD_DIR}/bin +Run the GPU tests with, e.g.: + + python3 ${SCRIPT_DIR}/../run_gpu_tests.py \\ + --toolchain-bin ${BUILD_DIR}/bin \\ + --hip-lib-path \${ROCM_PATH:-/opt/rocm}/lib \\ + ${SCRIPT_DIR}/../GPU ${SCRIPT_DIR}/../AMDGPU + +(--toolchain-bin must be an absolute path; the runner executes RUN lines from a +temp dir. See README.md for more.) +EOF diff --git a/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake new file mode 100644 index 0000000000000..f48656c66d82d --- /dev/null +++ b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake @@ -0,0 +1,55 @@ +# CMake cache for a standalone (non-TheRock) build of everything needed to +# compile, run, and FileCheck the HIP device-PGO / code-coverage tests under +# compiler-rt/test/profile/{GPU,AMDGPU}. +# +# It produces, in a single configure: +# * the host toolchain: clang, clang++, lld, llvm-profdata, llvm-cov, plus the +# test utilities FileCheck and not (LLVM_INSTALL_UTILS); +# * the host ROCm drain runtime clang_rt.profile_rocm (opt-in, links the +# sanitizer interception object libs -- hence COMPILER_RT_BUILD_SANITIZERS); +# * the amdgcn device profile runtime libclang_rt.profile.a (the baremetal +# profile subset providing __llvm_profile_instrument_gpu and the +# __llvm_profile_sections bounds table), built for the amdgcn-amd-amdhsa +# runtime target via compiler-rt/cmake/caches/AMDGPU.cmake. Building the +# device runtime requires LLVM libc for amdgcn, so libc is enabled for that +# runtime target. +# +# Usage (see ./build.sh for a wrapper): +# cmake -G Ninja -S llvm -B build/device-pgo \ +# -C compiler-rt/test/profile/device-pgo/toolchain-cache.cmake +# ninja -C build/device-pgo clang lld clang-offload-bundler \ +# clang-linker-wrapper llvm-link llvm-offload-binary offload-arch \ +# llvm-profdata llvm-cov FileCheck not runtimes +# +# Outputs (under build/device-pgo): +# bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not} +# lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a +# lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a + +set(CMAKE_BUILD_TYPE Release CACHE STRING "") + +set(LLVM_ENABLE_PROJECTS "clang;lld" CACHE STRING "") +set(LLVM_ENABLE_RUNTIMES "compiler-rt" CACHE STRING "") +set(LLVM_TARGETS_TO_BUILD "host;AMDGPU" CACHE STRING "") +set(LLVM_ENABLE_PER_TARGET_RUNTIME_DIR ON CACHE BOOL "") +set(LLVM_ENABLE_ASSERTIONS ON CACHE BOOL "") + +set(CLANG_DEFAULT_LINKER "lld" CACHE STRING "") +set(CLANG_DEFAULT_RTLIB "compiler-rt" CACHE STRING "") + +# Make FileCheck / not available in the install/bin tree for the lit-lite runner. +set(LLVM_INSTALL_UTILS ON CACHE BOOL "") + +# Build host (default) and device (amdgcn) runtimes in one tree. +set(LLVM_RUNTIME_TARGETS "default;amdgcn-amd-amdhsa" CACHE STRING "") + +# Host runtimes: turn on the opt-in ROCm host drain library. It pulls in the +# sanitizer interception object libs, so sanitizers must be built too. +set(RUNTIMES_default_COMPILER_RT_BUILD_PROFILE_ROCM ON CACHE BOOL "") +set(RUNTIMES_default_COMPILER_RT_BUILD_SANITIZERS ON CACHE BOOL "") + +# Device runtime: the amdgcn baremetal profile subset, built with LLVM libc for +# amdgcn (freestanding C headers). +set(RUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES + "${CMAKE_SOURCE_DIR}/../compiler-rt/cmake/caches/AMDGPU.cmake" CACHE STRING "") +set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc" CACHE STRING "") diff --git a/compiler-rt/test/profile/run_gpu_tests.py b/compiler-rt/test/profile/run_gpu_tests.py new file mode 100644 index 0000000000000..27563219ba0ad --- /dev/null +++ b/compiler-rt/test/profile/run_gpu_tests.py @@ -0,0 +1,408 @@ +#!/usr/bin/env python3 +"""Minimal lit-style runner for the HIP device-PGO tests. + +The compiler-rt profile lit suite (and llvm-lit / FileCheck) is not part of the +installed ROCm artifact, but the toolchain, the amdgcn device profile runtime, +and the HIP runtime are. This runner executes the +``compiler-rt/test/profile/{GPU,AMDGPU}/*.hip`` tests directly against an +installed toolchain on a real GPU runner, interpreting just the slice of lit +markup those tests use: + + - ``// REQUIRES:`` / ``// UNSUPPORTED:`` boolean feature gating, + - ``// RUN:`` lines (with ``\\`` continuations) and the fixed substitution set + (%clang, %s, %t[.*], %amdgpu_arch, %hip_lib_path, %run, %%), + - delegation to ``FileCheck`` / ``not`` (real binaries if present on PATH, + otherwise shims backed by the ``filecheck`` PyPI package and a tiny + exit-code inverter). + +Each RUN line is executed via ``bash -e -o pipefail -c`` so pipes, redirection +and globbing behave as under lit. A test passes iff all its RUN lines exit 0. +""" + +import argparse +import os +import re +import shutil +import stat +import subprocess +import sys +import tempfile +from pathlib import Path + +# --- feature detection ------------------------------------------------------ + + +def _count_visible_gpus(toolchain_bin): + """Number of GPUs actually visible to the runtime, or 0 if unknown. + + Uses the toolchain's ``amdgpu-arch`` (one line per visible device). Unlike + the KFD topology under ``/sys/class/kfd`` this reflects what HIP/ROCr really + exposes -- it honours ``ROCR_VISIBLE_DEVICES`` / ``HIP_VISIBLE_DEVICES`` and + container device limits, so it matches what a test's ``hipGetDeviceCount`` + will see. It is also portable: Windows has no ``/dev/kfd``, but does ship + ``amdgpu-arch``. + """ + if not toolchain_bin: + return 0 + tb = Path(toolchain_bin) + exe = next( + (str(tb / c) for c in ("amdgpu-arch", "amdgpu-arch.exe") if (tb / c).exists()), + None, + ) + if exe is None: + return 0 + try: + proc = subprocess.run(exe, capture_output=True, text=True, timeout=60) + except (OSError, subprocess.SubprocessError): + return 0 + if proc.returncode != 0: + return 0 + return sum(1 for line in proc.stdout.splitlines() if line.strip()) + + +def detect_features(toolchain_bin=None, force=None): + """Return the set of lit features available on this runner. + + hip/amdgpu are assumed present (this runner only ever drives GPU tests on a + runner that has the toolchain + HIP). ``multi-device`` is derived from the + number of GPUs the runtime actually exposes (>= 2), via ``amdgpu-arch``. + """ + features = {"hip", "amdgpu"} + if sys.platform.startswith("linux"): + features.add("linux") + elif sys.platform.startswith("win"): + features.add("windows") + + if _count_visible_gpus(toolchain_bin) >= 2: + features.add("multi-device") + + if force: + for f in force: + features.add(f) + return features + + +# --- boolean expression evaluation (REQUIRES / UNSUPPORTED) ------------------ + +_TOKEN_RE = re.compile(r"\s*(\(|\)|\|\||&&|!|[\w.+-]+)\s*") + + +def _clause_to_py(clause): + out = [] + for tok in _TOKEN_RE.findall(clause): + if tok == "||": + out.append(" or ") + elif tok == "&&": + out.append(" and ") + elif tok == "!": + out.append(" not ") + elif tok in ("(", ")"): + out.append(tok) + elif tok == "true": + out.append("True") + elif tok == "false": + out.append("False") + else: + out.append("(%r in FEATURES)" % tok) + return "".join(out) or "True" + + +def eval_requires(expr, features): + """All comma-separated clauses must be true.""" + return all( + eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features}) + for c in expr.split(",") + if c.strip() + ) + + +def eval_unsupported(expr, features): + """Unsupported if any comma-separated clause is true.""" + return any( + eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features}) + for c in expr.split(",") + if c.strip() + ) + + +# --- test parsing ----------------------------------------------------------- + +_DIRECTIVE_RE = re.compile(r"(?://|#)\s*(RUN|REQUIRES|UNSUPPORTED):\s?(.*)") + + +def parse_test(path): + """Return (run_lines, requires, unsupported) for a test file.""" + runs, requires, unsupported = [], [], [] + cont = None + for raw in Path(path).read_text(errors="replace").splitlines(): + m = _DIRECTIVE_RE.search(raw) + if cont is not None: + # Continuation of a previous RUN line. + text = raw + cm = re.search(r"(?://|#)\s*RUN:\s?(.*)", raw) + if cm: + text = cm.group(1) + cont += " " + text.strip() + if cont.rstrip().endswith("\\"): + cont = cont.rstrip()[:-1] + else: + runs.append(cont) + cont = None + continue + if not m: + continue + kind, body = m.group(1), m.group(2) + if kind == "REQUIRES": + requires.append(body.strip()) + elif kind == "UNSUPPORTED": + unsupported.append(body.strip()) + elif kind == "RUN": + if body.rstrip().endswith("\\"): + cont = body.rstrip()[:-1] + else: + runs.append(body) + return runs, requires, unsupported + + +# --- substitutions ---------------------------------------------------------- + + +def make_substitutions(clang, clangxx, src, tprefix, arch, hip_lib_path): + # Order matters: longer / more specific tokens first; %% resolved last. + return [ + ("%clangxx", clangxx), + ("%clang", clang), + ("%amdgpu_arch", arch), + ("%hip_lib_path", hip_lib_path), + ("%run ", ""), + ("%s", str(src)), + ("%t", tprefix), + ("%%", "%"), + ] + + +def apply_substitutions(line, subs): + for token, value in subs: + line = line.replace(token, value) + return line + + +# --- tool shims (FileCheck / not) ------------------------------------------- + + +def ensure_tools(toolchain_bin, workdir): + """Build a PATH that resolves clang/llvm-*, FileCheck and not. + + Prefers real binaries under toolchain_bin; falls back to shims for FileCheck + (PyPI ``filecheck``) and ``not`` (exit-code inverter). + """ + shim_dir = workdir / "shims" + shim_dir.mkdir(parents=True, exist_ok=True) + path = os.pathsep.join( + [str(toolchain_bin), str(shim_dir), os.environ.get("PATH", "")] + ) + + def have(tool): + # File-based check (shutil.which is quirky across OSes / Git Bash). The + # shims are extensionless bash scripts, which Git Bash resolves via the + # shebang, so a real binary is anything matching tool or tool.exe. + tb = Path(toolchain_bin) + return (tb / tool).exists() or (tb / (tool + ".exe")).exists() + + def write_shim(name, body): + p = shim_dir / name + p.write_text(body) + p.chmod(p.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH) + + if not have("FileCheck"): + write_shim( + "FileCheck", + "#!/usr/bin/env bash\n" + 'if command -v filecheck >/dev/null 2>&1; then exec filecheck "$@"; fi\n' + 'exec python3 -m filecheck "$@"\n', + ) + if not have("not"): + write_shim( + "not", + "#!/usr/bin/env bash\n" + 'if [ "$1" = "--crash" ]; then shift; "$@"; ec=$?; ' + "[ $ec -ge 128 ] && exit 0 || exit 1; fi\n" + '"$@"; ec=$?; [ $ec -eq 0 ] && exit 1 || exit 0\n', + ) + return path + + +# --- execution -------------------------------------------------------------- + + +def run_one(path, args, features, base_env): + runs, requires, unsupported = parse_test(path) + + for expr in requires: + if not eval_requires(expr, features): + return "UNSUPPORTED", "missing requirement: %s" % expr + for expr in unsupported: + if eval_unsupported(expr, features): + return "UNSUPPORTED", "unsupported: %s" % expr + if not runs: + return "UNSUPPORTED", "no RUN lines" + + workdir = Path(tempfile.mkdtemp(prefix="profgpu-")) + tprefix = str(workdir / "t") + subs = make_substitutions( + args.clang, + args.clangxx, + Path(path).resolve(), + tprefix, + args.amdgpu_arch, + args.hip_lib_path, + ) + + if args.dry_run: + print("# %s" % path) + for line in runs: + print(" " + apply_substitutions(line, subs).strip()) + return "DRYRUN", "" + + env = dict(base_env) + env["PATH"] = ensure_tools(Path(args.toolchain_bin), workdir) + timeout = args.timeout if args.timeout and args.timeout > 0 else None + for line in runs: + cmd = apply_substitutions(line, subs).strip() + try: + proc = subprocess.run( + ["bash", "-e", "-o", "pipefail", "-c", cmd], + cwd=str(workdir), + env=env, + capture_output=True, + text=True, + timeout=timeout, + ) + except subprocess.TimeoutExpired as e: + out = e.stdout or "" + err = e.stderr or "" + if isinstance(out, bytes): + out = out.decode("utf-8", "replace") + if isinstance(err, bytes): + err = err.decode("utf-8", "replace") + detail = "RUN timed out after %gs: %s\n%s%s" % ( + timeout, + cmd, + out, + err, + ) + if not args.keep: + shutil.rmtree(workdir, ignore_errors=True) + return "FAIL", detail + if proc.returncode != 0: + detail = "RUN failed (rc=%d): %s\n%s%s" % ( + proc.returncode, + cmd, + proc.stdout, + proc.stderr, + ) + if not args.keep: + shutil.rmtree(workdir, ignore_errors=True) + return "FAIL", detail + if not args.keep: + shutil.rmtree(workdir, ignore_errors=True) + return "PASS", "" + + +def discover(paths): + tests = [] + for p in paths: + p = Path(p) + if p.is_dir(): + tests.extend(sorted(str(x) for x in p.rglob("*.hip"))) + elif p.is_file(): + tests.append(str(p)) + return tests + + +def main(): + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("tests", nargs="+", help="Test files or directories") + ap.add_argument( + "--toolchain-bin", required=False, help="Directory with clang and llvm-* tools" + ) + ap.add_argument("--hip-lib-path", default="", help="Directory with libamdhip64") + ap.add_argument("--amdgpu-arch", default="native") + ap.add_argument("--clang", help="Override clang path") + ap.add_argument("--clangxx", help="Override clang++ path") + ap.add_argument( + "--feature", + action="append", + default=[], + help="Force-enable an extra lit feature", + ) + ap.add_argument( + "--dry-run", + action="store_true", + help="Print resolved RUN lines without executing", + ) + ap.add_argument("--keep", action="store_true", help="Keep per-test temp dirs") + ap.add_argument( + "--timeout", + type=float, + default=600, + help="Per-RUN-line timeout in seconds (<=0 disables); " + "guards against a hung GPU/compiler wedging the run", + ) + args = ap.parse_args() + + if not args.dry_run and not args.toolchain_bin: + ap.error("--toolchain-bin is required unless --dry-run is given") + + if args.toolchain_bin: + binp = Path(args.toolchain_bin) + args.clang = args.clang or str(binp / "clang") + args.clangxx = args.clangxx or str(binp / "clang++") + else: + args.clang = args.clang or "clang" + args.clangxx = args.clangxx or "clang++" + + features = detect_features(args.toolchain_bin, args.feature) + print("# features: %s" % ", ".join(sorted(features))) + + base_env = dict(os.environ) + if args.toolchain_bin: + lib_dirs = [ + str(Path(args.toolchain_bin).parent / "lib"), # toolchain libs + ] + if args.hip_lib_path: + lib_dirs.append(args.hip_lib_path) + existing = base_env.get("LD_LIBRARY_PATH", "") + base_env["LD_LIBRARY_PATH"] = os.pathsep.join( + [d for d in lib_dirs if d] + ([existing] if existing else []) + ) + + tests = discover(args.tests) + if not tests: + print("error: no tests found", file=sys.stderr) + return 2 + + results = {"PASS": [], "FAIL": [], "UNSUPPORTED": [], "DRYRUN": []} + for t in tests: + status, detail = run_one(t, args, features, base_env) + results[status].append(t) + if status == "FAIL": + print("FAIL: %s" % t) + print(detail) + elif status in ("PASS", "UNSUPPORTED"): + print("%s: %s" % (status, t)) + + print( + "\n# summary: %d passed, %d failed, %d unsupported (of %d)" + % ( + len(results["PASS"]), + len(results["FAIL"]), + len(results["UNSUPPORTED"]), + len(tests), + ) + ) + return 1 if results["FAIL"] else 0 + + +if __name__ == "__main__": + sys.exit(main()) _______________________________________________ cfe-commits mailing list [email protected] https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
