[clang] [docs][HIP] Document source-based device code coverage workflow (PR #200197)

Yaxun Liu via cfe-commits Fri, 12 Jun 2026 07:15:43 -0700

https://github.com/yxsamliu updated https://github.com/llvm/llvm-project/pull/200197


>From dadbe14ddf43b631addae3940ac99825b26d5f19 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Thu, 28 May 2026 11:11:01 -0400
Subject: [PATCH 1/2] [docs][HIP] Document offload PGO workflow

Add a section to HIPSupport.rst describing IR-level profile-guided
optimization for HIP device code. -fprofile-generate instruments both
host and device; the runtime writes a host .profraw and one set of
device .profraw files per --offload-arch= value, with the standard
LLVM_PROFILE_FILE substitutions applying to both. Host and each
per-architecture device profile are merged independently with
llvm-profdata, and the use-phase build feeds them back via
-Xarch_host -fprofile-use= and -Xarch_<gpu-arch> -fprofile-use= (with
-Xarch_device as a single-arch shorthand).

Also add a CUDA/HIP Language Changes entry in ReleaseNotes.rst.
---
 clang/docs/HIPSupport.rst   | 102 ++++++++++++++++++++++++++++++++++++
 clang/docs/ReleaseNotes.rst |   7 +++
 2 files changed, 109 insertions(+)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index 82070a4042679..99559548823b2 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -951,6 +951,108 @@ Open Questions / Future Developments
 4. Offload support might be extended to cases where the ``parallel_policy`` is
    used for some or all targets.
 
+Profile-Guided Optimization for Device Code
+===========================================
+
+Clang supports IR-level profile-guided optimization (PGO) for HIP device
+code on AMD GPUs. ``-fprofile-generate`` instruments both host and
+device code; running the instrumented binary writes separate host and
+device raw profiles, which are merged independently and consumed by a
+second build that passes the appropriate profile to each side.
+
+Prerequisites
+-------------
+
+The toolchain must be built with the AMDGPU profile runtime enabled,
+which requires building ``compiler-rt`` for the ``amdgcn-amd-amdhsa``
+target via the runtimes build. A minimal CMake configuration is:
+
+.. code-block:: console
+
+   $ cmake <llvm-project>/llvm \
+       -DLLVM_ENABLE_PROJECTS='clang;lld' \
+       -DLLVM_ENABLE_RUNTIMES=compiler-rt \
+       -DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa' \
+       -DRUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES=<llvm-project>/compiler-rt/cmake/caches/AMDGPU.cmake \
+       -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES='compiler-rt;libc' \
+       -DRUNTIMES_amdgcn-amd-amdhsa_RUNTIMES_USE_LIBC=llvm-libc
+
+``COMPILER_RT_BUILD_PROFILE_ROCM`` controls building the host-side
+ROCm/HIP device profile collection runtime, ``clang_rt.profile_rocm``.
+It is on by default for normal Linux and Windows compiler-rt builds,
+and off for bare-metal profile builds and unsupported hosts; leave it
+enabled. ``RUNTIMES_USE_LIBC=llvm-libc`` is required so the amdgcn
+profile compile picks up LLVM-libc's ``-isystem`` / ``-nostdlibinc``
+headers.
+
+Generate phase
+--------------
+
+The driver forwards ``-fprofile-generate`` to the device compiler and
+links the device profile runtime into the embedded device image.
+
+.. code-block:: console
+
+   $ clang++ -x hip demo.hip \
+       --offload-arch=gfx1100 --offload-arch=gfx1101 \
+       -fprofile-generate=pgo_data \
+       -o demo.instr
+
+   $ ./demo.instr
+
+When the instrumented binary exits, the runtime writes raw profile
+files into ``pgo_data/``. Host profiles use the standard LLVM profile
+filename; device profiles use the same filename with the GPU
+architecture name prepended to the basename, so each
+``--offload-arch=`` value produces its own set of device files. The
+usual ``LLVM_PROFILE_FILE`` substitutions (``%p`` for process ID,
+``%m`` for binary signature, etc.) apply to both, so multi-process
+runs do not need a separate device-side naming scheme.
+
+Merge the host profile and each device architecture's profile
+separately:
+
+.. code-block:: console
+
+   $ llvm-profdata merge -o host.profdata           pgo_data/default_*.profraw
+   $ llvm-profdata merge -o device.gfx1100.profdata pgo_data/gfx1100*.profraw
+   $ llvm-profdata merge -o device.gfx1101.profdata pgo_data/gfx1101*.profraw
+
+Use phase
+---------
+
+Host and device compilations consume different profiles, and each GPU
+architecture consumes its own. ``-Xarch_host`` selects the host
+profile and ``-Xarch_<gpu-arch>`` selects the per-architecture device
+profile:
+
+.. code-block:: console
+
+   $ clang++ -x hip demo.hip \
+       --offload-arch=gfx1100 --offload-arch=gfx1101 \
+       -Xarch_host    -fprofile-use=host.profdata \
+       -Xarch_gfx1100 -fprofile-use=device.gfx1100.profdata \
+       -Xarch_gfx1101 -fprofile-use=device.gfx1101.profdata \
+       -o demo
+
+For a single-arch build, ``-Xarch_device`` is a convenient shorthand
+that applies the same profile to every offload architecture:
+
+.. code-block:: console
+
+   $ clang++ -x hip demo.hip --offload-arch=gfx1101 \
+       -Xarch_host   -fprofile-use=host.profdata \
+       -Xarch_device -fprofile-use=device.gfx1101.profdata \
+       -o demo
+
+Notes
+-----
+
+- The instrumented build is slower than a normal build; only the use
+  phase produces the optimized binary intended for deployment.
+- Set ``LLVM_PROFILE_VERBOSE=1`` to print runtime diagnostics for
+  profile file creation and device profile collection.
+
 SPIR-V Support on HIPAMD ToolChain
 ==================================
 
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index a984982d1bd41..d241132a4e9bf 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -901,6 +901,13 @@ CUDA/HIP Language Changes
 
 - The new offloading driver is now the default for HIP. Use
   `--no-offload-new-driver` to return to the old behavior.
+- Added IR-level profile-guided optimization (PGO) support for HIP
+  device code on AMD GPUs. ``-fprofile-generate`` now instruments both
+  host and device; running the instrumented binary writes host and
+  per-GPU-architecture device raw profiles, which are merged separately
+  with ``llvm-profdata`` and fed back via ``-Xarch_host`` /
+  ``-Xarch_<gpu-arch>`` ``-fprofile-use=``. See :doc:`HIPSupport` for
+  the full workflow.
 
 CUDA Support
 ^^^^^^^^^^^^
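
The merge and use-phase steps documented in PATCH 1/2 follow a regular
per-architecture pattern. As a minimal sketch (not part of the patch), the
shell helpers below print the llvm-profdata commands and the -Xarch_* flags
for a list of --offload-arch= values. The function names merge_commands and
use_phase_flags are hypothetical, the file names simply mirror the examples
in the documentation above, and nothing is executed, so no LLVM tools are
needed to try it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch): print, but do not run, the
# commands from the HIP offload PGO workflow for a list of GPU architectures.
# File names mirror the examples in HIPSupport.rst and are assumptions.

# Print one llvm-profdata merge command for the host profile and one per
# architecture. $1 is the raw profile directory; the rest are architectures.
merge_commands() {
  dir=$1
  shift
  # The globs stay unexpanded because they are inside double quotes.
  echo "llvm-profdata merge -o host.profdata $dir/default_*.profraw"
  for arch in "$@"; do
    echo "llvm-profdata merge -o device.$arch.profdata $dir/$arch*.profraw"
  done
}

# Print the -Xarch_* flags that route each merged profile to its compilation.
use_phase_flags() {
  flags="-Xarch_host -fprofile-use=host.profdata"
  for arch in "$@"; do
    flags="$flags -Xarch_$arch -fprofile-use=device.$arch.profdata"
  done
  echo "$flags"
}

merge_commands pgo_data gfx1100 gfx1101
use_phase_flags gfx1100 gfx1101
```

Run as written, the script prints the three merge commands from the
"Generate phase" example followed by the flag list used in the "Use phase"
example. Printing rather than executing keeps the sketch safe to try on a
machine without an instrumented binary.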

>From d174b9afd99fc07a3a7c5832d175f1a8cd383707 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Thu, 28 May 2026 10:36:57 -0400
Subject: [PATCH 2/2] [docs][HIP] Document source-based device code coverage
 workflow

Add a section to HIPSupport.rst describing how to produce source-based
code coverage reports for HIP device code on AMD GPUs: compile with
-fprofile-instr-generate -fcoverage-mapping, extract the device ELF
from the host binary's .hip_fatbin section, unbundle with
clang-offload-bundler using the hip-amdgcn-amd-amdhsa--<arch> target
ID, and run llvm-profdata / llvm-cov against the device object.
---
 clang/docs/HIPSupport.rst   | 53 +++++++++++++++++++++++++++++++++++++
 clang/docs/ReleaseNotes.rst |  6 +++++
 2 files changed, 59 insertions(+)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index 99559548823b2..940598d04e346 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -1053,6 +1053,59 @@ Notes
 - Set ``LLVM_PROFILE_VERBOSE=1`` to print runtime diagnostics for
   profile file creation and device profile collection.
 
+Source-Based Code Coverage for Device Code
+==========================================
+
+Clang supports source-based code coverage for HIP device code on AMD GPUs.
+Device code is instrumented with the same ``-fprofile-instr-generate
+-fcoverage-mapping`` flags used for host code; counters live in the device
+binary, are written to a ``.profraw`` file at process exit, and can be
+consumed by ``llvm-profdata`` and ``llvm-cov``.
+
+Prerequisites
+-------------
+
+Source-based device coverage relies on the AMDGPU profile runtime, so
+the toolchain must be built with the same CMake configuration used for
+HIP offload PGO. See the *Prerequisites* subsection under
+`Profile-Guided Optimization for Device Code`_.
+
+Example
+-------
+
+Given a HIP program ``demo.hip``, the following commands produce an LCOV
+report covering device code:
+
+.. code-block:: console
+
+   $ clang++ -x hip demo.hip \
+       --offload-arch=gfx1101 \
+       -fprofile-instr-generate -fcoverage-mapping \
+       -o demo
+
+   $ llvm-objcopy --dump-section=.hip_fatbin=fatbin.bin demo
+   $ clang-offload-bundler --type=o --input=fatbin.bin \
+       --output=device.gfx1101.o \
+       --targets=hip-amdgcn-amd-amdhsa--gfx1101 --unbundle
+
+   $ LLVM_PROFILE_FILE="cov.%p.profraw" ./demo
+   $ llvm-profdata merge -sparse -o cov.profdata cov.*.profraw
+
+   $ llvm-cov report device.gfx1101.o -instr-profile=cov.profdata
+   $ llvm-cov show   device.gfx1101.o -instr-profile=cov.profdata
+   $ llvm-cov export device.gfx1101.o -instr-profile=cov.profdata \
+       -format=lcov > coverage.lcov
+
+The device ELF is extracted from the ``.hip_fatbin`` section of the host
+binary and then unbundled with ``clang-offload-bundler``. The unbundle
+target string uses the bundle ID ``hip-amdgcn-amd-amdhsa--<arch>``,
+which is the offload kind (``hip``) followed by the standard
+four-component target triple (``amdgcn-amd-amdhsa-``, with the empty
+environment field giving the trailing dash) and then the target ID
+(``<arch>``). See :doc:`ClangOffloadBundler` for the full bundle entry
+ID grammar. ``llvm-cov`` is invoked against the device object because
+the coverage mapping for device functions is emitted there.
+
 SPIR-V Support on HIPAMD ToolChain
 ==================================
 
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index d241132a4e9bf..96b8b565787eb 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -908,6 +908,12 @@ CUDA/HIP Language Changes
   with ``llvm-profdata`` and fed back via ``-Xarch_host`` /
   ``-Xarch_<gpu-arch>`` ``-fprofile-use=``. See :doc:`HIPSupport` for
   the full workflow.
+- Added source-based code coverage support for HIP device code on AMD
+  GPUs. ``-fprofile-instr-generate -fcoverage-mapping`` now instruments
+  device code; running the instrumented binary writes per-GPU
+  architecture raw profiles that can be merged with ``llvm-profdata``
+  and reported by ``llvm-cov`` against the extracted device code
+  object. See :doc:`HIPSupport` for the full workflow.
 
 CUDA Support
 ^^^^^^^^^^^^
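
The bundle entry ID described in PATCH 2/2 is easy to get wrong because of
the double dash. As a small sketch (not part of the patch), the helpers
below assemble the ID from its parts and print the matching unbundle
command. The function names bundle_id and unbundle_command are
hypothetical, the file names fatbin.bin and device.<arch>.o are taken from
the documentation example, and the command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch): build the
# clang-offload-bundler target string for a HIP device object on AMD GPUs.

# Compose <offload kind>-<triple>-<target ID>. The triple ends in a dash
# because its environment field is empty, which yields the double dash.
bundle_id() {
  kind=hip
  triple=amdgcn-amd-amdhsa-
  echo "$kind-$triple-$1"
}

# Print (do not run) the unbundle command from the documentation example.
unbundle_command() {
  echo "clang-offload-bundler --type=o --input=fatbin.bin" \
       "--output=device.$1.o --targets=$(bundle_id "$1") --unbundle"
}

bundle_id gfx1101
unbundle_command gfx1101
```

For gfx1101 the first line printed is hip-amdgcn-amd-amdhsa--gfx1101, which
matches the --targets= value used in the coverage example.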

