https://github.com/yxsamliu created 
https://github.com/llvm/llvm-project/pull/168566

Clarify how Clang-generated HIP fat binaries are registered and unregistered 
with the HIP runtime, and how this interacts with global constructors, 
destructors, and atexit handlers. Document that there is no strong guarantee on 
ordering relative to user-defined global ctors/dtors, recommend that HIP 
application developers avoid using kernels or device variables from global 
ctors/dtors, and describe the implications for HIP runtime developers 
(synchronization and guards in 
__hipRegisterFatBinary/__hipUnregisterFatBinary). This is motivated by 
questions from HIP application and runtime developers about fat binary 
registration/unregistration order and its potential interference with their own 
initialization and teardown code.

From e0dc0df1639603a4a28fd72d1a5da19853de12ad Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Tue, 18 Nov 2025 11:46:17 -0500
Subject: [PATCH] Improve HIP docs on fat binary registration ordering

Clarify how Clang-generated HIP fat binaries are registered and unregistered
with the HIP runtime, and how this interacts with global constructors,
destructors, and atexit handlers. Document that there is no strong guarantee
on ordering relative to user-defined global ctors/dtors, recommend that HIP
application developers avoid using kernels or device variables from global
ctors/dtors, and describe the implications for HIP runtime developers
(synchronization and guards in __hipRegisterFatBinary/__hipUnregisterFatBinary).
This is motivated by questions from HIP application and runtime developers
about fat binary registration/unregistration order and its potential
interference with their own initialization and teardown code.
---
 clang/docs/HIPSupport.rst | 82 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index ab9ea110e6d54..b33d663f0cfee 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -210,6 +210,88 @@ Host Code Compilation
 - These relocatable objects are then linked together.
 - Host code within a TU can call host functions and launch kernels from another TU.
 
+HIP Fat Binary Registration and Unregistration
+==============================================
+
+When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
+binaries" and generates host-side helper functions that register these
+fat binaries with the HIP runtime at program start and unregister them at
+program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit
+typically produces its own self-contained fat binary per GPU architecture. In
+RDC mode (``-fgpu-rdc``), device bitcode from multiple compilation units may be
+linked together into a single fat binary per GPU architecture.
+
+At the LLVM IR level, Clang/LLVM typically create an internal module
+constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
+function) and add it to ``@llvm.global_ctors``. This constructor is called by
+the C runtime before ``main`` and it:
+
+* calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
+  object that describes the HIP fat binary;
+* stores the returned handle in an internal global variable;
+* calls an internal helper such as ``__hip_register_globals`` to register
+  kernels, device variables and other metadata associated with the fat binary;
+* registers a corresponding module destructor with ``atexit`` so it will run
+  during program termination.
+
+The module destructor (for example ``__hip_module_dtor`` or a
+``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
+non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
+the HIP runtime, and then clears the handle. This ensures that the HIP runtime
+sees each fat binary registered exactly once and that it is unregistered once
+at exit, even when multiple translation units contribute HIP kernels to the
+same host program.
+
+These registration/unregistration helpers are implementation details of Clang's
+HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
+``__hipUnregisterFatBinary`` directly.
+
+Implications for HIP Application Developers
+-------------------------------------------
+
+The fat binary registration and unregistration helpers participate in the same
+global constructor and termination mechanisms as the rest of the program, and
+there is no strong guarantee about their relative order with user-defined
+global constructors and destructors. In particular:
+
+* Applications should not invoke ``__hipRegisterFatBinary`` or
+  ``__hipUnregisterFatBinary`` explicitly.
+* Because registration happens in a compiler-generated module constructor and
+  unregistration happens via an ``atexit``-registered module destructor, the
+  exact ordering relative to other global ctors/dtors and ``atexit`` handlers
+  is implementation-dependent and may vary across platforms and toolchain
+  options.
+* To avoid subtle ordering issues, applications should not rely on HIP kernels
+  or device variables being usable from user-defined global constructors or
+  destructors. HIP initialization and teardown that touches kernels or device
+  state should instead be performed in ``main`` (or in functions called from
+  ``main``) after process startup.
+* In RDC mode, multiple translation units may contribute device code to a
+  single fat binary; user code should not make assumptions based on a
+  particular registration order between translation units.
+
+Implications for HIP Runtime Developers
+---------------------------------------
+
+HIP runtime implementations that are linked with Clang-generated host code
+must handle registration and unregistration in the presence of uncertain
+global ctor/dtor ordering:
+
+* ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
+  wrapper object and return an opaque handle that remains valid for as long as
+  the fat binary may be used.
+* ``__hipUnregisterFatBinary`` must accept the handle previously returned by
+  ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
+  called late in process teardown, after other parts of the runtime have
+  started shutting down, so it should be robust in the presence of partially
+  torn-down state.
+* Runtimes should use appropriate synchronization and guards so that fat
+  binary registration does not observe uninitialized resources and
+  unregistration does not release resources that are still required by other
+  runtime components. In particular, registration and unregistration routines
+  should be written to be safe under repeated calls and in the presence of
+  concurrent or overlapping initialization/teardown logic.
+
 Syntax Difference with CUDA
 ===========================
 

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
