https://github.com/yxsamliu created https://github.com/llvm/llvm-project/pull/168566
Clarify how Clang-generated HIP fat binaries are registered and unregistered with the HIP runtime, and how this interacts with global constructors, destructors, and atexit handlers. Document that there is no strong guarantee on ordering relative to user-defined global ctors/dtors, recommend that HIP application developers avoid using kernels or device variables from global ctors/dtors, and describe the implications for HIP runtime developers (synchronization and guards in __hipRegisterFatBinary/__hipUnregisterFatBinary).

This is motivated by questions from HIP application and runtime developers about fat binary registration/unregistration order and its potential interference with their own initialization and teardown code.

From e0dc0df1639603a4a28fd72d1a5da19853de12ad Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <[email protected]>
Date: Tue, 18 Nov 2025 11:46:17 -0500
Subject: [PATCH] Improve HIP docs on fat binary registration ordering

Clarify how Clang-generated HIP fat binaries are registered and
unregistered with the HIP runtime, and how this interacts with global
constructors, destructors, and atexit handlers. Document that there is
no strong guarantee on ordering relative to user-defined global
ctors/dtors, recommend that HIP application developers avoid using
kernels or device variables from global ctors/dtors, and describe the
implications for HIP runtime developers (synchronization and guards in
__hipRegisterFatBinary/__hipUnregisterFatBinary).

This is motivated by questions from HIP application and runtime
developers about fat binary registration/unregistration order and its
potential interference with their own initialization and teardown code.
---
 clang/docs/HIPSupport.rst | 82 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index ab9ea110e6d54..b33d663f0cfee 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -210,6 +210,88 @@ Host Code Compilation
 - These relocatable objects are then linked together.
 - Host code within a TU can call host functions and launch kernels from another TU.
 
+HIP Fat Binary Registration and Unregistration
+==============================================
+
+When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
+binaries" and generates host-side helper functions that register these
+fat binaries with the HIP runtime at program start and unregister them at
+program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit
+typically produces its own self-contained fat binary per GPU architecture. In
+RDC mode (``-fgpu-rdc``), device bitcode from multiple compilation units may be
+linked together into a single fat binary per GPU architecture.
+
+At the LLVM IR level, Clang/LLVM typically create an internal module
+constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
+function) and add it to ``@llvm.global_ctors``. This constructor is called by
+the C runtime before ``main`` and it:
+
+* calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
+  object that describes the HIP fat binary;
+* stores the returned handle in an internal global variable;
+* calls an internal helper such as ``__hip_register_globals`` to register
+  kernels, device variables and other metadata associated with the fat binary;
+* registers a corresponding module destructor with ``atexit`` so it will run
+  during program termination.
+
+The module destructor (for example ``__hip_module_dtor`` or a
+``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
+non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
+the HIP runtime, and then clears the handle. This ensures that the HIP runtime
+sees each fat binary registered exactly once and that it is unregistered once
+at exit, even when multiple translation units contribute HIP kernels to the
+same host program.
+
+These registration/unregistration helpers are implementation details of Clang's
+HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
+``__hipUnregisterFatBinary`` directly.
+
+Implications for HIP Application Developers
+-------------------------------------------
+
+The fat binary registration and unregistration helpers participate in the same
+global constructor and termination mechanisms as the rest of the program, and
+there is no strong guarantee about their relative order with user-defined
+global constructors and destructors. In particular:
+
+* Applications should not invoke ``__hipRegisterFatBinary`` or
+  ``__hipUnregisterFatBinary`` explicitly.
+* Because registration happens in a compiler-generated module constructor and
+  unregistration happens via an ``atexit``-registered module destructor, the
+  exact ordering relative to other global ctors/dtors and ``atexit`` handlers
+  is implementation-dependent and may vary across platforms and toolchain
+  options.
+* To avoid subtle ordering issues, applications should not rely on HIP kernels
+  or device variables being usable from user-defined global constructors or
+  destructors. HIP initialization and teardown that touches kernels or device
+  state should instead be performed in ``main`` (or in functions called from
+  ``main``) after process startup.
+* In RDC mode, multiple translation units may contribute device code to a
+  single fat binary; user code should not make assumptions based on a
+  particular registration order between translation units.
+
+Implications for HIP Runtime Developers
+---------------------------------------
+
+HIP runtime implementations that are linked with Clang-generated host code
+must handle registration and unregistration in the presence of uncertain
+global ctor/dtor ordering:
+
+* ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
+  wrapper object and return an opaque handle that remains valid for as long as
+  the fat binary may be used.
+* ``__hipUnregisterFatBinary`` must accept the handle previously returned by
+  ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
+  called late in process teardown, after other parts of the runtime have
+  started shutting down, so it should be robust in the presence of partially
+  torn-down state.
+* Runtimes should use appropriate synchronization and guards so that fat
+  binary registration does not observe uninitialized resources and
+  unregistration does not release resources that are still required by other
+  runtime components. In particular, registration and unregistration routines
+  should be written to be safe under repeated calls and in the presence of
+  concurrent or overlapping initialization/teardown logic.
+
 Syntax Difference with CUDA
 ===========================
 
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
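
Illustrative note (not part of the patch): the lifecycle the new section describes (register in a module constructor, store the handle, schedule the destructor with ``atexit``, and guard unregistration on a non-null handle) can be modelled in plain C++ without a HIP toolchain. The sketch below is only a rough model under stated assumptions. Every name in it (``sketch``, ``mock_register_fat_binary``, ``mock_unregister_fat_binary``, ``module_ctor``, ``module_dtor``, ``Registrar``) is hypothetical, the mock functions stand in for the runtime entry points and do not reproduce the real HIP ABI, and a static object's constructor stands in for the ``@llvm.global_ctors`` entry.

```cpp
#include <cstdlib>

namespace sketch {

// Hypothetical mock of the runtime side; NOT the real HIP entry points.
int register_calls = 0;
int unregister_calls = 0;

// Returns an opaque handle that points at storage owned by the mock runtime.
void **mock_register_fat_binary(const void *wrapper) {
  static void *runtime_slot = nullptr;
  runtime_slot = const_cast<void *>(wrapper);
  ++register_calls;
  return &runtime_slot;
}

// Releases what the handle refers to and counts the call.
void mock_unregister_fat_binary(void **handle) {
  *handle = nullptr;
  ++unregister_calls;
}

// Stand-in for the compiler-generated wrapper object (assumption).
const int fake_wrapper = 0;

// Stand-in for the internal global that stores the returned handle.
void **binary_handle = nullptr;

// Models the module destructor: load the handle, check that it is
// non-null, unregister, then clear the handle.
void module_dtor() {
  if (binary_handle != nullptr) {
    mock_unregister_fat_binary(binary_handle);
    binary_handle = nullptr;
  }
}

// Models the module constructor: register, store the handle, then
// schedule the destructor with atexit.
void module_ctor() {
  binary_handle = mock_register_fat_binary(&fake_wrapper);
  // A helper that registers kernels and device variables would run here.
  std::atexit(module_dtor);
}

// A static object's constructor runs before main, in an order relative to
// other translation units' globals that the program cannot rely on.
struct Registrar {
  Registrar() { module_ctor(); }
};
Registrar registrar;

} // namespace sketch
```

Because ``module_dtor`` clears the handle after unregistering, a second call (for example the one ``atexit`` makes after an earlier manual call) is a no-op. That mirrors the non-null check the patch describes for the generated destructor, and the "safe under repeated calls" property it asks of runtime implementations.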
