https://sourceware.org/bugzilla/show_bug.cgi?id=33583
Bug ID: 33583
Summary: Linker plugin: claim_file_handler_v2 hook wrongly
called with 'known_usage=false' for files with common
symbols
Product: binutils
Version: unspecified
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: ld
Assignee: unassigned at sourceware dot org
Reporter: [email protected]
CC: hjl.tools at gmail dot com
Target Milestone: ---
This bug report was cloned/moved from GCC PR122432 (https://gcc.gnu.org/PR122432).
Background: For GPU offloading, the linker plugin (GCC's lto-plugin) has to
know whether a symbol is linked on the host side or not, as the device side
has to work in lockstep. On both the host side and the device side, an array of
function pointers is created. When an offload region is to be launched, a
function pointer is passed to the runtime. For host fallback it is called
directly; when a device is available and the pointer is found as the n-th entry
in the host array, the associated n-th entry on the device side is invoked.
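The positional pairing described above can be sketched as follows. This is only an illustration under stated assumptions, not libgomp's actual code: the helper names find_host_index and lookup_device_entry are hypothetical, and the tables are plain arrays of void pointers.

```c
#include <stddef.h>

/* Hypothetical helper (not libgomp code): return the index of HOST_FN
   in the host table, or -1 if it is not registered there.  */
ptrdiff_t
find_host_index (void *const *host_table, size_t n_host, void *host_fn)
{
  for (size_t i = 0; i < n_host; i++)
    if (host_table[i] == host_fn)
      return (ptrdiff_t) i;
  return -1;
}

/* Hypothetical helper: return the device entry paired with HOST_FN,
   or NULL if there is none.  The pairing is purely positional, so it
   is only meaningful if both tables list the same functions in the
   same order.  */
void *
lookup_device_entry (void *const *host_table, size_t n_host,
                     void *const *dev_table, size_t n_dev,
                     void *host_fn)
{
  ptrdiff_t i = find_host_index (host_table, n_host, host_fn);
  if (i < 0 || (size_t) i >= n_dev)
    return NULL;
  return dev_table[i];
}
```

If an entry is missing on the device side only, the lookup either finds no partner or, for later entries, would pair a host function with the wrong device function; hence both sides have to stay in lockstep.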
Most of the time the linker only calls the plugin when it knows that a symbol
is needed. However, for common symbols:
[…] The archive map contains a reference to this symbol […] We only want to
include it […] if this archive element contains a definition of the symbol,
not just another common declaration of it. Unfortunately some archivers
(including GNU ar) will put declarations of common symbols into their archive
maps, as well as real definitions
This comment is from bfd/elflink.c's elf_link_add_archive_symbols and for
the stated reasons, it calls elf_link_is_defined_archive_symbol.
Surprisingly often, this caused offload failures by adding unused symbols on
the device side → GCC PR109128, https://gcc.gnu.org/PR109128 . Hence, since
b21318bd2c2 Add LDPT_REGISTER_CLAIM_FILE_HOOK_V2 linker plugin hook [GCC
PR109128]
the linker passes a 'known_used' flag to the hook, which is false when the hook
is called via elf_link_is_defined_archive_symbol.
Most of the time that's fine: either the symbol is not yet really there
(h == NULL, i.e. the mentioned code block is not reached), or it is known to be
used, or some other code that is used is processed first.
* * *
However, in some cases elf_link_is_defined_archive_symbol gets called first,
and only later is some symbol actually used.
As the hook is only called once per translation unit, the offload processing
never happens, and the host and device sides get out of sync.
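The failure mode can be modelled in a few lines. This is a deliberately simplified sketch, not code from ld or lto-plugin: struct input_file and claim_file_v2_model are hypothetical names, and the "only one call per file takes effect" rule is encoded directly in the model.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one input file as seen by the plugin.  */
struct input_file
{
  const char *name;
  bool hook_called;      /* The hook has already run for this file.  */
  bool offload_emitted;  /* Offload data was processed for this file.  */
};

/* Simplified model of the v2 hook: offload data is processed only if
   the linker reports the file as known to be used, and only the first
   call per file has any effect.  */
void
claim_file_v2_model (struct input_file *f, int known_used)
{
  if (f->hook_called)
    return;
  f->hook_called = true;
  f->offload_emitted = (known_used != 0);
}
```

In this model, a first call with known_used == 0 (the common-symbol check) followed by a later call with known_used == 1 leaves offload_emitted false, which corresponds to the missing device-side entry.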
* * *
Possible fix (aka patch):
https://sourceware.org/pipermail/binutils/2025-October/145158.html
* * *
Testcase: As mentioned, a lot of conditions have to be met in order to trigger
that bug. This turned out to be surprisingly complex. Thus, the testcase is a
bit larger than hoped for (but it surely can be reduced more if one spends a
couple of hours on it).
Prerequisites:
- You need a GCC Fortran compiler that supports the v2 hook (i.e. GCC >=
14)
- This compiler has to be configured for either nvptx or gcn offloading,
cf. https://gcc.gnu.org/wiki/Offloading (for distro builds, optional
packages need to be installed; for self builds - see link).
Note that no vendor libraries are required at compile time (nor at
runtime, unless you want to use an offload device).
There are two ways to test it:
* If you have an offload device, compile the program and run it.
In the error case, it will fail before reaching main with:
libgomp: Cannot map target functions or variables (expected 2 + 0 + 1, have 2)
* The following assumes that you have nvptx and/or gcn offloading
configured but don't have a suitable offload device on your system.
This works by checking the files (requires -save-temps for stable
names).
[The test is written such that it works if only nvptx or only gcn
or both are configured.]
Testcase: https://gcc.gnu.org/bugzilla/attachment.cgi?id=62650 (.tar.gz)
This file contains the 4 Fortran files. The 'build.sh' shows how to
build and link them. It also contains the mentioned symbol checks
by grepping some files.
For the offload side, the available symbols are checked by grepping the
compiler-generated *.c file that contains a constructor that is used to
register the offload data with the device-specific libgomp plugin.
Expected are the two offload function symbols, but only one is there (bug!).
For the host side, the __OFFLOAD_TABLE__ contains the functions as:
void **host_func_table = ((void ***) __OFFLOAD_TABLE__)[0];
void **host_funcs_end = ((void ***) __OFFLOAD_TABLE__)[1];
int num_funcs = host_funcs_end - host_func_table;
However, for simplicity, the build.sh script just checks with 'nm' whether the
respective expected functions are in the executable (host side).
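For reference, the pointer arithmetic from the snippet above can be tried against a mock table. This is a minimal sketch that assumes only the layout shown there (slot 0 points at the first function entry, slot 1 one past the last); count_host_funcs is a hypothetical helper, and the real __OFFLOAD_TABLE__ is replaced by whatever the caller passes in.

```c
#include <stddef.h>

/* Hypothetical helper: count the function entries of a table laid out
   like the snippet above.  Slot 0 is assumed to point at the first
   function entry and slot 1 one past the last one.  */
ptrdiff_t
count_host_funcs (const void *offload_table)
{
  void **const *table = (void **const *) offload_table;
  void **host_func_table = table[0];
  void **host_funcs_end = table[1];
  return host_funcs_end - host_func_table;
}
```

With a mock table whose first two slots delimit an array of two entries, the helper returns 2.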
On the offload/device side, only one of the functions is available. Namely the
following one from psb_d_oacc_vect_mod-3.f90:
d_inner_oacc_amax.0._omp_fn.0
or on nvptx (no '.' permitted in the name)
d_inner_oacc_amax$0$_omp_fn$0
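When grepping for a symbol on both targets, the two spellings above differ only in '.' versus '$'. A small helper along these lines can derive one from the other; it is a hypothetical illustration (dots_to_dollars is not a GCC function) and makes no claim about how the nvptx back end performs the renaming internally.

```c
#include <stddef.h>

/* Hypothetical helper: copy SRC into DST (capacity CAP bytes),
   replacing every '.' by '$'.  Returns 0 on success and -1 if DST
   is too small to hold the result including the terminating NUL.  */
int
dots_to_dollars (const char *src, char *dst, size_t cap)
{
  size_t i;

  if (cap == 0)
    return -1;
  for (i = 0; src[i] != '\0'; i++)
    {
      if (i + 1 >= cap)
        return -1;
      dst[i] = (src[i] == '.') ? '$' : src[i];
    }
  dst[i] = '\0';
  return 0;
}
```

Applied to the gcn-style name d_inner_oacc_amax.0._omp_fn.0, this yields the nvptx-style name shown above.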
On the host side, there is:
d_inner_oacc_amax.0._omp_fn
d_inner_oacc_mlt_v_2.0._omp_fn
The second function is missing on the device side if the otherwise unused
common symbols are included. If you comment out the "use my_mpi" line, it works
fine and the symbol shows up.
The mlt_v symbol comes from the file psb_d_oacc_mlt_v_2-2.f90, the 'use my_mpi'
is in psb_d_oacc_vect_mod-3.f90 – but is pulled into the former file by using
the module: 'use psb_d_oacc_vect_mod'.