https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112465

            Bug ID: 112465
           Summary: libgcc: aarch64: lse runtime does not work with big
                    data segments
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgcc
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jemarch at gcc dot gnu.org
  Target Milestone: ---

While compiling and linking the STREAM benchmark
(http://www.cs.virginia.edu/stream/ref.html) on aarch64 with very big arrays,
the link step fails:

  $ gcc -O2 -DSTREAM_ARRAY_SIZE=178956970 -mcmodel=large -fopenmp -o stream.4gb stream.c
  libgcc.a(lse-init.o): in function `init_have_lse_atomics':
  (.text.startup+0x14): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
  libgcc.a(ldadd_4_1.o): in function `__aarch64_ldadd4_relax':
  (.text+0x4): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21
  against symbol `__aarch64_have_lse_atomics' defined in .bss section in
  collect2: error: ld returned 1 exit status

The LSE machinery in libgcc assumes that the global
__aarch64_have_lse_atomics lies within +-4GiB of the code that accesses
it.  This is due to code like this:

  .macro        JUMP_IF_NOT_LSE label
        adrp    x(tmp0), __aarch64_have_lse_atomics
        ldrb    w(tmp0), [x(tmp0), :lo12:__aarch64_have_lse_atomics]
        cbz     w(tmp0), \label
  .endm

This macro is placed in the prologue of every LSE helper function in libgcc
(such as __aarch64_ldadd4_relax, used by the little reproducer below).  The
same global is also accessed by the initialization routine, also part of libgcc:

  static void __attribute__((constructor (90)))
  init_have_lse_atomics (void)
  {
    unsigned long hwcap = __getauxval (AT_HWCAP);
    __aarch64_have_lse_atomics = (hwcap & HWCAP_ATOMICS) != 0;
  }

The code compiled for the assignment in that function also uses an
adrp-based instruction sequence.  The addressing mode implemented by adrp+ldrb
can only reach addresses within +-4GiB of the instruction.
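
To make the +-4GiB limit concrete, here is a minimal C sketch of the range
check implied by R_AARCH64_ADR_PREL_PG_HI21.  It is not taken from libgcc or
binutils; the helper name adrp_reachable is made up for illustration.  It
relies only on the fact that adrp encodes a signed 21-bit delta counted in
4KiB pages.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask that clears the low 12 bits, i.e. rounds down to a 4KiB page.  */
#define PAGE_MASK (~UINT64_C (0xfff))

/* Hypothetical helper, not part of libgcc or binutils: return true if an
   adrp instruction located at address PC can form the page address of
   TARGET.  */
static bool
adrp_reachable (uint64_t pc, uint64_t target)
{
  /* adrp holds a signed 21-bit page delta, so the byte delta between the
     two pages must lie in [-2^32, 2^32).  */
  int64_t delta = (int64_t) ((target & PAGE_MASK) - (pc & PAGE_MASK));
  return delta >= -(INT64_C (1) << 32) && delta < (INT64_C (1) << 32);
}
```

For example, adrp_reachable (0x400000, 0x400000 + 0xffffffff) is true, while
a target exactly 4GiB above the same pc is already out of range.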

The problem shows up in the stream.c benchmark, and also in this little
reproducer:

  static int foo;
  static double a[178956970],b[178956970],c[178956970];

  int main ()
  {
  #pragma omp atomic 
    foo++;
    return foo + a[0] + b[0] + c[0];
  }

The variables a, b and c are allocated in .bss.  It so happens that
__aarch64_have_lse_atomics is also placed in .bss:

  /* Define the symbol gating the LSE implementations.  */
  _Bool __aarch64_have_lse_atomics
    __attribute__((visibility("hidden"), nocommon));

However, it ends up _after_ a, b and c.  So it is the offset of
__aarch64_have_lse_atomics within .bss that overflows the
relocation for the adrp instruction.
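
As a rough sanity check of the sizes involved, the following C sketch computes
the combined size of a, b and c and compares it with the 4GiB that adrp can
span forward.  The helper names are made up for illustration, it assumes
sizeof (double) is 8, and it does not model how the linker actually lays out
the sections.

```c
#include <stdint.h>

/* Array length used in the reproducer (STREAM_ARRAY_SIZE in the report).  */
#define N UINT64_C (178956970)

/* Combined size in bytes of the three double arrays a, b and c.  */
static uint64_t
arrays_bytes (void)
{
  return 3 * N * sizeof (double);
}

/* Number of bytes adrp can span in the forward direction (4GiB).  */
static uint64_t
adrp_span (void)
{
  return UINT64_C (1) << 32;
}
```

With these definitions arrays_bytes () is 4294967280, only 16 bytes short of
adrp_span ().  A symbol placed after the three arrays is therefore very likely
out of reach of code located before the start of .bss, which is consistent
with the linker errors above.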
