[PATCH 00/10] Add FP overloads for __atomic_fetch_add etc

mmalcomson Mon, 11 Aug 2025 04:09:53 -0700

From: Matthew Malcomson <mmalcom...@nvidia.com>

Cc'ing in middle-end maintainers since I *think* that is the best group
for the atomics machinery.  Would appreciate a pointer if someone else
would be better to Cc in.


Cc'ing in Joseph Myers since he's been very helpful w.r.t. floating
point and libatomic so far.

Dropping Jonathan Wakely from Cc because libstdc++ related things are
now in so avoiding the unnecessary ping.

Rebase & tweak of atomic fp fetch_{add,sub} patch series posted last
year:
https://gcc.gnu.org/pipermail/gcc-patches/2024-November/668754.html

Changes from the last series are minor:
1) Rebased onto Prathamesh's automatic libatomic linking patch.
   Latest version of that is found at the link below (though this patch
   series was rebased onto the version one before this).
   https://gcc.gnu.org/pipermail/gcc-patches/2025-August/692287.html
2) Fixed some typos in comments.
3) Removed the "work without libatomic" flag that Joseph pointed out was
   unnecessary.  Chose not to include it solely for testing.
4) Made the documentation changes suggested in last review.
   - N.b. that patch has been approved, so not Cc'ing anyone on that
     patch even though sending it upstream.
5) Added a patch for avoiding problems in x86_64 libstdc++.
   (Would appreciate extra attention on this patch -- it modifies a
   target hook in a backend that I'm not familiar with).

On top of that the context around the patch has changed a bit, so cover
letter adjusted below:

This patchset introduces floating point versions of atomic fetch_add,
fetch_sub, add_fetch and sub_fetch.  Instructions for performing these
operations have been directly available in GPU hardware for a while, and
are now starting to get added to CPU ISAs with instructions like the
AArch64 LDFADD.  Clang has allowed floating point types to be used with
these builtins for a while now (https://reviews.llvm.org/D71726).

Introducing these new overloads to this builtin allows users to directly
specify the operation needed and hence allows the compiler to provide
optimised output if possible.

There is additional motivation to use such floating point type atomic
operations in libstdc++ so that other compilers can use libstdc++ to
generate optimal code for their own targets (e.g. NVC++ can use
libstdc++ atomic<float>::fetch_add to generate optimal code for GPUs
when using the `-stdpar` argument).  Jonathan Wakely has already posted
a patch introducing the use of these builtins into libstdc++ when they
are available.

We intend to post a patch using the new AArch64 instructions later in
this release cycle.

------------------------------
As standard with the existing atomic builtins, we add the same functions
in libatomic, allowing a fallback for when a given target has not
implemented these operations directly in hardware.  In order to use
these functions we need to have a naming scheme that encodes the type --
we use a suffix of _fp to denote that this operation is on a floating
point type, and a further empty suffix to denote a double, 'f' to denote
a float, and similar.  The scheme for the second part of the suffix is
taken from the existing builtins that have different versions for
different floating point types -- e.g.  __builtin_acosh,
__builtin_acoshf, __builtin_acoshl, etc.
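
To make the naming scheme concrete, here is a small sketch that composes
a resolved name from a base name and a type suffix.  The helper
`sketch_resolved_name` is hypothetical and exists only for illustration;
it is not part of the patch series.  The 'l' suffix for long double is
assumed by analogy with __builtin_acoshl.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes BASE + "_fp" + TYPE_SUFFIX into BUF.
   TYPE_SUFFIX is "" for double and "f" for float; "l" for long double
   is an assumption based on the __builtin_acoshl analogy.  */
static const char *
sketch_resolved_name (char *buf, size_t len, const char *base,
                      const char *type_suffix)
{
  assert (len > 0);
  snprintf (buf, len, "%s_fp%s", base, type_suffix);
  return buf;
}
```

Calling it with "__atomic_fetch_add" and an empty suffix gives
"__atomic_fetch_add_fp" (the double version), while a suffix of "f"
gives "__atomic_fetch_add_fpf".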

In order to add floating point functions to libatomic we updated the
makefile machinery to use names more descriptive of the new setup (where
the SIZE of the datatype can no longer be used to distinguish all
operations from each other).  Moreover we add a CAS loop implementation
in fop_n.c that handles floating point exception information and handles
casting between floating point and integral types when switching between
applying the operation and using CAS to attempt to store.
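
As a rough illustration of the shape of such a CAS loop, here is a
minimal sketch of a fetch_add on a double built from the existing
integer builtins.  It is not the code in fop_n.c: the function name
`sketch_fetch_add_double` is hypothetical, the memory order is fixed to
sequentially consistent for brevity, a 64-bit double is assumed, and the
floating point exception handling is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (double) == sizeof (uint64_t),
               "sketch assumes a 64-bit double");

/* Hypothetical helper: atomically add VAL to *PTR and return the value
   that *PTR held beforehand.  The addition is done on doubles, while the
   load and the compare-and-swap are done on the same bits viewed as a
   64-bit integer.  */
static double
sketch_fetch_add_double (double *ptr, double val)
{
  uint64_t *iptr = (uint64_t *) ptr;
  uint64_t old_bits = __atomic_load_n (iptr, __ATOMIC_RELAXED);
  uint64_t new_bits;
  double old_val, new_val;

  do
    {
      /* Integral -> floating point: reinterpret the bits just read.  */
      memcpy (&old_val, &old_bits, sizeof old_val);
      new_val = old_val + val;
      /* Floating point -> integral: bits to attempt to store.  */
      memcpy (&new_bits, &new_val, sizeof new_bits);
      /* On failure the builtin refreshes old_bits with the current
         contents of *iptr, so the next iteration recomputes the sum.  */
    }
  while (!__atomic_compare_exchange_n (iptr, &old_bits, new_bits,
                                       1 /* weak */, __ATOMIC_SEQ_CST,
                                       __ATOMIC_RELAXED));
  return old_val;
}
```

For example, with `double x = 1.5;` a call to
`sketch_fetch_add_double (&x, 2.25)` returns 1.5 and leaves x holding
3.75.  The comparison is on the bit pattern rather than the floating
point value, which is why the conversion goes through memcpy and not a
value cast.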

------------------------------
As Joseph Myers pointed out in response to my RFC, when performing
floating point operations in a CAS loop there is the floating point
exception information to take care of.  In order to take care of this
information I use the existing `atomic_assign_expand_fenv` target hook
to generate code that checks this information.

Partly due to the fact that this hook emits GENERIC code and partly due
to the language-specific semantics of floating point exceptions, this
means we now decide whether to emit a CAS loop in the frontend
(during overload resolution).  The frontend decides to only use the
underlying builtin if the backend has an optab defined that can
implement it directly.

------------------------------
Now that the expansion to a CAS loop is performed in overloaded builtin
resolution, this means that if the user were to directly use a resolved
version (e.g. `__atomic_fetch_add_fp` for a double) that would not
expand into a CAS loop inline.  Instead (assuming the optab is not
implemented for this target) it would pass through and end up using the
libatomic fallback.

This is not ideal, but I believe the complexity of adding another clause
for this expansion to a CAS loop is not worth the benefit of handling a
CAS loop expansion for this specific case (partly on the assumption that
users would rarely specify the resolved version and partly on the belief
that these resolved versions are not actually part of the user-facing
interface -- since they're not documented in the manual and don't seem
to be used enough for clang to expose the interface).

I considered not exposing the resolved versions to the user (similar to
the interface that _BitInt exposes) and instead handling them as an
internal function that could expand to call the libatomic
implementation.  I chose not to do that for consistency with the rest of
the atomic builtins.

------------------------------
There are a few places throughout the compiler that handle such atomic
builtins that I have not updated to handle floating point atomic
builtins.  Places like asan, tsan, gimple-ssa-warn-access, analyzer, and
tree-ssa-forwprop would need to be updated eventually.
However since the current state of GCC is that no backend implements
these optabs directly the generic version of the builtin is always
expanded as a CAS loop in the frontend -- this means these mid-end
passes will not see any of these builtins except in the case that the
user explicitly calls the resolved version.
I am hoping to update these places in a later patch (the patch where we
introduce the backend expansions).

------------------------------
Without adjustment, ix86_atomic_assign_expand_fenv generates code that
gets broken during optimisation by `fold`.  I believe the code returned
by this function was incorrect (maybe only bad for C++?).
The expression that gets incorrectly optimised is along the lines of:
COMPOUND_EXPR<TARGET_EXPR<var1, some-init>,
              TARGET_EXPR<var2, expression-with-var1>>
and `fold` (which gets called by `cp_fold`) removes the first
TARGET_EXPR since it doesn't look like it has side effects (even though
the variable it sets is used in the second expression).  Adding
`TREE_SIDE_EFFECTS` markers to this expression avoids the problem.

------------------------------
Testing done:
  Bootstrap and regression test passes on x86_64 and AArch64 (when
  run on top of the libatomic autoinclude patch that Prathamesh has
  posted).
  Cross compiler regression tests pass on arm-linux.
  Cross compiler regression tests pass on AArch64 linux with Qemu
  emulating a machine that does not have LSE.
  Similarly tested with a dummy implementation of fetch_add as an optab
  in the AArch64 backend to ensure that codepath also works.

------------------------------

Matthew Malcomson (10):
  libatomic: Split concept of SUFFIX and SIZE in libatomic
  libatomic: Add floating point implementations of fetch_{add,sub}
  c: c++: Define new floating point builtin fetch_add functions
  builtins: Add FP types for atomic builtin overload resolution
  c: c++: Expand into CAS loop in frontend
  builtins: optab: Tie the new atomic builtins to the backend
  testsuite: Add tests for fp resolutions of __atomic_fetch_add
  doc: Mention floating point atomic fetch_add etc in docs
  [Not For Commit] Add demo implementation of one of the operations
  i386: Mark a tree node in i386.cc as TREE_SIDE_EFFECTS

 gcc/builtin-types.def                         |   20 +
 gcc/builtins.cc                               |  176 ++
 gcc/builtins.h                                |    2 +
 gcc/c-family/c-common.cc                      |  217 ++-
 gcc/config/aarch64/aarch64.h                  |    2 +
 gcc/config/aarch64/aarch64.opt                |    5 +
 gcc/config/aarch64/atomics.md                 |   15 +
 gcc/config/i386/i386.cc                       |   17 +
 gcc/doc/extend.texi                           |    9 +
 gcc/fortran/f95-lang.cc                       |    5 +
 gcc/fortran/types.def                         |   17 +
 gcc/optabs.cc                                 |   19 +
 gcc/optabs.def                                |    6 +-
 gcc/sync-builtins.def                         |   40 +
 .../template/builtin-atomic-overloads.def     |   28 +-
 .../template/builtin-atomic-overloads6.C      |   23 +-
 .../template/builtin-atomic-overloads7.C      |   16 +-
 gcc/testsuite/gcc.dg/atomic-op-fp-convert.c   |    6 +
 gcc/testsuite/gcc.dg/atomic-op-fp-errs.c      |   14 +
 .../gcc.dg/atomic-op-fp-resolve-complain.c    |    5 +
 gcc/testsuite/gcc.dg/atomic-op-fp.c           |  198 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf.c          |  198 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf128.c       |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf16.c        |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c       |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf32.c        |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c       |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf64.c        |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c       |  201 +++
 gcc/testsuite/gcc.dg/atomic-op-fpl.c          |  198 +++
 .../gcc.dg/atomic/atomic-op-fp-fenv.c         |  376 +++++
 .../gcc.target/i386/excess-precision-13.c     |   87 +
 gcc/testsuite/lib/target-supports.exp         |  199 ++-
 libatomic/Makefile.am                         |   46 +-
 libatomic/Makefile.in                         |   49 +-
 libatomic/acinclude.m4                        |   56 +-
 libatomic/auto-config.h.in                    |  114 +-
 libatomic/cas_n.c                             |    8 +-
 libatomic/config/linux/aarch64/host-config.h  |   23 +-
 libatomic/config/linux/arm/host-config.h      |    2 +-
 libatomic/config/s390/cas_n.c                 |    6 +-
 libatomic/config/s390/exch_n.c                |    4 +-
 libatomic/config/s390/load_n.c                |    4 +-
 libatomic/config/s390/store_n.c               |    4 +-
 libatomic/config/x86/host-config.h            |   14 +-
 libatomic/configure                           | 1485 ++++++++++++++++-
 libatomic/configure.ac                        |    6 +
 libatomic/exch_n.c                            |   12 +-
 libatomic/fadd_n.c                            |   23 +-
 libatomic/fop_n.c                             |  111 +-
 libatomic/fsub_n.c                            |   23 +
 libatomic/libatomic.map                       |   44 +
 libatomic/libatomic_i.h                       |  186 ++-
 libatomic/load_n.c                            |   12 +-
 libatomic/store_n.c                           |   12 +-
 libatomic/tas_n.c                             |   12 +-
 libatomic/testsuite/Makefile.in               |    1 +
 .../testsuite/libatomic.c/atomic-op-fp-fenv.c |  421 +++++
 .../testsuite/libatomic.c/atomic-op-fp.c      |  219 +++
 .../testsuite/libatomic.c/atomic-op-fpf.c     |  219 +++
 .../testsuite/libatomic.c/atomic-op-fpf128.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf16.c   |  223 +++
 .../testsuite/libatomic.c/atomic-op-fpf16b.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf32.c   |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf32x.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf64.c   |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpf64x.c  |  220 +++
 .../testsuite/libatomic.c/atomic-op-fpl.c     |  219 +++
 68 files changed, 7943 insertions(+), 240 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-convert.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-errs.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp-resolve-complain.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf128.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpl.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic/atomic-op-fp-fenv.c
 create mode 100644 gcc/testsuite/gcc.target/i386/excess-precision-13.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp-fenv.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf128.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16b.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32x.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64x.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpl.c

-- 
2.43.0
