> -----Original Message-----
> From: [email protected] <[email protected]>
> Sent: 26 November 2025 18:42
> To: [email protected]; Jakub Jelinek <[email protected]>; Tobias
> Burnus <[email protected]>
> Cc: Julian Brown <[email protected]>; Thomas Schwinge
> <[email protected]>; Andrew Stubbs <[email protected]>; Tom de
> Vries <[email protected]>; Sebastian Huber <sebastian.huber@embedded-
> brains.de>; Matthew Malcomson <[email protected]>
> Subject: [PATCH 3/5] libgomp: Implement "flat" barrier for linux/ target
> 
> From: Matthew Malcomson <[email protected]>
> 
> Working on PR119588.  From that PR:
> We've seen on some internal workloads (NVPL BLAS running GEMM routine
> on a small matrix) that the overhead of a `#pragma omp parallel`
> statement when running with a high number of cores (72 or 144) is much
> higher with the libgomp implementation than with LLVM's libomp.
> 
> In a program which has both some work that can be handled with high
> parallelism (so OMP is running with many threads) and a large number
> of small pieces of work that need to be performed with low overhead,
> this per-parallel-region overhead accumulates into something
> significant.
> 
> ------------------------------
> Points that I would like feedback on:
> - Is this extra complexity OK, especially for the futex_waitv
>   fallback?
>   - I have thought about creating yet another target (in the same way
>     that the linux/ target is for kernels where `futex` is available).
>     Would this approach be favoured?
> - Are the changes in `gomp_team_start` significant enough for other
>   targets that it's worth bringing that function into the linux/
>   target (similar to what I see the nvptx/ target has done for
>   `gomp_team_start`)?  We could then have some `ifdef` for the
>   alternate barrier API rather than trying to modify all the other
>   interfaces to match a slightly modified API.
>   - Note that the extra changes required here "disappear" in the
>     second patch in this patch series.  This is because the second
>     patch removes the "simple-barrier" from the main loop in
>     gomp_thread_start for non-nested threads.  Since that
>     simple-barrier is no longer in the main loop it can be a
>     slower-but-simpler version of the barrier that can very easily be
>     changed to handle a different number of threads.  Hence the extra
>     considerations around adjusting the number of threads needed in
>     this patch become much smaller.
> - As it stands I've not adjusted any targets other than posix/ to work
>   with this alternate approach, as I hope that the second patch in the
>   series will also be accepted (removing the need to implement some
>   slightly tricky functionality in each backend).
>   - Please do mention if this seems acceptable.
>   - I adjusted the posix/ target since I could test it and I wanted to
>     ensure that the new API could work for other targets.
> 
> ------------------------------
> This patch introduces a barrier that performs better on the
> microbenchmark provided in PR119588.  I have run various higher-level
> benchmarks and have not seen any difference I can claim is outside the
> noise.
> 
> The outline of the barrier is:
> 1) Each thread associated with a barrier has its own ID (currently the
>    team_id of the team that the current thread is a part of).
> 2) When a thread arrives at the barrier it sets its "arrived" flag.
> 3) ID 0 (which is always the thread that does the bookkeeping for
>    this team; at the top level it's the primary thread) waits on each
>    of the other threads' "arrived" flags.  It executes tasks while
>    waiting on those threads to arrive.
> 4) Once ID 0 sees that all other threads have arrived it signals to
>    all secondary threads that the barrier is complete by incrementing
>    the global generation (as usual).
> 
> The fundamental difference that affects performance is that we have
> thread-local "arrived" flags instead of a single shared counter.  That
> means there is less contention, which appears to speed up the barrier
> significantly when threads are hitting it very hard.  A rough sketch
> of the scheme is below.
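>
> Roughly, as an illustrative sketch only (names like `arrived` are
> simplified stand-ins for the real `bar->threadgens[i].gen` fields;
> futex_wait/futex_wake and MEMMODEL_* are the usual libgomp wrappers):
>
>   /* Secondary thread with ID `id` (id != 0); `gen` is the generation
>      value it saw on entering the barrier.  */
>   __atomic_store_n (&bar->arrived[id], gen + 1, MEMMODEL_RELEASE);
>   while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == gen)
>     ;  /* Spin, then futex_wait, until the primary bumps the generation.  */
>
>   /* Primary thread (ID 0): gather every other thread, then release
>      them all at once.  */
>   for (unsigned i = 1; i < bar->total; i++)
>     while (__atomic_load_n (&bar->arrived[i], MEMMODEL_ACQUIRE) != gen + 1)
>       ;  /* Spin; the real code executes pending tasks here instead.  */
>   __atomic_store_n (&bar->generation, gen + 1, MEMMODEL_RELEASE);
>   futex_wake ((int *) &bar->generation, INT_MAX);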
> 
> Other interesting differences are:
> 1) The "coordinating" thread is pre-determined rather than "whichever
>    thread hits the barrier last".
>    - This has some knock-on effects w.r.t. how the barrier is used.
>      (See e.g. the changes in work.c).
>    - This also means that the behaviour of the pre-determined
>      "coordinating" thread must be more complicated while it's waiting
>      on other threads to arrive.  Especially see the task handling in
>      `gomp_team_barrier_ensure_last`.
>    - Because we assign the "coordinating" thread to be the primary
>      thread, this does mean that in between points (3) and (4) above
>      the primary thread can see the results of the operations of all
>      secondary threads.  This is important for another optimisation I
>      hope to make later, where we only go through one barrier per
>      iteration in the main execution loop of `gomp_thread_start` for
>      top-level threads.
> 2) When a barrier needs to be adjusted so that different threads have
>    different IDs we must serialise all threads before this adjustment.
>    - The previous design had no assignment of threads to IDs, which
>      meant the `gomp_simple_barrier_reinit` function in
>      `gomp_team_start` simply needed to handle increasing or
>      decreasing the number of threads a barrier is taking care of.
> 
> The two largest pieces of complexity in this work are related to those
> two differences.
> 
> The work in `gomp_team_barrier_ensure_last` (and the cancel variant)
> needs to watch for either a secondary thread arriving or a task being
> enqueued.  When neither of these events occurs for a while we need
> `futex_waitv`, or the fallback approach, to wait on one of the two
> events happening.
> - N.b. this fallback implementation for futex_waitv is particularly
>   complex.  It is needed for Linux kernels older than 5.16, which
>   don't have the `futex_waitv` syscall.  It is the only time where any
>   thread except the "primary" (for this team) adjusts
>   `bar->generation` while in the barrier, and it is the only time
>   where any thread modifies the "arrived" flag of some other thread.
>   The logic is that, after deciding to stop spinning on a given
>   secondary thread's "arrived" flag, the primary thread sets a bit on
>   that secondary thread's "arrived" flag to indicate that the primary
>   thread is waiting on this thread.  Then the primary thread does a
>   `futex_wait` on `bar->generation`.  If the secondary thread sees
>   this bit set on its "arrived" flag it sets another bit,
>   `BAR_SECONDARY_ARRIVED`, on `bar->generation` to wake up the primary
>   thread.  Other threads performing tasking ignore that bit.
>   (A condensed sketch of this handshake follows this list.)
> - The possibility of some secondary thread adjusting `bar->generation`
>   without taking any lock means that the `gomp_team_barrier_*`
>   functions used in general code that modify this variable now all
>   need to be atomic.  These memory modifications don't need to
>   synchronise with each other -- they are sending signals to
>   completely different places -- but we need to make sure that each
>   RMW is atomic and doesn't overwrite a bit set elsewhere.
> - Since we are now adjusting bar->generation atomically, we can do the
>   same in `gomp_team_barrier_cancel` and avoid taking the task_lock
>   mutex there.
>   - N.b. I believe we're also avoiding some existing UB since there is
>     an atomic store to `bar->generation` in the gomp_barrier_wait_end
>     variations done outside of the `task_lock` mutex.
>     In C I believe that any write and read of a single variable that
>     are not ordered by "happens-before" must both be atomic to avoid
>     UB, even if on the architectures that libgomp supports the reads
>     and writes actually emitted would be atomic and any data race
>     would be benign.
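>
> A condensed sketch of that fallback handshake (simplified from the
> description above rather than copied from the patch; here `tstate` is
> the thread-local generation value expected before arrival and `gstate`
> a snapshot of `bar->generation`):
>
>   /* Primary thread, after giving up spinning on secondary `i`.  */
>   unsigned tg = __atomic_fetch_or (&bar->threadgens[i].gen,
>                                    PRIMARY_WAITING_TG, MEMMODEL_ACQUIRE);
>   if (tg == tstate)
>     /* Secondary still hadn't arrived: sleep until something changes on
>        the global generation (task pending, or secondary arrived).  */
>     futex_wait ((int *) &bar->generation, gstate);
>
>   /* Secondary thread `id`, on arrival (cf.
>      gomp_assert_and_increment_flag in the patch).  */
>   unsigned orig = __atomic_fetch_add (&bar->threadgens[id].gen, BAR_INCR,
>                                       MEMMODEL_RELEASE);
>   if (orig & PRIMARY_WAITING_TG)
>     {
>       /* The primary thread is asleep waiting on us: flag our arrival
>          on the global generation and wake it.  */
>       __atomic_fetch_or (&bar->generation, BAR_SECONDARY_ARRIVED,
>                          MEMMODEL_RELEASE);
>       futex_wake ((int *) &bar->generation, INT_MAX);
>     }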
> 
> The adjustment of how barriers are initialised in `gomp_team_start` is
> related to the second point above.  If we are removing some threads
> and adding others to the thread pool then we need to serialise
> execution before adjusting the barrier state to be ready for some new
> threads to take the original team IDs.
> This is a clear cost to pay for this different barrier, but it does
> not seem to have cost much in the benchmarks I've run.
> - To note again, this cost is not in the final version of libgomp
>   after the entirety of the patch series I'm posting.  That is because
>   in the second patch we no longer need to adjust the size of one of
>   these alternate barrier implementations.
> 
> ------------------------------
> I've had to adjust the interface to the other barriers.  Most of the
> cases simply add an extra unused parameter.  There are also some
> dummied-out functions to add.
>
> The biggest difference from the other barrier interfaces is in the
> reinitialisation steps.  The posix/ barrier has a reinit function that
> adjusts the number of threads it will wait on, and no serialisation is
> required in order to perform this re-init.  Instead, when finishing
> some threads and starting others, we increase the "total" to cover all
> of them, then once all threads have arrived at the barrier we decrease
> the "total" so that it is correct for "the next time around" (roughly
> sketched below).
> - Again worth remembering that these extra reinit functions disappear
>   when the second patch of this series is applied.
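>
> In rough pseudo-C the grow-then-shrink idea looks like the following
> (field and count names here are hypothetical, not the actual posix/
> bar.h code; the real work happens in the reinit functions added by
> this patch):
>
>   /* Before releasing the old team: the next round of the barrier must
>      wait for reused, departing and newly started threads alike.  */
>   bar->total = nthreads_reused + nthreads_departing + nthreads_new;
>
>   /* Once each of those threads has arrived once, the departing
>      threads are gone for good; only count the ones that remain.  */
>   bar->total = nthreads_reused + nthreads_new;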
> 
> Testing done:
> - Bootstrap & regtest on aarch64 and x86_64.
>   - With & without _LIBGOMP_CHECKING_.
>   - Testsuite with & without OMP_WAIT_POLICY=passive
>   - With and without configure `--enable-linux-futex=no` for posix
>     target.
>   - Forcing use of futex_waitv syscall, and of fallback.
> - Cross compilation & regtest on arm.
> - TSAN done on this as part of all my upstream patches.
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702031.html

Thanks,
Prathamesh
> 
> libgomp/ChangeLog:
> 
>         * barrier.c (GOMP_barrier, GOMP_barrier_cancel): Pass team_id
>         to barrier functions.
>         * config/linux/bar.c (gomp_barrier_ensure_last): New.
>         (gomp_assert_and_increment_flag): New.
>         (gomp_team_barrier_ensure_last): New.
>         (gomp_assert_and_increment_cancel_flag): New.
>         (gomp_team_barrier_ensure_cancel_last): New.
>         (gomp_barrier_wait_end): Extra assertions.
>         (gomp_barrier_wait,gomp_team_barrier_wait,
>         gomp_team_barrier_wait_cancel): Rewrite to use corresponding
>         gomp_barrier_ensure_last variant.
>         (gomp_barrier_wait_last): Rewrite for new parameter.
>         (gomp_team_barrier_wait_end): Update to handle new barrier
>         flags & new parameter.
>         (gomp_team_barrier_wait_cancel_end): Likewise.
>         (gomp_team_barrier_wait_final): Delegate to
>         gomp_team_barrier_wait.
>         (gomp_team_barrier_cancel): Rewrite to account for futex_waitv
>         fallback.
>         * config/linux/bar.h (struct gomp_barrier_t): Add `threadgens`,
>         remove `awaited` and `awaited_final`.
>         (BAR_SECONDARY_ARRIVED): New.
>         (BAR_SECONDARY_CANCELLABLE_ARRIVED): New.
>         (BAR_CANCEL_INCR): New.
>         (BAR_TASK_PENDING): Shift to accommodate new bits.
>         (BAR_INCR): Shift to accommodate new bits.
>         (BAR_FLAGS_MASK): New.
>         (BAR_GEN_MASK): New.
>         (BAR_BOTH_GENS_MASK): New.
>         (BAR_CANCEL_GEN_MASK): New.
>         (BAR_INCREMENT_CANCEL): New.
>         (PRIMARY_WAITING_TG): New.
>         (gomp_assert_seenflags): New.
>         (gomp_barrier_has_space): New.
>         (gomp_barrier_minimal_reinit): New.
>         (gomp_barrier_init): Account for threadgens, remove awaited and
>         awaited_final.
>         (gomp_barrier_reinit): Remove.
>         (gomp_barrier_reinit_1): New.
>         (gomp_barrier_destroy): Free new `threadgens` member.
>         (gomp_barrier_wait): Add new `id` parameter.
>         (gomp_barrier_wait_last): Likewise.
>         (gomp_barrier_wait_end): Likewise.
>         (gomp_team_barrier_wait): Likewise.
>         (gomp_team_barrier_wait_final): Likewise.
>         (gomp_team_barrier_wait_cancel): Likewise.
>         (gomp_team_barrier_wait_cancel_end): Likewise.
>         (gomp_barrier_reinit_2): New (dummy) implementation.
>         (gomp_team_barrier_ensure_last): New declaration.
>         (gomp_barrier_ensure_last): New declaration.
>         (gomp_team_barrier_ensure_cancel_last): New declaration.
>         (gomp_assert_and_increment_flag): New declaration.
>         (gomp_assert_and_increment_cancel_flag): New declaration.
>         (gomp_barrier_wait_start): Rewrite.
>         (gomp_barrier_wait_cancel_start): Rewrite.
>         (gomp_barrier_wait_final_start): Rewrite.
>         (gomp_reset_cancellable_primary_threadgen): New.
>         (gomp_team_barrier_set_task_pending): Make atomic.
>         (gomp_team_barrier_clear_task_pending): Make atomic.
>         (gomp_team_barrier_set_waiting_for_tasks): Make atomic.
>         (gomp_team_barrier_waiting_for_tasks): Make atomic.
>         (gomp_increment_gen): New to account for cancel/non-cancel
>         distinction when incrementing generation.
>         (gomp_team_barrier_done): Use gomp_increment_gen.
>         (gomp_barrier_state_is_incremented): Use gomp_increment_gen.
>         (gomp_barrier_has_completed): Load generation atomically.
>         (gomp_barrier_prepare_reinit): New.
>         * config/linux/wait.h (FUTEX_32): New.
>         (spin_count): Extracted from do_spin.
>         (do_spin): Use spin_count.
>         * config/posix/bar.c (gomp_barrier_wait,
>         gomp_team_barrier_wait_end, gomp_team_barrier_wait_cancel_end,
>         gomp_team_barrier_wait, gomp_team_barrier_wait_cancel): Add and
>         pass through new parameters.
>         * config/posix/bar.h (gomp_barrier_wait, gomp_team_barrier_wait,
>         gomp_team_barrier_wait_cancel,
>         gomp_team_barrier_wait_cancel_end, gomp_barrier_wait_start,
>         gomp_barrier_wait_cancel_start, gomp_team_barrier_wait_final,
>         gomp_barrier_wait_last, gomp_team_barrier_done): Add and pass
>         through new id parameter.
>         (gomp_barrier_prepare_reinit): New.
>         (gomp_barrier_minimal_reinit): New.
>         (gomp_barrier_reinit_1): New.
>         (gomp_barrier_reinit_2): New.
>         (gomp_barrier_has_space): New.
>         (gomp_team_barrier_ensure_last): New.
>         (gomp_team_barrier_ensure_cancel_last): New.
>         (gomp_reset_cancellable_primary_threadgen): New.
>         * config/posix/simple-bar.h (gomp_simple_barrier_reinit):
>         (gomp_simple_barrier_minimal_reinit): New.
>         (gomp_simple_barrier_reinit_1): New.
>         (gomp_simple_barrier_wait): Add new parameter.
>         (gomp_simple_barrier_wait_last): Add new parameter.
>         (gomp_simple_barrier_prepare_reinit): New.
>         (gomp_simple_barrier_reinit_2): New.
>         (gomp_simple_barrier_has_space): New.
>         * libgomp.h (gomp_assert): New.
>         (gomp_barrier_handle_tasks): Add new parameter.
>         * single.c (GOMP_single_copy_start): Pass new parameter.
>         (GOMP_single_copy_end): Pass new parameter.
>         * task.c (gomp_barrier_handle_tasks): Add and pass new
>         parameter through.
>         (GOMP_taskgroup_end): Pass new parameter.
>         (GOMP_workshare_task_reduction_unregister): Pass new
>         parameter.
>         * team.c (gomp_thread_start): Pass new parameter.
>         (gomp_free_pool_helper): Likewise.
>         (gomp_free_thread): Likewise.
>         (gomp_team_start): Use new API for reinitialising a barrier.
>         This requires passing information on which threads are being
>         reused and which are not.
>         (gomp_team_end): Pass new parameter.
>         (gomp_pause_pool_helper): Likewise.
>         (gomp_pause_host): Likewise.
>         * work.c (gomp_work_share_end): Directly use
>         `gomp_team_barrier_ensure_last`.
>         (gomp_work_share_end_cancel): Similar for
>         gomp_team_barrier_ensure_cancel_last but also ensure that
>         thread local generations are reset if barrier is cancelled.
>         * config/linux/futex_waitv.h: New file.
>         * testsuite/libgomp.c/primary-thread-tasking.c: New test.
> 
> Signed-off-by: Matthew Malcomson <[email protected]>
> ---
>  libgomp/barrier.c                             |   4 +-
>  libgomp/config/linux/bar.c                    | 660 ++++++++++++++++--
>  libgomp/config/linux/bar.h                    | 403 +++++++++--
>  libgomp/config/linux/futex_waitv.h            | 129 ++++
>  libgomp/config/linux/wait.h                   |  15 +-
>  libgomp/config/posix/bar.c                    |  29 +-
>  libgomp/config/posix/bar.h                    | 101 ++-
>  libgomp/config/posix/simple-bar.h             |  39 +-
>  libgomp/libgomp.h                             |  11 +-
>  libgomp/single.c                              |   4 +-
>  libgomp/task.c                                |  12 +-
>  libgomp/team.c                                | 127 +++-
>  .../libgomp.c/primary-thread-tasking.c        |  80 +++
>  libgomp/work.c                                |  26 +-
>  14 files changed, 1451 insertions(+), 189 deletions(-)
>  create mode 100644 libgomp/config/linux/futex_waitv.h
>  create mode 100644 libgomp/testsuite/libgomp.c/primary-thread-tasking.c
> 
> diff --git a/libgomp/barrier.c b/libgomp/barrier.c
> index 244dadd1adb..12e54a1b2b6 100644
> --- a/libgomp/barrier.c
> +++ b/libgomp/barrier.c
> @@ -38,7 +38,7 @@ GOMP_barrier (void)
>    if (team == NULL)
>      return;
> 
> -  gomp_team_barrier_wait (&team->barrier);
> +  gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
>  }
> 
>  bool
> @@ -50,5 +50,5 @@ GOMP_barrier_cancel (void)
>    /* The compiler transforms to barrier_cancel when it sees that the
>       barrier is within a construct that can cancel.  Thus we should
>       never have an orphaned cancellable barrier.  */
> -  return gomp_team_barrier_wait_cancel (&team->barrier);
> +  return gomp_team_barrier_wait_cancel (&team->barrier, thr-
> >ts.team_id);
>  }
> diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
> index 9eaec0e5f23..25f7e04dd16 100644
> --- a/libgomp/config/linux/bar.c
> +++ b/libgomp/config/linux/bar.c
> @@ -29,15 +29,76 @@
> 
>  #include <limits.h>
>  #include "wait.h"
> +#include "futex_waitv.h"
> 
> +void
> +gomp_barrier_ensure_last (gomp_barrier_t *bar, unsigned id,
> +                         gomp_barrier_state_t state)
> +{
> +  gomp_assert (id == 0, "Calling ensure_last in thread %u", id);
> +  gomp_assert (!(state & BAR_CANCELLED),
> +              "BAR_CANCELLED set when using plain barrier: %u",
> state);
> +  unsigned tstate = state & BAR_GEN_MASK;
> +  struct thread_lock_data *arr = bar->threadgens;
> +  for (unsigned i = 1; i < bar->total; i++)
> +    {
> +      unsigned tmp;
> +      do
> +       {
> +         do_wait ((int *) &arr[i].gen, tstate);
> +         tmp = __atomic_load_n (&arr[i].gen, MEMMODEL_ACQUIRE);
> +         gomp_assert (tmp == tstate || tmp == (tstate + BAR_INCR),
> +                      "Thread-local state seen to be %u"
> +                      " when global gens are %u",
> +                      tmp, tstate);
> +       }
> +      while (tmp != (tstate + BAR_INCR));
> +    }
> +  gomp_assert_seenflags (bar, false);
> +}
> 
>  void
> -gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
> state)
> +gomp_assert_and_increment_flag (gomp_barrier_t *bar, unsigned id,
> unsigned gens)
>  {
> +  /* Our own personal flag, so don't need to atomically read it
> except in the
> +     case of the fallback code for when running on an older Linux
> kernel
> +     (without futex_waitv).  Do need to atomically update it for the
> primary
> +     thread to read and form an acquire-release synchronisation from
> our thread
> +     to the primary thread.  */
> +  struct thread_lock_data *arr = bar->threadgens;
> +  unsigned orig = __atomic_fetch_add (&arr[id].gen, BAR_INCR,
> MEMMODEL_RELEASE);
> +  futex_wake ((int *) &arr[id].gen, INT_MAX);
> +  /* This clause is only to handle the fallback when `futex_waitv` is
> not
> +     available on the kernel we're running on.  For the logic of this
> +     particular synchronisation see the comment in
> +     `gomp_team_barrier_ensure_last`.  */
> +  unsigned gen = gens & BAR_GEN_MASK;
> +  if (__builtin_expect (orig == (gen | PRIMARY_WAITING_TG), 0))
> +    {
> +      __atomic_fetch_or (&bar->generation, BAR_SECONDARY_ARRIVED,
> +                        MEMMODEL_RELEASE);
> +      futex_wake ((int *) &bar->generation, INT_MAX);
> +    }
> +  else
> +    {
> +      gomp_assert (orig == gen, "Original flag %u != generation of
> %u\n", orig,
> +                  gen);
> +    }
> +}
> +
> +void
> +gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
> state,
> +                      unsigned id)
> +{
> +  gomp_assert (!(state & BAR_CANCELLED),
> +              "BAR_CANCELLED set when using plain barrier: %u",
> state);
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> -      /* Next time we'll be awaiting TOTAL threads again.  */
> -      bar->awaited = bar->total;
> +      gomp_assert (id == 0, "Id %u believes it is last\n", id);
> +      /* Shouldn't have anything modifying bar->generation at this
> point.  */
> +      gomp_assert ((bar->generation & ~BAR_BOTH_GENS_MASK) == 0,
> +                  "flags set in gomp_barrier_wait_end: %u",
> +                  bar->generation & ~BAR_BOTH_GENS_MASK);
>        __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
>                         MEMMODEL_RELEASE);
>        futex_wake ((int *) &bar->generation, INT_MAX);
> @@ -51,9 +112,12 @@ gomp_barrier_wait_end (gomp_barrier_t *bar,
> gomp_barrier_state_t state)
>  }
> 
>  void
> -gomp_barrier_wait (gomp_barrier_t *bar)
> +gomp_barrier_wait (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
> +  gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
> +  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> +    gomp_barrier_ensure_last (bar, id, state);
> +  gomp_barrier_wait_end (bar, state, id);
>  }
> 
>  /* Like gomp_barrier_wait, except that if the encountering thread
> @@ -64,11 +128,29 @@ gomp_barrier_wait (gomp_barrier_t *bar)
>     the barrier can be safely destroyed.  */
> 
>  void
> -gomp_barrier_wait_last (gomp_barrier_t *bar)
> +gomp_barrier_wait_last (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_barrier_state_t state = gomp_barrier_wait_start (bar);
> +  gomp_assert (id != 0, "Thread with id %u called
> gomp_barrier_wait_last", id);
> +  gomp_barrier_wait_start (bar, id);
> +  /* N.b. For the intended use of this function we don't actually
> need to do
> +     the below.  `state & BAR_WAS_LAST` should never be non-zero.
> The
> +     assertion above checks that `id != 0` and that's the only case
> when this
> +     could be true.  However we write down what the code should be
> if/when
> +     `gomp_barrier_wait_last` gets used in a different way.
> +
> +     If we do want to provide the possibility for non-primary threads
> to know
> +     that this barrier can be destroyed we would need that non-
> primary thread
> +     to wait on the barrier (as it would having called
> +     `gomp_barrier_wait`), and
> +     the primary thread to signal to it that all other threads have
> arrived.
> +     That would be done with the below (having gotten `state` from
> the return
> +     value of `gomp_barrier_wait_start` call above):
> +
>    if (state & BAR_WAS_LAST)
> -    gomp_barrier_wait_end (bar, state);
> +    {
> +      gomp_barrier_ensure_last (bar, id, state);
> +      gomp_barrier_wait_end (bar, state, id);
> +    }
> +     */
>  }
> 
>  void
> @@ -77,24 +159,231 @@ gomp_team_barrier_wake (gomp_barrier_t *bar, int
> count)
>    futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX :
> count);
>  }
> 
> +/* We need to have some "ensure we're last, but perform useful work
> +   in the meantime" function.  This allows the primary thread to
> perform useful
> +   work while it is waiting on all secondary threads.
> +   - Again related to the fundamental difference between this barrier
> and the
> +     "centralized" barrier where the thread doing the bookkeeping is
> +     pre-determined as some "primary" thread and that is not
> necessarily the
> +     last thread to enter the barrier.
> +
> +   To do that we loop through checking each of the other threads'
> +   flags.  If any are not set then we take that opportunity to check
> +   the global generation number; if there's some task to handle then
> +   do so before going back to checking the remaining thread-local
> +   flags.
> +
> +   This approach means that the cancellable barriers work reasonably
> naturally.
> +   If we're checking the global generation flag then we can see when
> it is
> +   cancelled.  Hence `gomp_team_barrier_ensure_cancel_last` below
> does
> +   something very similar to this function here.
> +
> +   The "wait for wakeup" is a little tricky.  When there is nothing
> for a
> +   thread to do we usually call `futex_wait` on an address.  In this
> case we
> +   want to wait on one of two addresses being changed.  In Linux
> kernels >=
> +   5.16 there is the `futex_waitv` syscall which provides us exactly
> that
> +   functionality, but in older kernels we have to implement some
> mechanism to
> +   emulate the functionality.
> +
> +   On these older kernels (as a fallback implementation) we do the
> following:
> +   - Primary fetch_add's the PRIMARY_WAITING_TG bit to the thread-
> local
> +     generation number of the secondary it's waiting on.
> +     - If primary sees that the BAR_INCR was added then secondary
> reached
> +       barrier as we decided to go to sleep waiting for it.
> +       Clear the flag and continue.
> +     - Otherwise primary futex_wait's on the generation number.
> +       - Past this point the only state that this primary thread will
> use
> +        to indicate that this particular thread has arrived will be
> +        BAR_SECONDARY_ARRIVED set on the barrier-global generation
> number.
> +        (the increment of the thread-local generation number no
> longer means
> +        anything to the primary thread, though we do assert that it's
> done
> +        in a checking build since that is still an invariant that
> must hold
> +        for later uses of this barrier).
> +       - Can get woken up by a new task being added
> (BAR_TASK_PENDING).
> +       - Can get woken up by BAR_SECONDARY_ARRIVED bit flag in the
> global
> +        generation number saying "secondary you were waiting on has
> +        arrived".
> +        - In this case it clears that bit and returns to looking to
> see if
> +          all secondary threads have arrived.
> +   Meanwhile secondary thread:
> +   - Checks the value it gets in `gomp_assert_and_increment_flag`.
> +     - If it has the PRIMARY_WAITING_TG flag then set
> +       BAR_SECONDARY_ARRIVED on global generation number and
> futex_wake
> +       it.
> +   Each secondary thread has to ignore BAR_SECONDARY_ARRIVED in their
> loop
> +   (will still get woken up by it, just continue around the loop
> until it's
> +   cleared).
> +
> +   N.b. using this fallback mechanism is the only place where any
> thread
> +   modifies a thread-local generation number that is not its own.
> This is also
> +   the only place where a secondary thread would modify bar-
> >generation without
> +   a lock held.  Hence these modifications need a reasonable amount
> of care
> +   w.r.t.  atomicity.  */
>  void
> -gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
> state)
> +gomp_team_barrier_ensure_last (gomp_barrier_t *bar, unsigned id,
> +                              gomp_barrier_state_t state)
> +{
> +  /* Thought process here:
> +     - Until we are the last thread, we are "some thread waiting on
> the
> +       barrier".  Hence we should doing only a variant of what is
> done in
> +       `gomp_team_barrier_wait_end` for the other threads.
> +     - The loop done in that function is:
> +       1) Wait on `generation` having some flag.
> +       2) When it changes, enforce the acquire-release semantics and
> +         re-load.
> +       3) If added `BAR_TASK_PENDING`, then handle some tasks.
> +       4) Ignore `BAR_WAITING_FOR_TASK` flag.
> +     - Hence the loop done in this function is:
> +       1) Look through each thread-local generation number.
> +         - When this is not incremented, then:
> +           1. wait on `generation` having some flag.
> +           2. When it does have such a flag, enforce the acquire-
> release
> +              semantics and load.
> +           3. If added `BAR_TASK_PENDING` then handle some tasks.
> +           4. Ignore `BAR_WAITING_FOR_TASK` not necessary because
> it's us that
> +              would set that flag.
> +       Differences otherwise are that we put pausing in different
> positions.  */
> +  gomp_assert (id == 0, "Calling ensure_last in thread %u\n", id);
> +  unsigned gstate = state & (BAR_BOTH_GENS_MASK | BAR_CANCELLED);
> +  unsigned tstate = state & BAR_GEN_MASK;
> +  struct thread_lock_data *arr = bar->threadgens;
> +  for (unsigned i = 1; i < bar->total; i++)
> +    {
> +      unsigned long long j, count = spin_count ();
> +
> +    wait_on_this_thread:
> +      /* Use `j <= count` here just in case `gomp_spin_count_var ==
> 0` (which
> +        can happen with OMP_WAIT_POLICY=passive).  We need to go
> around this
> +        loop at least once to check and handle any changes.  */
> +      for (j = 0; j <= count; j++)
> +       {
> +         /* Thought about using MEMMODEL_ACQUIRE or MEMMODEL_RELAXED
> until we
> +            see a difference and *then* MEMMODEL_ACQUIRE for the
> +            acquire-release semantics.  Idea was that the more
> relaxed memory
> +            model might provide a performance boost.  Did not see any
> +            improvement on a micro-benchmark and decided not to do
> that.
> +
> +            TODO Question whether to run one loop checking both
> variables, or
> +                 two loops each checking one variable multiple times.
> Suspect
> +                 that one loop checking both variables is going to be
> more
> +                 responsive and more understandable in terms of
> performance
> +                 characteristics when specifying GOMP_SPINCOUNT.
> +                 However there's the chance that checking each
> variable
> +                 something like 10000 times and going around this
> loop
> +                 count/10000 times could give better throughput?  */
> +         unsigned int threadgen
> +           = __atomic_load_n (&arr[i].gen, MEMMODEL_ACQUIRE);
> +         /* If this thread local generation number has
> PRIMARY_WAITING_TG set,
> +            then we must have set it (primary thread is the only one
> that sets
> +            this flag).  The mechanism by which that would be set is:
> +            1) We dropped into `futex_waitv` while waiting on this
> thread
> +            2) Some secondary thread woke us (either by setting
> +               `BAR_TASK_PENDING` or the thread we were waiting on
> setting
> +               `BAR_SECONDARY_ARRIVED`).
> +            3) We came out of `futex_wake` and started waiting on
> this thread
> +               again.
> +            Go into the `bar->generation` clause below to reset state
> around
> +            here and we'll go back to this thing later.  N.b. having
> +            `PRIMARY_WAITING_TG` set on a given thread local
> generation flag
> +            means that the "continue" flag for this thread is now
> seeing
> +            `BAR_SECONDARY_ARRIVED` set on the global generation
> flag.  */
> +         if (__builtin_expect (threadgen != tstate, 0)
> +             && __builtin_expect (!(threadgen & PRIMARY_WAITING_TG),
> 1))
> +           {
> +             gomp_assert (threadgen == (tstate + BAR_INCR),
> +                          "Thread-local state seen to be %u"
> +                          " when global state was %u.\n",
> +                          threadgen, tstate);
> +             goto wait_on_next_thread;
> +           }
> +         /* The only need for MEMMODEL_ACQUIRE below is when using the
> +            futex_waitv fallback for older kernels.  When this happens
> +            we use the BAR_SECONDARY_ARRIVED flag for synchronisation
> +            and need to ensure that the acquire-release
> +            synchronisation is formed from secondary thread to primary
> +            thread as per the OpenMP flush requirements on a
> +            barrier.  */
> +         unsigned int gen
> +           = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> +         if (__builtin_expect (gen != gstate, 0))
> +           {
> +             /* If there are some tasks to perform then perform them.
> */
> +             if (gen & BAR_TASK_PENDING)
> +               gomp_barrier_handle_tasks (gstate, false);
> +             if (gen & BAR_SECONDARY_ARRIVED)
> +               {
> +                 /* Secondary thread has arrived, clear the flag on
> the
> +                    thread local generation and the global
> generation.
> +                    Invariants are:
> +                    - BAR_SECONDARY_ARRIVED is only ever set by the
> secondary
> +                      threads and only when their own thread-local
> generation
> +                      has the flag PRIMARY_WAITING_TG.
> +                    - PRIMARY_WAITING_TG flag is only ever set by the
> primary
> +                      thread.
> +                    - Release-Acquire ordering on the generation
> number means
> +                      that as far as this thread is concerned the
> relevant
> +                      secondary thread must have incremented its
> generation.
> +                    - When using the fallback the required "flush"
> release
> +                      semantics are provided by the Release-Acquire
> +                      synchronisation on the generation number (with
> +                      BAR_SECONDARY_ARRIVED). Otherwise it's provided
> by the
> +                      Release-Acquire ordering on the thread-local
> generation.
> +                 */
> +                 __atomic_fetch_and (&bar->generation,
> ~BAR_SECONDARY_ARRIVED,
> +                                     MEMMODEL_RELAXED);
> +                 gomp_assert (
> +                   (threadgen == ((tstate + BAR_INCR) |
> PRIMARY_WAITING_TG))
> +                     || (threadgen == (tstate | PRIMARY_WAITING_TG)),
> +                   "Thread %d local generation is %d but expected"
> +                   " PRIMARY_WAITING_TG set because bar->generation"
> +                   " marked with SECONDARY (%d)",
> +                   i, threadgen, gen);
> +                 __atomic_fetch_and (&arr[i].gen,
> ~PRIMARY_WAITING_TG,
> +                                     MEMMODEL_RELAXED);
> +               }
> +             /* There should be no other way in which this can be
> different. */
> +             gomp_assert (
> +               (gen & BAR_CANCELLED) == (gstate & BAR_CANCELLED),
> +               "Unnecessary looping due to mismatching BAR_CANCELLED"
> +               " bar->generation: %u   state: %u",
> +               gen, gstate);
> +             gomp_assert (
> +               !(gen & BAR_WAITING_FOR_TASK),
> +               "BAR_WAITING_FOR_TASK set by non-primary thread
> gen=%d", gen);
> +             gomp_assert (
> +               !gomp_barrier_state_is_incremented (gen, gstate,
> BAR_INCR),
> +               "Global state incremented by non-primary thread
> gen=%d", gen);
> +             goto wait_on_this_thread;
> +           }
> +       }
> +      /* Neither thread-local generation number nor global generation
> seem to
> +        be changing.  Wait for one of them to change.  */
> +      futex_waitv ((int *) &arr[i].gen, tstate, (int *) &bar-
> >generation,
> +                  gstate);
> +      /* One of the above values has changed, go back to the start of
> this loop
> +       * and we can find out what it was and deal with it
> accordingly.  */
> +      goto wait_on_this_thread;
> +
> +    wait_on_next_thread:
> +      continue;
> +    }
> +  gomp_assert_seenflags (bar, false);
> +}
> +
> +void
> +gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
> state,
> +                           unsigned id)
>  {
>    unsigned int generation, gen;
> 
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> -      /* Next time we'll be awaiting TOTAL threads again.  */
> +      gomp_assert (id == 0, "Id %u believes it is last\n", id);
>        struct gomp_thread *thr = gomp_thread ();
>        struct gomp_team *team = thr->ts.team;
> -
> -      bar->awaited = bar->total;
>        team->work_share_cancelled = 0;
>        unsigned task_count
>         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (task_count, 0))
>         {
> -         gomp_barrier_handle_tasks (state);
> +         gomp_barrier_handle_tasks (state, false);
>           state &= ~BAR_WAS_LAST;
>         }
>        else
> @@ -113,103 +402,376 @@ gomp_team_barrier_wait_end (gomp_barrier_t
> *bar, gomp_barrier_state_t state)
>      {
>        do_wait ((int *) &bar->generation, generation);
>        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> +      gomp_assert ((gen == state + BAR_INCR)
> +                    || (gen & BAR_CANCELLED) == (generation &
> BAR_CANCELLED)
> +                    /* Can cancel a barrier when already gotten into
> final
> +                       implicit barrier at the end of a parallel
> loop.
> +                       This happens in `cancel-parallel-3.c`.
> +                       In this case the above assertion does not hold
> because
> +                       we are waiting on the implicit barrier at the
> end of a
> +                       parallel region while some other thread is
> performing
> +                       work in that parallel region, hits a
> +                       `#pragma omp cancel parallel`, and sets said
> flag.  */
> +                    || !(generation & BAR_CANCELLED),
> +                  "Unnecessary looping due to BAR_CANCELLED diff"
> +                  " gen: %u  generation: %u  id: %u",
> +                  gen, generation, id);
>        if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
>         {
> -         gomp_barrier_handle_tasks (state);
> +         gomp_barrier_handle_tasks (state, false);
>           gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>         }
> +      /* These flags will not change until this barrier is completed.
> +        Going forward we don't want to be continually waking up
> checking for
> +        whether this barrier has completed yet.
> +
> +        If the barrier is cancelled but there are tasks yet to
> perform then
> +        some thread will have used `gomp_barrier_handle_tasks` to go
> through
> +        all tasks and drop them.  */
>        generation |= gen & BAR_WAITING_FOR_TASK;
> +      generation |= gen & BAR_CANCELLED;
> +      /* Other flags that may be set in `bar->generation` are:
> +        1) BAR_SECONDARY_ARRIVED
> +        2) BAR_SECONDARY_CANCELLABLE_ARRIVED
> +        While we want to ignore these, they should be transient and
> quickly
> +        removed, hence we don't adjust our expected `generation`
> accordingly.
> +        TODO Would be good to benchmark both approaches.  */
>      }
> -  while (!gomp_barrier_state_is_incremented (gen, state));
> +  while (!gomp_barrier_state_is_incremented (gen, state, false));
>  }
> 
>  void
> -gomp_team_barrier_wait (gomp_barrier_t *bar)
> +gomp_team_barrier_wait (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
> +  gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
> +  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> +    gomp_team_barrier_ensure_last (bar, id, state);
> +  gomp_team_barrier_wait_end (bar, state, id);
>  }
> 
>  void
> -gomp_team_barrier_wait_final (gomp_barrier_t *bar)
> +gomp_team_barrier_wait_final (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
> -  if (__builtin_expect (state & BAR_WAS_LAST, 0))
> -    bar->awaited_final = bar->total;
> -  gomp_team_barrier_wait_end (bar, state);
> +  return gomp_team_barrier_wait (bar, id);
> +}
> +
> +void
> +gomp_assert_and_increment_cancel_flag (gomp_barrier_t *bar, unsigned
> id,
> +                                      unsigned gens)
> +{
> +  struct thread_lock_data *arr = bar->threadgens;
> +  /* Because we have separate thread local generation numbers for
> cancellable
> +     barriers and non-cancellable barriers, we can use fetch_add in
> the
> +     cancellable thread-local generation number and just let the bits
> roll over
> +     into the BAR_INCR area.  This means we don't have to have a CAS
> loop to
> +     handle the futex_waitv_fallback possibility that the primary
> thread
> +     updates our thread-local variable at the same time as we do.  */
> +  unsigned orig
> +    = __atomic_fetch_add (&arr[id].cgen, BAR_CANCEL_INCR,
> MEMMODEL_RELEASE);
> +  futex_wake ((int *) &arr[id].cgen, INT_MAX);
> +
> +  /* However, using that fetch_add means that we need to mask out the
> values
> +     we're comparing against.  Since this masking is not a memory
> operation we
> +     believe that trade-off is good.  */
> +  unsigned orig_cgen = orig & (BAR_CANCEL_GEN_MASK | BAR_FLAGS_MASK);
> +  unsigned global_cgen = gens & BAR_CANCEL_GEN_MASK;
> +  if (__builtin_expect (orig_cgen == (global_cgen |
> PRIMARY_WAITING_TG), 0))
> +    {
> +      unsigned prev = __atomic_fetch_or (&bar->generation,
> +
> BAR_SECONDARY_CANCELLABLE_ARRIVED,
> +                                        MEMMODEL_RELAXED);
> +      /* Wait! The barrier got cancelled, this flag is nothing but
> annoying
> +        state to ignore in the non-cancellable barrier that will be
> coming up
> +        soon and is state that needs to be reset before any following
> +        cancellable barrier is called.  */
> +      if (prev & BAR_CANCELLED)
> +       __atomic_fetch_and (&bar->generation,
> +                           ~BAR_SECONDARY_CANCELLABLE_ARRIVED,
> +                           MEMMODEL_RELAXED);
> +      futex_wake ((int *) &bar->generation, INT_MAX);
> +    }
> +  else
> +    {
> +      gomp_assert (orig_cgen == global_cgen,
> +                  "Id %u:  Original flag %u != generation of %u\n",
> id,
> +                  orig_cgen, global_cgen);
> +    }
> +}
> +
> +bool
> +gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *bar, unsigned
> id,
> +                                     gomp_barrier_state_t state)
> +{
> +  gomp_assert (id == 0, "Calling ensure_cancel_last in thread %u\n",
> id);
> +  unsigned gstate = state & BAR_BOTH_GENS_MASK;
> +  unsigned tstate = state & BAR_CANCEL_GEN_MASK;
> +  struct thread_lock_data *arr = bar->threadgens;
> +  for (unsigned i = 1; i < bar->total; i++)
> +    {
> +      unsigned long long j, count = spin_count ();
> +
> +    wait_on_this_thread:
> +      for (j = 0; j <= count; j++)
> +       {
> +         unsigned int threadgen
> +           = __atomic_load_n (&arr[i].cgen, MEMMODEL_ACQUIRE);
> +         /* Clear "overrun" bits -- spillover into the non-
> cancellable
> +            generation numbers that we are leaving around in order to
> avoid
> +            having to perform extra memory operations in this
> barrier.   */
> +         threadgen &= (BAR_CANCEL_GEN_MASK | BAR_FLAGS_MASK);
> +
> +         if (__builtin_expect (threadgen != tstate, 0)
> +             && __builtin_expect (!(threadgen & PRIMARY_WAITING_TG),
> 1))
> +           {
> +             gomp_assert (threadgen == BAR_INCREMENT_CANCEL (tstate),
> +                          "Thread-local state seen to be %u"
> +                          " when global state was %u.\n",
> +                          threadgen, tstate);
> +             goto wait_on_next_thread;
> +           }
> +         unsigned int gen
> +           = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
> +         if (__builtin_expect (gen != gstate, 0))
> +           {
> +             gomp_assert (
> +               !(gen & BAR_WAITING_FOR_TASK),
> +               "BAR_WAITING_FOR_TASK set in non-primary thread
> gen=%d", gen);
> +             gomp_assert (!gomp_barrier_state_is_incremented (gen,
> gstate,
> +
> BAR_CANCEL_INCR),
> +                          "Global state incremented by non-primary
> thread "
> +                          "gstate=%d gen=%d",
> +                          gstate, gen);
> +             if (gen & BAR_CANCELLED)
> +               {
> +                 if (__builtin_expect (threadgen &
> PRIMARY_WAITING_TG, 0))
> +                   __atomic_fetch_and (&arr[i].cgen,
> ~PRIMARY_WAITING_TG,
> +                                       MEMMODEL_RELAXED);
> +                 /* Don't check for BAR_SECONDARY_CANCELLABLE_ARRIVED
> here.
> +                    There are too many windows for race conditions if
> +                    resetting here (essentially can't guarantee that
> we'll
> +                    catch it so appearing like we might is just
> confusing).
> +                    Instead the primary thread has to ignore that
> flag in the
> +                    next barrier (which will be a non-cancellable
> barrier),
> +                    and we'll eventually reset it either in the
> thread that
> +                    set `BAR_CANCELLED` or in the thread that set
> +                    `BAR_SECONDARY_CANCELLABLE_ARRIVED`.  */
> +                 return false;
> +               }
> +             if (gen & BAR_SECONDARY_CANCELLABLE_ARRIVED)
> +               {
> +                 __atomic_fetch_and (&bar->generation,
> +
> ~BAR_SECONDARY_CANCELLABLE_ARRIVED,
> +                                     MEMMODEL_RELAXED);
> +                 gomp_assert (
> +                   (threadgen
> +                    == (BAR_INCREMENT_CANCEL (tstate) |
> PRIMARY_WAITING_TG))
> +                     || (threadgen == (tstate | PRIMARY_WAITING_TG)),
> +                   "Thread %d local generation is %d but expected"
> +                   " PRIMARY_WAITING_TG set because bar->generation"
> +                   " marked with SECONDARY (%d)",
> +                   i, threadgen, gen);
> +                 __atomic_fetch_and (&arr[i].cgen,
> ~PRIMARY_WAITING_TG,
> +                                     MEMMODEL_RELAXED);
> +               }
> +
> +             if (gen & BAR_TASK_PENDING)
> +               gomp_barrier_handle_tasks (gstate, false);
> +             goto wait_on_this_thread;
> +           }
> +       }
> +      futex_waitv ((int *) &arr[i].cgen, tstate, (int *) &bar-
> >generation,
> +                  gstate);
> +      goto wait_on_this_thread;
> +
> +    wait_on_next_thread:
> +      continue;
> +    }
> +  gomp_assert_seenflags (bar, true);
> +  return true;
>  }
> 
>  bool
>  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
> -                                  gomp_barrier_state_t state)
> +                                  gomp_barrier_state_t state,
> unsigned id)
>  {
>    unsigned int generation, gen;
> +  gomp_assert (
> +    !(state & BAR_CANCELLED),
> +    "gomp_team_barrier_wait_cancel_end called when barrier cancelled
> state: %u",
> +    state);
> 
>    if (__builtin_expect (state & BAR_WAS_LAST, 0))
>      {
> -      /* Next time we'll be awaiting TOTAL threads again.  */
> -      /* BAR_CANCELLED should never be set in state here, because
> -        cancellation means that at least one of the threads has been
> -        cancelled, thus on a cancellable barrier we should never see
> -        all threads to arrive.  */
>        struct gomp_thread *thr = gomp_thread ();
>        struct gomp_team *team = thr->ts.team;
> -
> -      bar->awaited = bar->total;
>        team->work_share_cancelled = 0;
>        unsigned task_count
>         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (task_count, 0))
>         {
> -         gomp_barrier_handle_tasks (state);
> +         gomp_barrier_handle_tasks (state, true);
>           state &= ~BAR_WAS_LAST;
>         }
>        else
>         {
> -         state += BAR_INCR - BAR_WAS_LAST;
> +         state &= ~BAR_WAS_LAST;
> +         state = BAR_INCREMENT_CANCEL (state);
>           __atomic_store_n (&bar->generation, state,
> MEMMODEL_RELEASE);
>           futex_wake ((int *) &bar->generation, INT_MAX);
>           return false;
>         }
>      }
> 
> -  if (__builtin_expect (state & BAR_CANCELLED, 0))
> -    return true;
> -
>    generation = state;
>    do
>      {
>        do_wait ((int *) &bar->generation, generation);
>        gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>        if (__builtin_expect (gen & BAR_CANCELLED, 0))
> -       return true;
> +       {
> +         if (__builtin_expect ((gen & BAR_CANCEL_GEN_MASK)
> +                                 != (state & BAR_CANCEL_GEN_MASK),
> +                               0))
> +           {
> +             /* Have cancelled a barrier just after completing the
> current
> +                one.  We must not reset our local state.  */
> +             gomp_assert (
> +               (gen & BAR_CANCEL_GEN_MASK)
> +                 == (BAR_INCREMENT_CANCEL (state) &
> BAR_CANCEL_GEN_MASK),
> +               "Incremented global generation (cancellable) more than
> one"
> +               " gen: %u  original state: %u",
> +               gen, state);
> +             /* Other threads could have continued on in between the
> time that
> +                the primary thread signalled all other threads should
> wake up
> +                and the time that we actually read `bar->generation`
> above.
> +                Any one of them could be performing tasks, also they
> could be
> +                waiting on the next barrier and could have set
> +                BAR_SECONDARY{_CANCELLABLE,}_ARRIVED.
> +
> +                W.r.t. the task handling that could have been set
> (primary
> +                thread increments generation telling us to go, then
> it
> +                continues itself and finds an `omp task` directive
> and
> +                schedules a task) we need to move on with it
> (PR122314).  */
> +             gomp_assert (!(gen & BAR_WAITING_FOR_TASK),
> +                          "Generation incremented while "
> +                          " main thread is still waiting for tasks:
> gen: %u  "
> +                          "original state: %u",
> +                          gen, state);
> +             /* *This* barrier wasn't cancelled -- the next barrier
> is
> +                cancelled.  Returning `false` gives that information
> back up
> +                to calling functions.  */
> +             return false;
> +           }
> +         /* Need to reset our thread-local generation.  Don't want
> state to be
> +            messed up the next time we hit a cancellable barrier.
> +            Must do this atomically because the cancellation signal
> could
> +            happen from "somewhere else" while the primary thread has
> decided
> +            that it wants to wait on us -- if we're using the
> fallback for
> +            pre-5.16 Linux kernels.
> +
> +            We are helped by the invariant that when this barrier is
> +            cancelled, the next barrier that will be entered is a
> +            non-cancellable barrier.  That means we don't have to
> worry about
> +            the primary thread getting confused by this generation
> being
> +            incremented to say it's reached a cancellable barrier.
> +
> +            We do not reset the PRIMARY_WAITING_TG bit.  That is left
> to the
> +            primary thread.  We only subtract the BAR_CANCEL_INCR
> that we
> +            added before getting here.
> +
> +            N.b. Another approach to avoid problems with multiple
> threads
> +            modifying this thread-local generation could be for the
> primary
> +            thread to reset the thread-local generations once it's
> "gathered"
> +            all threads during the next non-cancellable barrier.
> Such a reset
> +            would not need to be atomic because we would know that
> all threads
> +            have already acted on their thread-local generation
> number.
> +
> +            That approach mirrors the previous approach where the
> primary
> +            thread would reset `awaited`.  The problem with this is
> that now
> +            we have *many* generations to reset, the primary thread
> can be the
> +            bottleneck in a barrier (it performs the most work) and
> putting
> +            more work into the primary thread while all secondary
> threads are
> +            waiting on it seems problematic.  Moreover outside of the
> +            futex_waitv_fallback the primary thread does not adjust
> the
> +            thread-local generations.  Maintaining that property
> where
> +            possible seems very worthwhile.  */
> +         unsigned orig __attribute__ ((unused))
> +         = __atomic_fetch_sub (&bar->threadgens[id].cgen,
> BAR_CANCEL_INCR,
> +                               MEMMODEL_RELAXED);
> +#if _LIBGOMP_CHECKING_
> +         unsigned orig_gen = (orig & BAR_CANCEL_GEN_MASK);
> +         unsigned global_gen_plus1
> +           = ((gen + BAR_CANCEL_INCR) & BAR_CANCEL_GEN_MASK);
> +         gomp_assert (orig_gen == global_gen_plus1
> +                        || orig_gen == (global_gen_plus1 |
> PRIMARY_WAITING_TG),
> +                      "Thread-local generation %u unknown
> modification:"
> +                      " expected %u (with possible PRIMARY* flags)
> seen %u",
> +                      id, (gen + BAR_CANCEL_INCR) &
> BAR_CANCEL_GEN_MASK, orig);
> +#endif
> +         return true;
> +       }
>        if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
>         {
> -         gomp_barrier_handle_tasks (state);
> +         gomp_barrier_handle_tasks (state, true);
>           gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>         }
>        generation |= gen & BAR_WAITING_FOR_TASK;
>      }
> -  while (!gomp_barrier_state_is_incremented (gen, state));
> +  while (!gomp_barrier_state_is_incremented (gen, state, true));
> 
>    return false;
>  }
> 
>  bool
> -gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
> +gomp_team_barrier_wait_cancel (gomp_barrier_t *bar, unsigned id)
>  {
> -  return gomp_team_barrier_wait_cancel_end (bar,
> gomp_barrier_wait_start (bar));
> +  gomp_barrier_state_t state = gomp_barrier_wait_cancel_start (bar,
> id);
> +
> +  if (__builtin_expect (state & BAR_CANCELLED, 0))
> +    return true;
> +
> +  if (__builtin_expect (state & BAR_WAS_LAST, 0)
> +      && !gomp_team_barrier_ensure_cancel_last (bar, id, state))
> +    {
> +      gomp_reset_cancellable_primary_threadgen (bar, id);
> +      return true;
> +    }
> +  return gomp_team_barrier_wait_cancel_end (bar, state, id);
>  }
> 
>  void
>  gomp_team_barrier_cancel (struct gomp_team *team)
>  {
> -  gomp_mutex_lock (&team->task_lock);
> -  if (team->barrier.generation & BAR_CANCELLED)
> -    {
> -      gomp_mutex_unlock (&team->task_lock);
> -      return;
> -    }
> -  team->barrier.generation |= BAR_CANCELLED;
> -  gomp_mutex_unlock (&team->task_lock);
> -  futex_wake ((int *) &team->barrier.generation, INT_MAX);
> +  /* Always set CANCEL on the barrier.  Means we have one simple
> atomic
> +     operation.  Need an atomic operation because the barrier now
> uses the
> +     `generation` to communicate.
> +
> +     Don't need to add any memory-ordering here.  The only thing that
> +     BAR_CANCELLED signifies is that the barrier is cancelled and the
> only
> +     thing that BAR_SECONDARY_CANCELLABLE_ARRIVED signifies is that
> the
> +     secondary thread has arrived at the barrier.  No thread infers
> anything
> +     about any other data having been set based on these flags.  */
> +  unsigned orig = __atomic_fetch_or (&team->barrier.generation,
> BAR_CANCELLED,
> +                                    MEMMODEL_RELAXED);
> +  /* We're cancelling a barrier and it's currently using the fallback
> mechanism
> +     instead of the futex_waitv syscall.  We need to ensure that
> state gets
> +     reset before the next cancellable barrier.
> +
> +     Any cancelled barrier is followed by a non-cancellable barrier
> so just
> +     ensuring this reset happens in one thread allows us to ensure
> that it
> +     happens before any thread reaches the next cancellable barrier.
> +
> +     This and the check in `gomp_assert_and_increment_cancel_flag`
> are the two
> +     ways in which we ensure `BAR_SECONDARY_CANCELLABLE_ARRIVED` is
> reset by
> +     the time that any thread arrives at the next cancellable
> barrier.
> +
> +     We need both since that `BAR_SECONDARY_CANCELLABLE_ARRIVED` flag
> could be
> +     set *just* after this `BAR_CANCELLED` bit gets set (in which
> case the
> +     other case handles resetting) or before (in which case this
> clause handles
> +     resetting).  */
> +  if (orig & BAR_SECONDARY_CANCELLABLE_ARRIVED)
> +    __atomic_fetch_and (&team->barrier.generation,
> +                       ~BAR_SECONDARY_CANCELLABLE_ARRIVED,
> MEMMODEL_RELAXED);
> +  if (!(orig & BAR_CANCELLED))
> +    futex_wake ((int *) &team->barrier.generation, INT_MAX);
>  }
> diff --git a/libgomp/config/linux/bar.h b/libgomp/config/linux/bar.h
> index faa03746d8f..dbd4b868418 100644
> --- a/libgomp/config/linux/bar.h
> +++ b/libgomp/config/linux/bar.h
> @@ -32,14 +32,31 @@
> 
>  #include "mutex.h"
> 
> +/* It is handy to keep `cgen` and `gen` separate, since then we can use
> +   `__atomic_fetch_add` directly on `cgen` instead of needing a CAS loop
> +   to increment the cancel generation bits.  If it weren't for the
> +   fallback used when `futex_waitv` is not available this wouldn't
> +   matter, but with that fallback more than one thread can adjust these
> +   thread-local generation numbers and hence we have to be concerned
> +   about synchronisation.  */
> +struct __attribute__ ((aligned (64))) thread_lock_data
> +{
> +  unsigned gen;
> +  unsigned cgen;
> +};
> +
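
A quick illustration of the point the comment above makes (my sketch, not
part of the patch; `combined', `bar' and `id' are placeholder names): with
the split fields the cancel counter can be bumped with a single RMW,
whereas a combined word would force a CAS loop to wrap only the cancel
bits.

    /* Split fields: one relaxed RMW per arrival.  */
    __atomic_fetch_add (&bar->threadgens[id].cgen, BAR_CANCEL_INCR,
                        MEMMODEL_RELAXED);

    /* Hypothetical combined word: the cancel bits must wrap without
       touching the main generation bits, so a CAS loop is needed.  */
    unsigned prev = __atomic_load_n (&combined, MEMMODEL_RELAXED);
    unsigned next;
    do
      next = BAR_INCREMENT_CANCEL (prev);
    while (!__atomic_compare_exchange_n (&combined, &prev, next, false,
                                         MEMMODEL_RELAXED,
                                         MEMMODEL_RELAXED));
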
>  typedef struct
>  {
>    /* Make sure total/generation is in a mostly read cacheline, while
> -     awaited in a separate cacheline.  */
> +     awaited in a separate cacheline.  Each generation structure is
> in a
> +     separate cache line too.  Put both cancellable and non-
> cancellable
> +     generation numbers in the same cache line because they should
> both be
> +     only ever modified by their corresponding thread (except in the
> case of
> +     the primary thread wanting to wait on a given thread arriving at
> the
> +     barrier and we're on an old Linux kernel).  */
>    unsigned total __attribute__((aligned (64)));
> +  unsigned allocated;
>    unsigned generation;
> -  unsigned awaited __attribute__((aligned (64)));
> -  unsigned awaited_final;
> +  struct thread_lock_data *threadgens;
>  } gomp_barrier_t;
> 
>  typedef unsigned int gomp_barrier_state_t;
> @@ -48,140 +65,416 @@ typedef unsigned int gomp_barrier_state_t;
>     low bits dedicated to flags.  Note that TASK_PENDING and WAS_LAST
> can
>     share space because WAS_LAST is never stored back to generation.
> */
>  #define BAR_TASK_PENDING       1
> +/* In this particular target BAR_WAS_LAST indicates something more like
> +   "chosen by design to be last", but I like keeping the macro name the
> +   same as in other targets.  */
>  #define BAR_WAS_LAST           1
>  #define BAR_WAITING_FOR_TASK   2
>  #define BAR_CANCELLED          4
> -#define BAR_INCR               8
> +/* BAR_SECONDARY_ARRIVED and PRIMARY_WAITING_TG flags are only used for
> +   the fallback approach when `futex_waitv` is not available.  That
> +   syscall is available on Linux 5.16 and newer kernels.  */
> +#define BAR_SECONDARY_ARRIVED 8
> +#define BAR_SECONDARY_CANCELLABLE_ARRIVED 16
> +/* Bits 5 to 10 hold the generation number of cancellable barriers and
> +   the remaining high bits hold the generation number of non-cancellable
> +   barriers.  */
> +#define BAR_CANCEL_INCR 32
> +#define BAR_INCR 2048
> +#define BAR_FLAGS_MASK (~(-BAR_CANCEL_INCR))
> +#define BAR_GEN_MASK (-BAR_INCR)
> +#define BAR_BOTH_GENS_MASK (-BAR_CANCEL_INCR)
> +#define BAR_CANCEL_GEN_MASK (-BAR_CANCEL_INCR & (~(-BAR_INCR)))
> +/* Increment by BAR_CANCEL_INCR, with wrapping arithmetic confined to
> +   the bits assigned to this generation number.  I.e. increment, then
> +   set the bits above the cancel generation back to what they were
> +   before.  */
> +#define BAR_INCREMENT_CANCEL(X)
> \
> +  ({
> \
> +    __typeof__ (X) _X = (X);
> \
> +    (((_X + BAR_CANCEL_INCR) & BAR_CANCEL_GEN_MASK)
> \
> +     | (_X & ~BAR_CANCEL_GEN_MASK));
> \
> +  })
> +/* The thread-local generation field similarly contains a counter in
> the high
> +   bits and has a few low bits dedicated to flags.  None of the flags
> above are
> +   used in the thread-local generation field.  Hence we can have a
> different
> +   set of bits for a protocol between the primary thread and the
> secondary
> +   threads.  */
> +#define PRIMARY_WAITING_TG 1
> 
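
Spelling out the layout that the constants above produce, as a
stand-alone check (not part of the patch; assumes the usual 32-bit
int/unsigned):

    #include <assert.h>

    #define BAR_CANCEL_INCR 32      /* cancel generation: bits 5..10 */
    #define BAR_INCR 2048           /* main generation: bits 11..31 */
    #define BAR_FLAGS_MASK (~(-BAR_CANCEL_INCR))
    #define BAR_GEN_MASK (-BAR_INCR)
    #define BAR_CANCEL_GEN_MASK (-BAR_CANCEL_INCR & (~(-BAR_INCR)))

    int
    main (void)
    {
      /* Flags occupy bits 0..4, the cancel generation bits 5..10 and the
         non-cancellable generation bits 11..31; together they cover the
         whole word with no overlap.  */
      assert (BAR_FLAGS_MASK == 0x1f);
      assert (BAR_CANCEL_GEN_MASK == 0x7e0);
      assert ((unsigned) BAR_GEN_MASK == 0xfffff800u);
      assert ((BAR_FLAGS_MASK | BAR_CANCEL_GEN_MASK | BAR_GEN_MASK) == -1);
      return 0;
    }
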
> -static inline void gomp_barrier_init (gomp_barrier_t *bar, unsigned
> count)
> +static inline void
> +gomp_assert_seenflags (gomp_barrier_t *bar, bool cancellable)
>  {
> +#if _LIBGOMP_CHECKING_
> +  unsigned gen = __atomic_load_n (&bar->generation,
> MEMMODEL_RELAXED);
> +  struct thread_lock_data *arr = bar->threadgens;
> +  unsigned cancel_incr = cancellable ? BAR_CANCEL_INCR : 0;
> +  unsigned incr = cancellable ? 0 : BAR_INCR;
> +  /* Assert that all threads have been seen.  */
> +  for (unsigned i = 0; i < bar->total; i++)
> +    {
> +      gomp_assert (arr[i].gen == (gen & BAR_GEN_MASK) + incr,
> +                  "Index %u generation is %u (global is %u)\n", i,
> arr[i].gen,
> +                  gen);
> +      gomp_assert ((arr[i].cgen & BAR_CANCEL_GEN_MASK)
> +                    == ((gen + cancel_incr) & BAR_CANCEL_GEN_MASK),
> +                  "Index %u cancel generation is %u (global is
> %u)\n", i,
> +                  arr[i].cgen, gen);
> +    }
> +
> +  /* Assert that generation numbers not corresponding to any thread
> are
> +     cleared.  This helps us test code-paths.  */
> +  for (unsigned i = bar->total; i < bar->allocated; i++)
> +    {
> +      gomp_assert (arr[i].gen == 0,
> +                  "Index %u gen should be 0.  Is %u (global gen is
> %u)\n", i,
> +                  arr[i].gen, gen);
> +      gomp_assert (arr[i].cgen == 0,
> +                  "Index %u gen should be 0.  Is %u (global gen is
> %u)\n", i,
> +                  arr[i].cgen, gen);
> +    }
> +#endif
> +}
> +
> +static inline void
> +gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
> +{
> +  bar->threadgens
> +    = gomp_aligned_alloc (64, sizeof (bar->threadgens[0]) * count);
> +  for (unsigned i = 0; i < count; ++i)
> +    {
> +      bar->threadgens[i].gen = 0;
> +      bar->threadgens[i].cgen = 0;
> +    }
>    bar->total = count;
> -  bar->awaited = count;
> -  bar->awaited_final = count;
> +  bar->allocated = count;
>    bar->generation = 0;
>  }
> 
> -static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned
> count)
> +static inline bool
> +gomp_barrier_has_space (gomp_barrier_t *bar, unsigned nthreads)
>  {
> -  __atomic_add_fetch (&bar->awaited, count - bar->total,
> MEMMODEL_ACQ_REL);
> -  bar->total = count;
> +  return nthreads <= bar->allocated;
> +}
> +
> +static inline void
> +gomp_barrier_minimal_reinit (gomp_barrier_t *bar, unsigned nthreads,
> +                            unsigned num_new_threads)
> +{
> +  /* We are just increasing the number of threads by appending logically
> +     used threads at "the end" of the team.  That essentially means more
> +     of the `bar->threadgens` array needs to be logically used.  We set
> +     the new entries to the current `generation` (marking that they are
> +     yet to hit this generation).
> +
> +     This function is only called after we have checked that there is
> +     enough space in this barrier for the number of threads we want
> +     using it.  Hence no serialisation is needed.  */
> +  gomp_assert (nthreads <= bar->allocated,
> +              "minimal reinit on barrier with not enough space: "
> +              "%u > %u",
> +              nthreads, bar->allocated);
> +  unsigned gen = bar->generation & BAR_GEN_MASK;
> +  unsigned cancel_gen = bar->generation & BAR_CANCEL_GEN_MASK;
> +  gomp_assert (bar->total == nthreads - num_new_threads,
> +              "minimal_reinit called with incorrect state: %u != %u -
> %u\n",
> +              bar->total, nthreads, num_new_threads);
> +  for (unsigned i = bar->total; i < nthreads; i++)
> +    {
> +      bar->threadgens[i].gen = gen;
> +      bar->threadgens[i].cgen = cancel_gen;
> +    }
> +  bar->total = nthreads;
> +}
> +
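
A concrete, made-up example of the above: growing a team from 4 to 6
threads just marks the two appended slots as not yet arrived at the
current generation.

    /* bar->total == 4, bar->allocated >= 6, num_new_threads == 2.  */
    gomp_barrier_minimal_reinit (bar, 6, 2);
    /* Now bar->threadgens[4].gen == bar->threadgens[5].gen
       == (bar->generation & BAR_GEN_MASK), and bar->total == 6.  */
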
> +/* When re-initialising a barrier we know the following:
> +   1) We are waiting on a non-cancellable barrier.
> +   2) The cancel generation bits are known consistent (having been
> tidied up by
> +      each individual thread if the barrier got cancelled).  */
> +static inline void
> +gomp_barrier_reinit_1 (gomp_barrier_t *bar, unsigned nthreads,
> +                      unsigned num_new_threads, unsigned long long
> *new_ids)
> +{
> +#if _LIBGOMP_CHECKING_
> +  /* Assertions that this barrier is in a sensible state.
> +     Everything waiting on the standard barrier.
> +     Current thread has not registered itself as arrived, but we
> tweak for the
> +     current assertions.  */
> +  bar->threadgens[0].gen += BAR_INCR;
> +  gomp_assert_seenflags (bar, false);
> +  bar->threadgens[0].gen -= BAR_INCR;
> +  struct thread_lock_data threadgen_zero = bar->threadgens[0];
> +#endif
> +  if (!gomp_barrier_has_space (bar, nthreads))
> +    {
> +      /* `realloc` was considered but not chosen.  Pros/cons below.
> +        Pros of using `realloc`:
> +        - It may not actually have to move memory.
> +        Cons of using `realloc`:
> +        - If it does have to move memory, it *also* copies the data, and
> +          we are going to overwrite the data in this function anyway.
> +          That copy would be a waste.
> +        - If it does have to move memory the pointer may no longer be
> +          aligned.  We would need bookkeeping for "pointer to free" and
> +          "pointer holding the data".
> +        The "bad" case of `realloc` is made even worse by what we need
> +        here.  One would have to benchmark to figure out whether using
> +        `realloc` is best.  Since we shouldn't be re-allocating very
> +        often I'm choosing the simplest code rather than the most
> +        optimal.
> +
> +        It does not matter that existing threads may be waiting on this
> +        barrier.  They are all waiting on bar->generation and their
> +        thread-local generation will not be looked at.  */
> +      gomp_aligned_free (bar->threadgens);
> +      bar->threadgens
> +       = gomp_aligned_alloc (64, sizeof (bar->threadgens[0]) *
> nthreads);
> +      bar->allocated = nthreads;
> +    }
> +
> +  /* Re-initialise the existing values.  */
> +  unsigned curgen = bar->generation & BAR_GEN_MASK;
> +  unsigned cancel_curgen = bar->generation & BAR_CANCEL_GEN_MASK;
> +  unsigned iter_len = nthreads;
> +  unsigned bits_per_ull = sizeof (unsigned long long) * CHAR_BIT;
> +#if _LIBGOMP_CHECKING_
> +  /* If checking, zero out everything that's not going to be used in
> +     this team.  This is only helpful for debugging (other assertions
> +     later can ensure that we've gone through this path for adjusting
> +     the number of threads, and when viewing the data structure in GDB
> +     we can easily identify which generation numbers are in use).  When
> +     not running assertions or in the debugger these extra numbers are
> +     simply not used.  */
> +  iter_len = bar->allocated;
> +  /* In the checking build just unconditionally reinitialise.  This
> +     handles the case where the memory has moved, and is otherwise
> +     harmless (except for performance, which the checking build doesn't
> +     care about).  */
> +  bar->threadgens[0] = threadgen_zero;
> +#endif
> +  for (unsigned i = 1; i < iter_len; i++)
> +    {
> +      /* Re-initialisation.  Zero out the "remaining" elements in our
> wake flag
> +        array when _LIBGOMP_CHECKING_ as a helper for our assertions
> to check
> +        validity.  Set thread-specific generations to "seen" for `i's
> +        corresponding to re-used threads, set thread-specific
> generations to
> +        "not yet seen" for `i's corresponding to threads about to be
> +        spawned.  */
> +      unsigned newthr_val = i < nthreads ? curgen : 0;
> +      unsigned newthr_cancel_val = i < nthreads ? cancel_curgen : 0;
> +      unsigned index = i / bits_per_ull;
> +      unsigned long long bitmask = (1ULL << (i % bits_per_ull));
> +      bool bit_is_set = ((new_ids[index] & bitmask) != 0);
> +      bar->threadgens[i].gen = bit_is_set ? curgen + BAR_INCR :
> newthr_val;
> +      /* This is different because we only ever call this function
> while threads
> +        are waiting on a non-cancellable barrier.  Hence "which
> threads have
> +        arrived and which will be newly spawned" is not a question.
> */
> +      bar->threadgens[i].cgen = newthr_cancel_val;
> +    }
> +  bar->total = nthreads;
>  }
> 
>  static inline void gomp_barrier_destroy (gomp_barrier_t *bar)
>  {
> +  gomp_aligned_free (bar->threadgens);
>  }
> 
> -extern void gomp_barrier_wait (gomp_barrier_t *);
> -extern void gomp_barrier_wait_last (gomp_barrier_t *);
> -extern void gomp_barrier_wait_end (gomp_barrier_t *,
> gomp_barrier_state_t);
> -extern void gomp_team_barrier_wait (gomp_barrier_t *);
> -extern void gomp_team_barrier_wait_final (gomp_barrier_t *);
> -extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
> -                                       gomp_barrier_state_t);
> -extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
> +static inline void
> +gomp_barrier_reinit_2 (gomp_barrier_t __attribute__ ((unused)) * bar,
> +                      unsigned __attribute__ ((unused)) nthreads) {}
> +extern void gomp_barrier_wait (gomp_barrier_t *, unsigned);
> +extern void gomp_barrier_wait_last (gomp_barrier_t *, unsigned);
> +extern void gomp_barrier_wait_end (gomp_barrier_t *,
> gomp_barrier_state_t,
> +                                  unsigned);
> +extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned);
> +extern void gomp_team_barrier_wait_final (gomp_barrier_t *,
> unsigned);
> +extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
> gomp_barrier_state_t,
> +                                       unsigned);
> +extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *,
> unsigned);
>  extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
> -                                              gomp_barrier_state_t);
> +                                              gomp_barrier_state_t,
> unsigned);
>  extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
>  struct gomp_team;
>  extern void gomp_team_barrier_cancel (struct gomp_team *);
> +extern void gomp_team_barrier_ensure_last (gomp_barrier_t *,
> unsigned,
> +                                          gomp_barrier_state_t);
> +extern void gomp_barrier_ensure_last (gomp_barrier_t *, unsigned,
> +                                     gomp_barrier_state_t);
> +extern bool gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *,
> unsigned,
> +
> gomp_barrier_state_t);
> +extern void gomp_assert_and_increment_flag (gomp_barrier_t *,
> unsigned,
> +                                           unsigned);
> +extern void gomp_assert_and_increment_cancel_flag (gomp_barrier_t *,
> unsigned,
> +                                                  unsigned);
> 
>  static inline gomp_barrier_state_t
> -gomp_barrier_wait_start (gomp_barrier_t *bar)
> +gomp_barrier_wait_start (gomp_barrier_t *bar, unsigned id)
>  {
> +  /* TODO I don't believe this MEMMODEL_ACQUIRE is needed.
> +     Look into it later.  Point being that this should only ever read
> a value
> +     from last barrier or from tasks/cancellation/etc.  There was
> already an
> +     acquire-release ordering at exit of the last barrier, all
> setting of
> +     tasks/cancellation etc are done with RELAXED memory model =>
> using ACQUIRE
> +     doesn't help.
> +
> +     See corresponding comment in `gomp_team_barrier_cancel` when
> thinking
> +     about this.  */
>    unsigned int ret = __atomic_load_n (&bar->generation,
> MEMMODEL_ACQUIRE);
> -  ret &= -BAR_INCR | BAR_CANCELLED;
> -  /* A memory barrier is needed before exiting from the various forms
> -     of gomp_barrier_wait, to satisfy OpenMP API version 3.1 section
> -     2.8.6 flush Construct, which says there is an implicit flush
> during
> -     a barrier region.  This is a convenient place to add the
> barrier,
> -     so we use MEMMODEL_ACQ_REL here rather than MEMMODEL_ACQUIRE.
> */
> -  if (__atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0)
> +  ret &= (BAR_BOTH_GENS_MASK | BAR_CANCELLED);
> +#if !_LIBGOMP_CHECKING_
> +  if (id != 0)
> +#endif
> +    /* Increment local flag.  For thread id 0 this doesn't
> communicate
> +       anything to *other* threads, but it is useful for debugging
> purposes.  */
> +    gomp_assert_and_increment_flag (bar, id, ret);
> +
> +  if (id == 0)
>      ret |= BAR_WAS_LAST;
>    return ret;
>  }
> 
>  static inline gomp_barrier_state_t
> -gomp_barrier_wait_cancel_start (gomp_barrier_t *bar)
> -{
> -  return gomp_barrier_wait_start (bar);
> -}
> -
> -/* This is like gomp_barrier_wait_start, except it decrements
> -   bar->awaited_final rather than bar->awaited and should be used
> -   for the gomp_team_end barrier only.  */
> -static inline gomp_barrier_state_t
> -gomp_barrier_wait_final_start (gomp_barrier_t *bar)
> +gomp_barrier_wait_cancel_start (gomp_barrier_t *bar, unsigned id)
>  {
>    unsigned int ret = __atomic_load_n (&bar->generation,
> MEMMODEL_ACQUIRE);
> -  ret &= -BAR_INCR | BAR_CANCELLED;
> -  /* See above gomp_barrier_wait_start comment.  */
> -  if (__atomic_add_fetch (&bar->awaited_final, -1, MEMMODEL_ACQ_REL)
> == 0)
> +  ret &= BAR_BOTH_GENS_MASK | BAR_CANCELLED;
> +  if (!(ret & BAR_CANCELLED)
> +#if !_LIBGOMP_CHECKING_
> +      && id != 0
> +#endif
> +  )
> +    gomp_assert_and_increment_cancel_flag (bar, id, ret);
> +  if (id == 0)
>      ret |= BAR_WAS_LAST;
>    return ret;
>  }
> 
> +static inline void
> +gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar,
> unsigned id)
> +{
> +#if _LIBGOMP_CHECKING_
> +  gomp_assert (id == 0,
> +              "gomp_reset_cancellable_primary_threadgen called with "
> +              "non-primary thread id: %u",
> +              id);
> +  unsigned orig = __atomic_fetch_sub (&bar->threadgens[id].cgen,
> +                                     BAR_CANCEL_INCR,
> MEMMODEL_RELAXED);
> +  unsigned orig_gen = (orig & BAR_CANCEL_GEN_MASK);
> +  unsigned gen = __atomic_load_n (&bar->generation,
> MEMMODEL_RELAXED);
> +  unsigned global_gen_plus1 = ((gen + BAR_CANCEL_INCR) &
> BAR_CANCEL_GEN_MASK);
> +  gomp_assert (orig_gen == global_gen_plus1,
> +              "Thread-local generation %u unknown modification:"
> +              " expected %u seen %u",
> +              id, global_gen_plus1, orig);
> +#endif
> +}
> +
>  static inline bool
>  gomp_barrier_last_thread (gomp_barrier_state_t state)
>  {
>    return state & BAR_WAS_LAST;
>  }
> 
> -/* All the inlines below must be called with team->task_lock
> -   held.  */
> +/* All the inlines below must be called with team->task_lock held.
> However
> +   with the `futex_waitv` fallback there can still be contention on
> +   `bar->generation`.  For the RMW operations it is obvious that we
> need to
> +   perform these atomically.  For the load in
> `gomp_barrier_has_completed` and
> +   `gomp_team_barrier_cancelled` the need is tied to the C
> specification.
> +
> +   On the architectures I have some grasp of (x86_64 and AArch64) there
> +   would be no real downside to setting a bit in one thread while
> +   reading in another.  However the C definition of a data race doesn't
> +   have any such leeway, and to avoid UB we need to load atomically.  */
> 
>  static inline void
>  gomp_team_barrier_set_task_pending (gomp_barrier_t *bar)
>  {
> -  bar->generation |= BAR_TASK_PENDING;
> +  __atomic_fetch_or (&bar->generation, BAR_TASK_PENDING,
> MEMMODEL_RELAXED);
>  }
> 
>  static inline void
>  gomp_team_barrier_clear_task_pending (gomp_barrier_t *bar)
>  {
> -  bar->generation &= ~BAR_TASK_PENDING;
> +  __atomic_fetch_and (&bar->generation, ~BAR_TASK_PENDING,
> MEMMODEL_RELAXED);
>  }
> 
>  static inline void
>  gomp_team_barrier_set_waiting_for_tasks (gomp_barrier_t *bar)
>  {
> -  bar->generation |= BAR_WAITING_FOR_TASK;
> +  __atomic_fetch_or (&bar->generation, BAR_WAITING_FOR_TASK,
> MEMMODEL_RELAXED);
>  }
> 
>  static inline bool
>  gomp_team_barrier_waiting_for_tasks (gomp_barrier_t *bar)
>  {
> -  return (bar->generation & BAR_WAITING_FOR_TASK) != 0;
> +  unsigned gen = __atomic_load_n (&bar->generation,
> MEMMODEL_RELAXED);
> +  return (gen & BAR_WAITING_FOR_TASK) != 0;
>  }
> 
>  static inline bool
>  gomp_team_barrier_cancelled (gomp_barrier_t *bar)
>  {
> -  return __builtin_expect ((bar->generation & BAR_CANCELLED) != 0,
> 0);
> +  unsigned gen = __atomic_load_n (&bar->generation,
> MEMMODEL_RELAXED);
> +  return __builtin_expect ((gen & BAR_CANCELLED) != 0, 0);
> +}
> +
> +static inline unsigned
> +gomp_increment_gen (gomp_barrier_state_t state, bool use_cancel)
> +{
> +  unsigned gens = (state & BAR_BOTH_GENS_MASK);
> +  return use_cancel ? BAR_INCREMENT_CANCEL (gens) : gens + BAR_INCR;
>  }
> 
>  static inline void
> -gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
> state)
> +gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
> state,
> +                       bool use_cancel)
>  {
>    /* Need the atomic store for acquire-release synchronisation with
> the
>       load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356
> */
> -  __atomic_store_n (&bar->generation, (state & -BAR_INCR) + BAR_INCR,
> -                   MEMMODEL_RELEASE);
> +  unsigned next = gomp_increment_gen (state, use_cancel);
> +  __atomic_store_n (&bar->generation, next, MEMMODEL_RELEASE);
>  }
> 
>  static inline bool
>  gomp_barrier_state_is_incremented (gomp_barrier_state_t gen,
> -                                  gomp_barrier_state_t state)
> +                                  gomp_barrier_state_t state, bool
> use_cancel)
>  {
> -  unsigned next_state = (state & -BAR_INCR) + BAR_INCR;
> -  return next_state > state ? gen >= next_state : gen < state;
> +  unsigned next = gomp_increment_gen (state, use_cancel);
> +  return next > state ? gen >= next : gen < state;
>  }
> 
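
A worked example of the wrap-around test above (my own numbers, assuming
32-bit unsigned):

    /* Non-cancellable case, main generation at its maximum value and all
       flag/cancel bits clear.  */
    state = 0xfffff800;
    next  = gomp_increment_gen (state, false);
    /* 0xfffff800 + BAR_INCR wraps to 0x00000000, so `next > state' is
       false and completion is detected by `gen < state', i.e. any
       bar->generation that has wrapped past zero counts as having been
       incremented.  */
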
>  static inline bool
> -gomp_barrier_has_completed (gomp_barrier_state_t state,
> gomp_barrier_t *bar)
> +gomp_barrier_has_completed (gomp_barrier_state_t state,
> gomp_barrier_t *bar,
> +                           bool use_cancel)
>  {
>    /* Handling overflow in the generation.  The "next" state could be
> less than
>       or greater than the current one.  */
> -  return gomp_barrier_state_is_incremented (bar->generation, state);
> +  unsigned curgen = __atomic_load_n (&bar->generation,
> MEMMODEL_RELAXED);
> +  return gomp_barrier_state_is_incremented (curgen, state,
> use_cancel);
> +}
> +
> +static inline void
> +gomp_barrier_prepare_reinit (gomp_barrier_t *bar, unsigned id)
> +{
> +  gomp_assert (id == 0,
> +              "gomp_barrier_prepare_reinit called in non-primary
> thread: %u",
> +              id);
> +  /* This use of `gomp_barrier_wait_start` is worth noting.
> +     1) We're running in `id == 0`, which means that without checking
> we'll
> +       essentially just load `bar->generation`.
> +     2) In this case there's no need to form any release-acquire
> ordering.  The
> +       `gomp_barrier_ensure_last` call below will form a release-
> acquire
> +       ordering between each secondary thread and this one, and that
> will be
> +       from some point after all uses of the barrier that we care
> about.
> +     3) However, in the checking builds, it's very useful to call
> +       `gomp_assert_and_increment_flag` in order to provide extra
> guarantees
> +       about what we're doing.  */
> +  gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
> +  gomp_barrier_ensure_last (bar, id, state);
> +#if _LIBGOMP_CHECKING_
> +  /* When checking, `gomp_assert_and_increment_flag` will have
> incremented the
> +     generation flag.  However later on down the line we'll be
> calling the full
> +     barrier again and we need to decrement that flag ready for that.
> We still
> +     *want* the flag to have been incremented above so that the
> assertions in
> +     `gomp_barrier_ensure_last` all work.
> +
> +     When not checking, this increment/decrement/increment again
> cycle is not
> +     performed.  */
> +  bar->threadgens[0].gen -= BAR_INCR;
> +#endif
>  }
> 
>  #endif /* GOMP_BARRIER_H */
> diff --git a/libgomp/config/linux/futex_waitv.h
> b/libgomp/config/linux/futex_waitv.h
> new file mode 100644
> index 00000000000..c9780b5e9f0
> --- /dev/null
> +++ b/libgomp/config/linux/futex_waitv.h
> @@ -0,0 +1,129 @@
> +/* Copyright The GNU Toolchain Authors.
> +
> +   This file is part of the GNU Offloading and Multi Processing
> Library
> +   (libgomp).
> +
> +   Libgomp is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your
> option)
> +   any later version.
> +
> +   Libgomp is distributed in the hope that it will be useful, but
> WITHOUT ANY
> +   WARRANTY; without even the implied warranty of MERCHANTABILITY or
> FITNESS
> +   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> +   more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception,
> version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License
> and
> +   a copy of the GCC Runtime Library Exception along with this
> program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not,
> see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Only define the interface that we need.  `futex_waitv` can take many
> +   more addresses but we only use two.  We define a fallback for when
> +   `futex_waitv` is not available, rather than defining another target
> +   in the config/ directory that would look much the same as the linux
> +   one except for the parts handling `futex_waitv`, `PRIMARY_WAITING_TG`,
> +   and `BAR_SECONDARY_{CANCELLABLE_,}ARRIVED`.  */
> +
> +#define _GNU_SOURCE
> +#include <sys/syscall.h>
> +
> +#ifdef SYS_futex_waitv
> +#pragma GCC visibility push(default)
> +
> +#include <unistd.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <time.h>
> +
> +#pragma GCC visibility pop
> +
> +struct futex_waitv
> +{
> +  uint64_t val;
> +  uint64_t uaddr;
> +  uint32_t flags;
> +  uint32_t __reserved;
> +};
> +
> +static inline void
> +futex_waitv (int *addr, int val, int *addr2, int val2)
> +{
> +  gomp_assert (gomp_thread ()->ts.team_id == 0,
> +              "Called futex_waitv from secondary thread %d\n",
> +              gomp_thread ()->ts.team_id);
> +  struct futex_waitv addrs[2];
> +  addrs[0].val = val;
> +  addrs[0].uaddr = (uint64_t) (uintptr_t) addr;
> +  /* Using same internally-defined flags as futex.h does.  These are
> defined in
> +     wait.h.  */
> +  addrs[0].flags = FUTEX_PRIVATE_FLAG | FUTEX_32;
> +  addrs[0].__reserved = 0;
> +  addrs[1].val = val2;
> +  addrs[1].uaddr = (uint64_t) (uintptr_t) addr2;
> +  addrs[1].flags = FUTEX_PRIVATE_FLAG | FUTEX_32;
> +  addrs[1].__reserved = 0;
> +  int err __attribute__ ((unused))
> +  = syscall (SYS_futex_waitv, addrs, 2, 0, NULL, CLOCK_MONOTONIC);
> +  /* If a signal woke us then we simply leave and let the loop
> outside of us
> +     handle it.  We never require knowledge about whether anything
> changed or
> +     not.  */
> +  gomp_assert (err >= 0 || errno == EAGAIN || errno == EINTR,
> +              "Failed with futex_waitv err = %d, message: %s", err,
> +              strerror (errno));
> +}
> +
> +#else
> +
> +static inline void
> +futex_waitv (int *addr, int val, int *addr2, int val2)
> +{
> +  int threadlocal
> +    = __atomic_fetch_or (addr, PRIMARY_WAITING_TG, MEMMODEL_RELAXED);
> +  /* futex_wait can be woken up because of a BAR_TASK_PENDING being set
> +     or the like.  In that case we might come back here after checking
> +     variables again.
> +     If the other thread arrived in between checking variables and
> +     coming back here, this variable could have been incremented.
> +     Hence the possible values are:
> +     - val
> +     - val + BAR_INCR
> +     - val | PRIMARY_WAITING_TG
> +     - (val + BAR_INCR) | PRIMARY_WAITING_TG
> +
> +     If the `PRIMARY_WAITING_TG` flag is set, then the trigger for
> "we can
> +     proceed" is now `BAR_SECONDARY_ARRIVED` being set on the
> generation
> +     number.  */
> +  if (__builtin_expect (threadlocal != val, 0)
> +      && !(threadlocal & PRIMARY_WAITING_TG))
> +    {
> +      /* The secondary thread reached this point before us.  We know the
> +        secondary will not modify this variable again until we've
> +        "released" it from this barrier.  Hence we can simply reset the
> +        thread-local variable and continue.
> +
> +        It's worth mentioning that this implementation interacts
> +        directly with what is handled in bar.c.  That's not a great
> +        separation of concerns.  I believe I need things that way, but
> +        it would be nice if I could make the separation neater.  ???
> +        That might also allow passing some information down about
> +        whether we're working on the cancellable or non-cancellable
> +        generation numbers.  Then we would be able to restrict the
> +        assertion below to the only value that is valid (for whichever
> +        set of generation numbers we have).  */
> +      gomp_assert (threadlocal == (val + BAR_INCR)
> +                    || ((threadlocal & BAR_CANCEL_GEN_MASK)
> +                        == (BAR_INCREMENT_CANCEL (val) &
> BAR_CANCEL_GEN_MASK)),
> +                  "threadlocal generation number odd:  %d  (expected
> %d)",
> +                  threadlocal, val);
> +      __atomic_store_n (addr, threadlocal, MEMMODEL_RELAXED);
> +      return;
> +    }
> +  futex_wait (addr2, val2);
> +}
> +
> +#endif
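
The intended call shape, as I read it (illustration only, not code from
the patch; the `expected_*' names are made up), is a wait on two words at
once that returns when either one changes:

    /* Primary thread waiting for thread `i' to arrive, while still being
       woken if the global generation word changes (tasks, cancellation,
       the barrier completing, ...).  */
    futex_waitv ((int *) &bar->threadgens[i].gen, expected_threadgen,
                 (int *) &bar->generation, expected_generation);

On kernels without SYS_futex_waitv the fallback above emulates this by
publishing PRIMARY_WAITING_TG in the thread-local word and then doing a
plain futex_wait on the second address.
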
> diff --git a/libgomp/config/linux/wait.h b/libgomp/config/linux/wait.h
> index c15e035dd56..c197da5ea24 100644
> --- a/libgomp/config/linux/wait.h
> +++ b/libgomp/config/linux/wait.h
> @@ -35,6 +35,8 @@
> 
>  #define FUTEX_WAIT     0
>  #define FUTEX_WAKE     1
> +
> +#define FUTEX_32 2
>  #define FUTEX_PRIVATE_FLAG     128
> 
>  #ifdef HAVE_ATTRIBUTE_VISIBILITY
> @@ -45,14 +47,21 @@ extern int gomp_futex_wait, gomp_futex_wake;
> 
>  #include <futex.h>
> 
> -static inline int do_spin (int *addr, int val)
> +static inline unsigned long long
> +spin_count ()
>  {
> -  unsigned long long i, count = gomp_spin_count_var;
> -
> +  unsigned long long count = gomp_spin_count_var;
>    if (__builtin_expect (__atomic_load_n (&gomp_managed_threads,
>                                           MEMMODEL_RELAXED)
>                          > gomp_available_cpus, 0))
>      count = gomp_throttled_spin_count_var;
> +  return count;
> +}
> +
> +static inline int
> +do_spin (int *addr, int val)
> +{
> +  unsigned long long i, count = spin_count ();
>    for (i = 0; i < count; i++)
>      if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_RELAXED) !=
> val, 0))
>        return 0;
> diff --git a/libgomp/config/posix/bar.c b/libgomp/config/posix/bar.c
> index a86b2f38c2d..6c8a4c6d7d2 100644
> --- a/libgomp/config/posix/bar.c
> +++ b/libgomp/config/posix/bar.c
> @@ -105,13 +105,14 @@ gomp_barrier_wait_end (gomp_barrier_t *bar,
> gomp_barrier_state_t state)
>  }
> 
>  void
> -gomp_barrier_wait (gomp_barrier_t *barrier)
> +gomp_barrier_wait (gomp_barrier_t *barrier, unsigned id)
>  {
> -  gomp_barrier_wait_end (barrier, gomp_barrier_wait_start (barrier));
> +  gomp_barrier_wait_end (barrier, gomp_barrier_wait_start (barrier,
> id));
>  }
> 
>  void
> -gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
> state)
> +gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
> state,
> +                           unsigned id __attribute__ ((unused)))
>  {
>    unsigned int n;
> 
> @@ -127,7 +128,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
> gomp_barrier_state_t state)
>         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
>        if (task_count)
>         {
> -         gomp_barrier_handle_tasks (state);
> +         gomp_barrier_handle_tasks (state, false);
>           if (n > 0)
>             gomp_sem_wait (&bar->sem2);
>           gomp_mutex_unlock (&bar->mutex1);
> @@ -154,7 +155,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
> gomp_barrier_state_t state)
>           gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
>           if (gen & BAR_TASK_PENDING)
>             {
> -             gomp_barrier_handle_tasks (state);
> +             gomp_barrier_handle_tasks (state, false);
>               gen = __atomic_load_n (&bar->generation,
> MEMMODEL_ACQUIRE);
>             }
>         }
> @@ -175,7 +176,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
> gomp_barrier_state_t state)
> 
>  bool
>  gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
> -                                  gomp_barrier_state_t state)
> +                                  gomp_barrier_state_t state,
> +                                  unsigned id __attribute__
> ((unused)))
>  {
>    unsigned int n;
> 
> @@ -191,7 +193,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t
> *bar,
>         = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);
>        if (task_count)
>         {
> -         gomp_barrier_handle_tasks (state);
> +         gomp_barrier_handle_tasks (state, true);
>           if (n > 0)
>             gomp_sem_wait (&bar->sem2);
>           gomp_mutex_unlock (&bar->mutex1);
> @@ -226,7 +228,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t
> *bar,
>             break;
>           if (gen & BAR_TASK_PENDING)
>             {
> -             gomp_barrier_handle_tasks (state);
> +             gomp_barrier_handle_tasks (state, true);
>               gen = __atomic_load_n (&bar->generation,
> MEMMODEL_ACQUIRE);
>               if (gen & BAR_CANCELLED)
>                 break;
> @@ -251,9 +253,10 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t
> *bar,
>  }
> 
>  void
> -gomp_team_barrier_wait (gomp_barrier_t *barrier)
> +gomp_team_barrier_wait (gomp_barrier_t *barrier, unsigned id)
>  {
> -  gomp_team_barrier_wait_end (barrier, gomp_barrier_wait_start
> (barrier));
> +  gomp_team_barrier_wait_end (barrier, gomp_barrier_wait_start
> (barrier, id),
> +                             id);
>  }
> 
>  void
> @@ -266,10 +269,10 @@ gomp_team_barrier_wake (gomp_barrier_t *bar, int
> count)
>  }
> 
>  bool
> -gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
> +gomp_team_barrier_wait_cancel (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_barrier_state_t state = gomp_barrier_wait_cancel_start (bar);
> -  return gomp_team_barrier_wait_cancel_end (bar, state);
> +  gomp_barrier_state_t state = gomp_barrier_wait_cancel_start (bar,
> id);
> +  return gomp_team_barrier_wait_cancel_end (bar, state, id);
>  }
> 
>  void
> diff --git a/libgomp/config/posix/bar.h b/libgomp/config/posix/bar.h
> index 35b94e43ce2..928b12a14ff 100644
> --- a/libgomp/config/posix/bar.h
> +++ b/libgomp/config/posix/bar.h
> @@ -62,20 +62,21 @@ extern void gomp_barrier_init (gomp_barrier_t *,
> unsigned);
>  extern void gomp_barrier_reinit (gomp_barrier_t *, unsigned);
>  extern void gomp_barrier_destroy (gomp_barrier_t *);
> 
> -extern void gomp_barrier_wait (gomp_barrier_t *);
> +extern void gomp_barrier_wait (gomp_barrier_t *, unsigned);
>  extern void gomp_barrier_wait_end (gomp_barrier_t *,
> gomp_barrier_state_t);
> -extern void gomp_team_barrier_wait (gomp_barrier_t *);
> -extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
> -                                       gomp_barrier_state_t);
> -extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
> +extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned);
> +extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
> gomp_barrier_state_t,
> +                                       unsigned);
> +extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *,
> unsigned);
>  extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
> -                                              gomp_barrier_state_t);
> +                                              gomp_barrier_state_t,
> unsigned);
>  extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
>  struct gomp_team;
>  extern void gomp_team_barrier_cancel (struct gomp_team *);
> 
>  static inline gomp_barrier_state_t
> -gomp_barrier_wait_start (gomp_barrier_t *bar)
> +gomp_barrier_wait_start (gomp_barrier_t *bar,
> +                        unsigned id __attribute__ ((unused)))
>  {
>    unsigned int ret;
>    gomp_mutex_lock (&bar->mutex1);
> @@ -86,7 +87,8 @@ gomp_barrier_wait_start (gomp_barrier_t *bar)
>  }
> 
>  static inline gomp_barrier_state_t
> -gomp_barrier_wait_cancel_start (gomp_barrier_t *bar)
> +gomp_barrier_wait_cancel_start (gomp_barrier_t *bar,
> +                               unsigned id __attribute__ ((unused)))
>  {
>    unsigned int ret;
>    gomp_mutex_lock (&bar->mutex1);
> @@ -99,9 +101,9 @@ gomp_barrier_wait_cancel_start (gomp_barrier_t
> *bar)
>  }
> 
>  static inline void
> -gomp_team_barrier_wait_final (gomp_barrier_t *bar)
> +gomp_team_barrier_wait_final (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_team_barrier_wait (bar);
> +  gomp_team_barrier_wait (bar, id);
>  }
> 
>  static inline bool
> @@ -111,9 +113,9 @@ gomp_barrier_last_thread (gomp_barrier_state_t
> state)
>  }
> 
>  static inline void
> -gomp_barrier_wait_last (gomp_barrier_t *bar)
> +gomp_barrier_wait_last (gomp_barrier_t *bar, unsigned id)
>  {
> -  gomp_barrier_wait (bar);
> +  gomp_barrier_wait (bar, id);
>  }
> 
>  /* All the inlines below must be called with team->task_lock
> @@ -150,7 +152,8 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
>  }
> 
>  static inline void
> -gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
> state)
> +gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t
> state,
> +                       unsigned use_cancel __attribute__ ((unused)))
>  {
>    /* Need the atomic store for acquire-release synchronisation with
> the
>       load in `gomp_team_barrier_wait_{cancel_,}end`.  See PR112356
> */
> @@ -167,11 +170,81 @@ gomp_barrier_state_is_incremented
> (gomp_barrier_state_t gen,
>  }
> 
>  static inline bool
> -gomp_barrier_has_completed (gomp_barrier_state_t state,
> gomp_barrier_t *bar)
> +gomp_barrier_has_completed (gomp_barrier_state_t state,
> gomp_barrier_t *bar,
> +                           bool use_cancel __attribute__ ((unused)))
>  {
>    /* Handling overflow in the generation.  The "next" state could be
> less than
>       or greater than the current one.  */
>    return gomp_barrier_state_is_incremented (bar->generation, state);
>  }
> 
> +/* Functions dummied out for this implementation.  */
> +static inline void
> +gomp_barrier_prepare_reinit (gomp_barrier_t *bar __attribute__
> ((unused)),
> +                            unsigned id __attribute__ ((unused)))
> +{}
> +
> +static inline void
> +gomp_barrier_minimal_reinit (gomp_barrier_t *bar, unsigned nthreads,
> +                            unsigned num_new_threads __attribute__
> ((unused)))
> +{
> +  gomp_barrier_reinit (bar, nthreads);
> +}
> +
> +static inline void
> +gomp_barrier_reinit_1 (gomp_barrier_t *bar, unsigned nthreads,
> +                      unsigned num_new_threads,
> +                      unsigned long long *new_ids __attribute__
> ((unused)))
> +{
> +  if (num_new_threads)
> +    {
> +      gomp_mutex_lock (&bar->mutex1);
> +      bar->total += num_new_threads;
> +      gomp_mutex_unlock (&bar->mutex1);
> +    }
> +}
> +
> +static inline void
> +gomp_barrier_reinit_2 (gomp_barrier_t *bar, unsigned nthreads)
> +{
> +  gomp_barrier_reinit (bar, nthreads);
> +}
> +
> +static inline bool
> +gomp_barrier_has_space (gomp_barrier_t *bar __attribute__ ((unused)),
> +                       unsigned nthreads __attribute__ ((unused)))
> +{
> +  /* Is there space to handle `nthreads`?  The only thing we need is to
> +     set bar->total to `nthreads`, and we can always do that.  */
> +  return true;
> +}
> +
> +static inline void
> +gomp_team_barrier_ensure_last (gomp_barrier_t *bar __attribute__
> ((unused)),
> +                              unsigned id __attribute__ ((unused)),
> +                              gomp_barrier_state_t state
> +                              __attribute__ ((unused)))
> +{}
> +
> +static inline bool
> +gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *bar
> +                                     __attribute__ ((unused)),
> +                                     unsigned id __attribute__
> ((unused)),
> +                                     gomp_barrier_state_t state
> +                                     __attribute__ ((unused)))
> +{
> +  /* After returning BAR_WAS_LAST, actually ensure that this thread is
> +     last.  Return `true` if this thread is known to be last into the
> +     barrier; return `false` if the barrier got cancelled such that not
> +     all threads entered the barrier.
> +
> +     Since BAR_WAS_LAST is only set for a thread when that thread
> +     decremented the `awaited` counter to zero, we know that all threads
> +     must have entered the barrier.  Hence always return `true`.  */
> +  return true;
> +}
> +
> +static inline void
> +gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar
> +                                         __attribute__ ((unused)),
> +                                         unsigned id __attribute__ ((unused)))
> +{}
> +
>  #endif /* GOMP_BARRIER_H */
> diff --git a/libgomp/config/posix/simple-bar.h
> b/libgomp/config/posix/simple-bar.h
> index 7b4b7e43ea6..12abd0512e8 100644
> --- a/libgomp/config/posix/simple-bar.h
> +++ b/libgomp/config/posix/simple-bar.h
> @@ -43,9 +43,18 @@ gomp_simple_barrier_init (gomp_simple_barrier_t
> *bar, unsigned count)
>  }
> 
>  static inline void
> -gomp_simple_barrier_reinit (gomp_simple_barrier_t *bar, unsigned
> count)
> +gomp_simple_barrier_minimal_reinit (gomp_simple_barrier_t *bar,
> +                                   unsigned nthreads, unsigned
> num_new_threads)
>  {
> -  gomp_barrier_reinit (&bar->bar, count);
> +  gomp_barrier_minimal_reinit (&bar->bar, nthreads, num_new_threads);
> +}
> +
> +static inline void
> +gomp_simple_barrier_reinit_1 (gomp_simple_barrier_t *bar, unsigned
> nthreads,
> +                             unsigned num_new_threads,
> +                             unsigned long long *new_ids)
> +{
> +  gomp_barrier_reinit_1 (&bar->bar, nthreads, num_new_threads,
> new_ids);
>  }
> 
>  static inline void
> @@ -55,15 +64,33 @@ gomp_simple_barrier_destroy (gomp_simple_barrier_t
> *bar)
>  }
> 
>  static inline void
> -gomp_simple_barrier_wait (gomp_simple_barrier_t *bar)
> +gomp_simple_barrier_wait (gomp_simple_barrier_t *bar, unsigned id)
> +{
> +  gomp_barrier_wait (&bar->bar, id);
> +}
> +
> +static inline void
> +gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar, unsigned
> id)
>  {
> -  gomp_barrier_wait (&bar->bar);
> +  gomp_barrier_wait_last (&bar->bar, id);
>  }
> 
>  static inline void
> -gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar)
> +gomp_simple_barrier_prepare_reinit (gomp_simple_barrier_t *sbar,
> unsigned id)
> +{
> +  gomp_barrier_prepare_reinit (&sbar->bar, id);
> +}
> +
> +static inline void
> +gomp_simple_barrier_reinit_2 (gomp_simple_barrier_t *sbar, unsigned
> nthreads)
> +{
> +  gomp_barrier_reinit_2 (&sbar->bar, nthreads);
> +}
> +
> +static inline bool
> +gomp_simple_barrier_has_space (gomp_simple_barrier_t *sbar, unsigned
> nthreads)
>  {
> -  gomp_barrier_wait_last (&bar->bar);
> +  return gomp_barrier_has_space (&sbar->bar, nthreads);
>  }
> 
>  #endif /* GOMP_SIMPLE_BARRIER_H */
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index a43398300a5..e0459046bc9 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -59,6 +59,7 @@
>  #include <stdbool.h>
>  #include <stdlib.h>
>  #include <stdarg.h>
> +#include <limits.h>
> 
>  /* Needed for memset in priority_queue.c.  */
>  #if _LIBGOMP_CHECKING_
> @@ -200,6 +201,14 @@ extern void gomp_vfatal (const char *, va_list)
>  extern void gomp_fatal (const char *, ...)
>         __attribute__ ((noreturn, format (printf, 1, 2)));
> 
> +#if _LIBGOMP_CHECKING_
> +#define gomp_assert(EXPR, MSG, ...)
> \
> +  if (!(EXPR))
> \
> +  gomp_fatal ("%s:%d " MSG, __FILE__, __LINE__, __VA_ARGS__)
> +#else
> +#define gomp_assert(...)
> +#endif
> +
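
A usage sketch for the new gomp_assert (my example, `id' is just a
placeholder variable): the format arguments are pasted via __VA_ARGS__
without `##', so every call needs at least one argument after the format
string (all the call sites in the patch do this), and in non-checking
builds the whole statement compiles away.

    gomp_assert (id == 0, "unexpected thread id %u", id);
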
>  struct gomp_task;
>  struct gomp_taskgroup;
>  struct htab;
> @@ -1099,7 +1108,7 @@ extern unsigned gomp_dynamic_max_threads (void);
>  extern void gomp_init_task (struct gomp_task *, struct gomp_task *,
>                             struct gomp_task_icv *);
>  extern void gomp_end_task (void);
> -extern void gomp_barrier_handle_tasks (gomp_barrier_state_t);
> +extern void gomp_barrier_handle_tasks (gomp_barrier_state_t, bool);
>  extern void gomp_task_maybe_wait_for_dependencies (void **);
>  extern bool gomp_create_target_task (struct gomp_device_descr *,
>                                      void (*) (void *), size_t, void
> **,
> diff --git a/libgomp/single.c b/libgomp/single.c
> index 397501daeb3..abba2976586 100644
> --- a/libgomp/single.c
> +++ b/libgomp/single.c
> @@ -77,7 +77,7 @@ GOMP_single_copy_start (void)
>      }
>    else
>      {
> -      gomp_team_barrier_wait (&thr->ts.team->barrier);
> +      gomp_team_barrier_wait (&thr->ts.team->barrier, thr-
> >ts.team_id);
> 
>        ret = thr->ts.work_share->copyprivate;
>        gomp_work_share_end_nowait ();
> @@ -98,7 +98,7 @@ GOMP_single_copy_end (void *data)
>    if (team != NULL)
>      {
>        thr->ts.work_share->copyprivate = data;
> -      gomp_team_barrier_wait (&team->barrier);
> +      gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
>      }
> 
>    gomp_work_share_end_nowait ();
> diff --git a/libgomp/task.c b/libgomp/task.c
> index 5965e781f7e..658b51c1fd2 100644
> --- a/libgomp/task.c
> +++ b/libgomp/task.c
> @@ -1549,7 +1549,7 @@ gomp_task_run_post_remove_taskgroup (struct
> gomp_task *child_task)
>  }
> 
>  void
> -gomp_barrier_handle_tasks (gomp_barrier_state_t state)
> +gomp_barrier_handle_tasks (gomp_barrier_state_t state, bool
> use_cancel)
>  {
>    struct gomp_thread *thr = gomp_thread ();
>    struct gomp_team *team = thr->ts.team;
> @@ -1570,7 +1570,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t
> state)
>       When `task_count == 0` we're not going to perform tasks anyway,
> so the
>       problem of PR122314 is naturally avoided.  */
>    if (team->task_count != 0
> -      && gomp_barrier_has_completed (state, &team->barrier))
> +      && gomp_barrier_has_completed (state, &team->barrier,
> use_cancel))
>      {
>        gomp_mutex_unlock (&team->task_lock);
>        return;
> @@ -1580,7 +1580,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t
> state)
>      {
>        if (team->task_count == 0)
>         {
> -         gomp_team_barrier_done (&team->barrier, state);
> +         gomp_team_barrier_done (&team->barrier, state, use_cancel);
>           gomp_mutex_unlock (&team->task_lock);
>           gomp_team_barrier_wake (&team->barrier, 0);
>           return;
> @@ -1617,7 +1617,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t
> state)
>        else if (team->task_count == 0
>                && gomp_team_barrier_waiting_for_tasks (&team-
> >barrier))
>         {
> -         gomp_team_barrier_done (&team->barrier, state);
> +         gomp_team_barrier_done (&team->barrier, state, use_cancel);
>           gomp_mutex_unlock (&team->task_lock);
>           gomp_team_barrier_wake (&team->barrier, 0);
>           if (to_free)
> @@ -2243,7 +2243,7 @@ GOMP_taskgroup_end (void)
>          is #pragma omp target nowait that creates an implicit
>          team with a single thread.  In this case, we want to wait
>          for all outstanding tasks in this team.  */
> -      gomp_team_barrier_wait (&team->barrier);
> +      gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
>        return;
>      }
> 
> @@ -2698,7 +2698,7 @@ GOMP_workshare_task_reduction_unregister (bool
> cancelled)
>      htab_free ((struct htab *) data[5]);
> 
>    if (!cancelled)
> -    gomp_team_barrier_wait (&team->barrier);
> +    gomp_team_barrier_wait (&team->barrier, thr->ts.team_id);
>  }
> 
>  int
> diff --git a/libgomp/team.c b/libgomp/team.c
> index cb1d3235312..512b1368af6 100644
> --- a/libgomp/team.c
> +++ b/libgomp/team.c
> @@ -109,28 +109,28 @@ gomp_thread_start (void *xdata)
>        struct gomp_team *team = thr->ts.team;
>        struct gomp_task *task = thr->task;
> 
> -      gomp_barrier_wait (&team->barrier);
> +      gomp_barrier_wait (&team->barrier, thr->ts.team_id);
> 
>        local_fn (local_data);
> -      gomp_team_barrier_wait_final (&team->barrier);
> +      gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
>        gomp_finish_task (task);
> -      gomp_barrier_wait_last (&team->barrier);
> +      gomp_barrier_wait_last (&team->barrier, thr->ts.team_id);
>      }
>    else
>      {
>        pool->threads[thr->ts.team_id] = thr;
> 
> -      gomp_simple_barrier_wait (&pool->threads_dock);
> +      gomp_simple_barrier_wait (&pool->threads_dock, thr-
> >ts.team_id);
>        do
>         {
>           struct gomp_team *team = thr->ts.team;
>           struct gomp_task *task = thr->task;
> 
>           local_fn (local_data);
> -         gomp_team_barrier_wait_final (&team->barrier);
> +         gomp_team_barrier_wait_final (&team->barrier, thr-
> >ts.team_id);
>           gomp_finish_task (task);
> 
> -         gomp_simple_barrier_wait (&pool->threads_dock);
> +         gomp_simple_barrier_wait (&pool->threads_dock, thr-
> >ts.team_id);
> 
>           local_fn = thr->fn;
>           local_data = thr->data;
> @@ -243,7 +243,7 @@ gomp_free_pool_helper (void *thread_pool)
>    struct gomp_thread *thr = gomp_thread ();
>    struct gomp_thread_pool *pool
>      = (struct gomp_thread_pool *) thread_pool;
> -  gomp_simple_barrier_wait_last (&pool->threads_dock);
> +  gomp_simple_barrier_wait_last (&pool->threads_dock, thr-
> >ts.team_id);
>    gomp_sem_destroy (&thr->release);
>    thr->thread_pool = NULL;
>    thr->task = NULL;
> @@ -278,10 +278,10 @@ gomp_free_thread (void *arg
> __attribute__((unused)))
>               nthr->data = pool;
>             }
>           /* This barrier undocks threads docked on pool-
> >threads_dock.  */
> -         gomp_simple_barrier_wait (&pool->threads_dock);
> +         gomp_simple_barrier_wait (&pool->threads_dock, thr-
> >ts.team_id);
>           /* And this waits till all threads have called
> gomp_barrier_wait_last
>              in gomp_free_pool_helper.  */
> -         gomp_simple_barrier_wait (&pool->threads_dock);
> +         gomp_simple_barrier_wait (&pool->threads_dock, thr-
> >ts.team_id);
>           /* Now it is safe to destroy the barrier and free the pool.
> */
>           gomp_simple_barrier_destroy (&pool->threads_dock);
> 
> @@ -457,6 +457,14 @@ gomp_team_start (void (*fn) (void *), void *data,
> unsigned nthreads,
>    else
>      bind = omp_proc_bind_false;
> 
> +  unsigned bits_per_ull = sizeof (unsigned long long) * CHAR_BIT;
> +  int id_arr_len = ((nthreads + pool->threads_used) / bits_per_ull) +
> 1;
> +  unsigned long long new_ids[id_arr_len];
> +  for (int j = 0; j < id_arr_len; j++)
> +    {
> +      new_ids[j] = 0;
> +    }
> +
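
`new_ids' is a plain bitmap over team ids, one bit per id, used further
down (and in gomp_barrier_reinit_1) to tell re-used threads apart from
freshly spawned ones.  A made-up example of the indexing:

    /* Marking team id 70 with 64-bit `unsigned long long' words sets bit
       6 of the second word, i.e. new_ids[1] |= 0x40.  */
    new_ids[70 / bits_per_ull] |= 1ULL << (70 % bits_per_ull);
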
>    /* We only allow the reuse of idle threads for non-nested PARALLEL
>       regions.  This appears to be implied by the semantics of
>       threadprivate variables, but perhaps that's reading too much
> into
> @@ -464,6 +472,11 @@ gomp_team_start (void (*fn) (void *), void *data,
> unsigned nthreads,
>       only the initial program thread will modify gomp_threads.  */
>    if (!nested)
>      {
> +      /* The current thread is always re-used in the next team.  */
> +      unsigned total_reused = 1;
> +      gomp_assert (team->prev_ts.team_id == 0,
> +                  "Starting a team from thread with id %u in previous
> team\n",
> +                  team->prev_ts.team_id);
>        old_threads_used = pool->threads_used;
> 
>        if (nthreads <= old_threads_used)
> @@ -474,13 +487,7 @@ gomp_team_start (void (*fn) (void *), void *data,
> unsigned nthreads,
>           gomp_simple_barrier_init (&pool->threads_dock, nthreads);
>         }
>        else
> -       {
> -         n = old_threads_used;
> -
> -         /* Increase the barrier threshold to make sure all new
> -            threads arrive before the team is released.  */
> -         gomp_simple_barrier_reinit (&pool->threads_dock, nthreads);
> -       }
> +       n = old_threads_used;
> 
>        /* Not true yet, but soon will be.  We're going to release all
>          threads from the dock, and those that aren't part of the
> @@ -502,6 +509,7 @@ gomp_team_start (void (*fn) (void *), void *data,
> unsigned nthreads,
>         }
> 
>        /* Release existing idle threads.  */
> +      bool have_prepared = false;
>        for (; i < n; ++i)
>         {
>           unsigned int place_partition_off = thr-
> >ts.place_partition_off;
> @@ -643,7 +651,23 @@ gomp_team_start (void (*fn) (void *), void *data,
> unsigned nthreads,
>           nthr->ts.team = team;
>           nthr->ts.work_share = &team->work_shares[0];
>           nthr->ts.last_work_share = NULL;
> +         /* If we're changing any thread's team_id then we need to wait
> +            for all other threads to have reached the barrier.  */
> +         if (nthr->ts.team_id != i && !have_prepared)
> +           {
> +             gomp_simple_barrier_prepare_reinit (&pool->threads_dock,
> +                                                 thr->ts.team_id);
> +             have_prepared = true;
> +           }
>           nthr->ts.team_id = i;
> +         {
> +           unsigned idx = (i / bits_per_ull);
> +           gomp_assert (!(new_ids[idx] & (1ULL << (i %
> bits_per_ull))),
> +                        "new_ids[%u] == %llu (for `i` %u)", idx,
> new_ids[idx],
> +                        i);
> +           new_ids[idx] |= (1ULL << (i % bits_per_ull));
> +         }
> +         total_reused += 1;
>           nthr->ts.level = team->prev_ts.level + 1;
>           nthr->ts.active_level = thr->ts.active_level;
>           nthr->ts.place_partition_off = place_partition_off;
> @@ -714,13 +738,54 @@ gomp_team_start (void (*fn) (void *), void
> *data, unsigned nthreads,
>                     }
>                   break;
>                 }
> +           }
> +       }
> 
> -             /* Increase the barrier threshold to make sure all new
> -                threads and all the threads we're going to let die
> -                arrive before the team is released.  */
> -             if (affinity_count)
> -               gomp_simple_barrier_reinit (&pool->threads_dock,
> -                                           nthreads +
> affinity_count);
> +      /* If we are changing the number of threads, *or* if we are
> +        starting new threads for any reason, then update the barrier
> +        accordingly.
> +
> +        The handling of the barrier here differs between the different
> +        barrier designs.
> +
> +        The `posix/bar.h` design needs to "grow" to accommodate the
> +        extra threads that we'll wait on, then eventually "shrink" to
> +        the size we want.
> +
> +        The `linux/bar.h` design needs to assign positions for each
> +        thread.  Some of the threads being started will want the
> +        position of a thread that is currently running.  Hence we need
> +        to (1) serialise existing threads and then (2) set up barrier
> +        state for the incoming new threads.  Once this is done we don't
> +        need any equivalent of the "shrink" step later.  This does
> +        result in a longer period of serialisation than the posix/bar.h
> +        design, but that seems a fair trade-off for the design that is
> +        faster under contention.  */
> +      if (old_threads_used != 0
> +         && (nthreads != pool->threads_dock.bar.total || i <
> nthreads))
> +       {
> +         /* If all we've done is increase the number of threads that we
> +            want, we don't need to serialise anything (wake flags don't
> +            need to be adjusted).  */
> +         if (nthreads > old_threads_used && affinity_count == 0
> +             && total_reused == old_threads_used
> +             /* `have_prepared` can be used to detect whether we re-shuffled
> +                any threads around.  */
> +             && !have_prepared
> +             && gomp_simple_barrier_has_space (&pool->threads_dock, nthreads))
> +           gomp_simple_barrier_minimal_reinit (&pool->threads_dock, nthreads,
> +                                               nthreads - old_threads_used);
> +         else
> +           {
> +             /* Otherwise, we need to ensure that we've paused all existing
> +                threads (waiting on us to restart them) before adjusting
> +                their wake flags.  */
> +             if (!have_prepared)
> +               gomp_simple_barrier_prepare_reinit (&pool->threads_dock,
> +                                                   thr->ts.team_id);
> +             gomp_simple_barrier_reinit_1 (&pool->threads_dock, nthreads,
> +                                           nthreads <= total_reused
> +                                             ? 0
> +                                             : nthreads - total_reused,
> +                                           new_ids);
>             }
>         }
> 
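> As a rough standalone sketch of the decision above: the names here
> (`choose_reinit`, the `REINIT_*` values, and the plain parameters, with
> `started` standing in for the loop counter `i`) are invented for
> illustration and are not libgomp internals; only the shape of the
> condition mirrors the code.
> 
>   #include <stdbool.h>
>   #include <stdio.h>
> 
>   enum reinit_kind { REINIT_NONE, REINIT_MINIMAL, REINIT_FULL };
> 
>   static enum reinit_kind
>   choose_reinit (unsigned nthreads, unsigned old_threads_used,
>                  unsigned barrier_total, unsigned started,
>                  unsigned affinity_count, unsigned total_reused,
>                  bool have_prepared, bool barrier_has_space)
>   {
>     /* Nothing to do on the first start, or when neither the size nor the
>        set of running threads changes.  */
>     if (old_threads_used == 0
>         || (nthreads == barrier_total && started >= nthreads))
>       return REINIT_NONE;
>     /* Pure growth: every old thread keeps its slot, so no serialisation
>        is needed.  */
>     if (nthreads > old_threads_used && affinity_count == 0
>         && total_reused == old_threads_used && !have_prepared
>         && barrier_has_space)
>       return REINIT_MINIMAL;
>     /* Otherwise pause the existing threads and rewrite their wake slots.  */
>     return REINIT_FULL;
>   }
> 
>   int
>   main (void)
>   {
>     /* Growing 4 -> 8 with every old thread keeping its slot: minimal (1).  */
>     printf ("grow 4->8:   %d\n", choose_reinit (8, 4, 4, 4, 0, 4, false, true));
>     /* Shrinking 8 -> 4: existing threads must be paused first: full (2).  */
>     printf ("shrink 8->4: %d\n", choose_reinit (4, 8, 8, 4, 0, 4, false, true));
>     return 0;
>   }
> 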
> @@ -868,9 +933,9 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
> 
>   do_release:
>    if (nested)
> -    gomp_barrier_wait (&team->barrier);
> +    gomp_barrier_wait (&team->barrier, thr->ts.team_id);
>    else
> -    gomp_simple_barrier_wait (&pool->threads_dock);
> +    gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
> 
>    /* Decrease the barrier threshold to match the number of threads
>       that should arrive back at the end of this team.  The extra
> @@ -888,8 +953,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>        if (affinity_count)
>         diff = -affinity_count;
> 
> -      gomp_simple_barrier_reinit (&pool->threads_dock, nthreads);
> -
> +      gomp_simple_barrier_reinit_2 (&pool->threads_dock, nthreads);
>  #ifdef HAVE_SYNC_BUILTINS
>        __sync_fetch_and_add (&gomp_managed_threads, diff);
>  #else
> @@ -949,12 +1013,13 @@ gomp_team_end (void)
>  {
>    struct gomp_thread *thr = gomp_thread ();
>    struct gomp_team *team = thr->ts.team;
> +  unsigned team_id = thr->ts.team_id;
> 
>    /* This barrier handles all pending explicit threads.
>       As #pragma omp cancel parallel might get awaited count in
>       team->barrier in an inconsistent state, we need to use a different
>       counter here.  */
> -  gomp_team_barrier_wait_final (&team->barrier);
> +  gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
>    if (__builtin_expect (team->team_cancelled, 0))
>      {
>        struct gomp_work_share *ws = team->work_shares_to_free;
> @@ -985,7 +1050,7 @@ gomp_team_end (void)
>  #endif
>        /* This barrier has gomp_barrier_wait_last counterparts
>          and ensures the team can be safely destroyed.  */
> -      gomp_barrier_wait (&team->barrier);
> +      gomp_barrier_wait (&team->barrier, team_id);
>      }
> 
>    if (__builtin_expect (team->work_shares[0].next_alloc != NULL, 0))
> @@ -1049,7 +1114,7 @@ gomp_pause_pool_helper (void *thread_pool)
>    struct gomp_thread *thr = gomp_thread ();
>    struct gomp_thread_pool *pool
>      = (struct gomp_thread_pool *) thread_pool;
> -  gomp_simple_barrier_wait_last (&pool->threads_dock);
> +  gomp_simple_barrier_wait_last (&pool->threads_dock, thr->ts.team_id);
>    gomp_sem_destroy (&thr->release);
>    thr->thread_pool = NULL;
>    thr->task = NULL;
> @@ -1081,10 +1146,10 @@ gomp_pause_host (void)
>               thrs[i] = gomp_thread_to_pthread_t (nthr);
>             }
>           /* This barrier undocks threads docked on pool->threads_dock.  */
> -         gomp_simple_barrier_wait (&pool->threads_dock);
> +         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
>           /* And this waits till all threads have called gomp_barrier_wait_last
>              in gomp_pause_pool_helper.  */
> -         gomp_simple_barrier_wait (&pool->threads_dock);
> +         gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
>           /* Now it is safe to destroy the barrier and free the pool.  */
>           gomp_simple_barrier_destroy (&pool->threads_dock);
> 
> diff --git a/libgomp/testsuite/libgomp.c/primary-thread-tasking.c b/libgomp/testsuite/libgomp.c/primary-thread-tasking.c
> new file mode 100644
> index 00000000000..4b70dca30a2
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.c/primary-thread-tasking.c
> @@ -0,0 +1,80 @@
> +/* Test that the primary thread can execute some tasks while waiting for
> +   other threads.  This checks an edge case in a recent implementation of
> +   the barrier tasking mechanism.
> +   I don't believe there's any way to guarantee that a task will be run on
> +   a given thread, so there is nowhere to put an `abort` and declare
> +   failure.  However, I can set things up so that the test times out if the
> +   primary thread never executes a task; that timeout will at least count
> +   as a fail.
> +
> +   The idea is to keep spawning tasks until one is handled by the primary
> +   thread, while giving the secondary threads plenty of opportunities to
> +   sleep and let the primary thread take a task.  */
> +/* { dg-do run  { target *-linux-* } } */
> +
> +#define _GNU_SOURCE
> +#include <omp.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <linux/futex.h>
> +#include <assert.h>
> +#include <stdatomic.h>
> +#include <limits.h>
> +
> +int wake_flag = 0;
> +
> +void
> +continue_until_on_thread0 ()
> +{
> +  if (omp_get_thread_num () == 0)
> +    {
> +      __atomic_fetch_add (&wake_flag, 1, memory_order_relaxed);
> +      syscall (SYS_futex, &wake_flag, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
> +              INT_MAX);
> +    }
> +  else
> +    {
> +      /* If the flag has been set we're done.  Otherwise put another few
> +        tasks on the task queue and wait.  */
> +      if (__atomic_load_n (&wake_flag, memory_order_relaxed))
> +       {
> +         return;
> +       }
> +#pragma omp task
> +      continue_until_on_thread0 ();
> +#pragma omp task
> +      continue_until_on_thread0 ();
> +#pragma omp task
> +      continue_until_on_thread0 ();
> +      syscall (SYS_futex, &wake_flag, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0,
> +              NULL);
> +    }
> +}
> +
> +void
> +foo ()
> +{
> +#pragma omp parallel
> +  {
> +    if (omp_get_thread_num () != 0)
> +      {
> +#pragma omp task
> +       continue_until_on_thread0 ();
> +#pragma omp task
> +       continue_until_on_thread0 ();
> +       /* Wait for the primary thread to have executed one of the tasks.  */
> +       int val = __atomic_load_n (&wake_flag, memory_order_acquire);
> +       while (val == 0)
> +         {
> +           syscall (SYS_futex, &wake_flag, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
> +                    val, NULL);
> +           val = __atomic_load_n (&wake_flag, memory_order_acquire);
> +         }
> +      }
> +  }
> +}
> +
> +int
> +main ()
> +{
> +  foo ();
> +  return 0;
> +}
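> 
> The futex usage in the test is the usual check-then-wait handshake.  As a
> minimal standalone illustration (Linux-only, plain pthreads, nothing
> libgomp-specific): the waiter re-checks the flag around FUTEX_WAIT, so a
> wake that races with the check cannot be lost.
> 
>   #define _GNU_SOURCE
>   #include <linux/futex.h>
>   #include <limits.h>
>   #include <pthread.h>
>   #include <stdio.h>
>   #include <sys/syscall.h>
>   #include <unistd.h>
> 
>   static int flag;  /* the futex word */
> 
>   static void *
>   waker (void *arg)
>   {
>     (void) arg;
>     __atomic_store_n (&flag, 1, __ATOMIC_RELEASE);
>     /* Wake every thread currently sleeping on &flag.  */
>     syscall (SYS_futex, &flag, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, INT_MAX);
>     return NULL;
>   }
> 
>   int
>   main (void)
>   {
>     pthread_t t;
>     pthread_create (&t, NULL, waker, NULL);
>     /* Sleep only while the word still holds the value we saw (0);
>        FUTEX_WAIT returns immediately if it has already changed.  */
>     while (__atomic_load_n (&flag, __ATOMIC_ACQUIRE) == 0)
>       syscall (SYS_futex, &flag, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, NULL);
>     pthread_join (t, NULL);
>     puts ("woken");
>     return 0;
>   }
> 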
> diff --git a/libgomp/work.c b/libgomp/work.c
> index eae972ae2d1..b75b97da181 100644
> --- a/libgomp/work.c
> +++ b/libgomp/work.c
> @@ -240,10 +240,14 @@ gomp_work_share_end (void)
>        return;
>      }
> 
> -  bstate = gomp_barrier_wait_start (&team->barrier);
> +  bstate = gomp_barrier_wait_start (&team->barrier, thr->ts.team_id);
> 
>    if (gomp_barrier_last_thread (bstate))
>      {
> +      /* For some targets the state returning "last" no longer indicates
> +        that we're *already* last; instead it indicates that we *should be*
> +        last.  Perform the relevant synchronisation.  */
> +      gomp_team_barrier_ensure_last (&team->barrier, thr->ts.team_id, bstate);
>        if (__builtin_expect (thr->ts.last_work_share != NULL, 1))
>         {
>           team->work_shares_to_free = thr->ts.work_share;
> @@ -251,7 +255,7 @@ gomp_work_share_end (void)
>         }
>      }
> 
> -  gomp_team_barrier_wait_end (&team->barrier, bstate);
> +  gomp_team_barrier_wait_end (&team->barrier, bstate, thr->ts.team_id);
>    thr->ts.last_work_share = NULL;
>  }
> 
> @@ -266,19 +270,27 @@ gomp_work_share_end_cancel (void)
>    gomp_barrier_state_t bstate;
> 
>    /* Cancellable work sharing constructs cannot be orphaned.  */
> -  bstate = gomp_barrier_wait_cancel_start (&team->barrier);
> +  bstate = gomp_barrier_wait_cancel_start (&team->barrier, thr->ts.team_id);
> 
>    if (gomp_barrier_last_thread (bstate))
>      {
> -      if (__builtin_expect (thr->ts.last_work_share != NULL, 1))
> +      if (gomp_team_barrier_ensure_cancel_last (&team->barrier, thr->ts.team_id,
> +                                               bstate))
>         {
> -         team->work_shares_to_free = thr->ts.work_share;
> -         free_work_share (team, thr->ts.last_work_share);
> +         if (__builtin_expect (thr->ts.last_work_share != NULL, 1))
> +           {
> +             team->work_shares_to_free = thr->ts.work_share;
> +             free_work_share (team, thr->ts.last_work_share);
> +           }
>         }
> +      else
> +       gomp_reset_cancellable_primary_threadgen (&team->barrier,
> +                                                 thr->ts.team_id);
>      }
>    thr->ts.last_work_share = NULL;
> 
> -  return gomp_team_barrier_wait_cancel_end (&team->barrier, bstate);
> +  return gomp_team_barrier_wait_cancel_end (&team->barrier, bstate,
> +                                           thr->ts.team_id);
>  }
> 
>  /* The current thread is done with its current work sharing construct.
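> 
> To illustrate what the "ensure last" step buys the caller: in a flat
> arrive/release cycle the thread elected "last" must first observe every
> other thread's arrival flag before it may touch shared team state.  A toy,
> standalone version of that cycle (plain pthreads and spin-waits, invented
> names, not the libgomp implementation):
> 
>   #include <pthread.h>
>   #include <stdio.h>
> 
>   #define NTHREADS 4
> 
>   static int arrived[NTHREADS];   /* per-thread arrival flags */
>   static int generation;          /* bumped by thread 0 to release the team */
>   static int serial_work_done;    /* state only the "last" thread may touch */
> 
>   static void
>   toy_barrier (int id)
>   {
>     int gen = __atomic_load_n (&generation, __ATOMIC_RELAXED);
>     __atomic_store_n (&arrived[id], 1, __ATOMIC_RELEASE);
>     if (id == 0)
>       {
>         /* "Should be last": wait until every peer has really arrived.  */
>         for (int i = 1; i < NTHREADS; i++)
>           while (!__atomic_load_n (&arrived[i], __ATOMIC_ACQUIRE))
>             ;
>         serial_work_done = 1;       /* safe: all peers are parked below */
>         for (int i = 0; i < NTHREADS; i++)
>           arrived[i] = 0;
>         __atomic_store_n (&generation, gen + 1, __ATOMIC_RELEASE);
>       }
>     else
>       while (__atomic_load_n (&generation, __ATOMIC_ACQUIRE) == gen)
>         ;                           /* spin until thread 0 releases us */
>   }
> 
>   static void *
>   worker (void *arg)
>   {
>     toy_barrier ((int) (long) arg);
>     return NULL;
>   }
> 
>   int
>   main (void)
>   {
>     pthread_t t[NTHREADS];
>     for (long i = 0; i < NTHREADS; i++)
>       pthread_create (&t[i], NULL, worker, (void *) i);
>     for (int i = 0; i < NTHREADS; i++)
>       pthread_join (t[i], NULL);
>     printf ("serial work done: %d\n", serial_work_done);
>     return 0;
>   }
> 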
> --
> 2.43.0

