Hi!

Matthew Malcomson <[email protected]> writes:

> From: Matthew Malcomson <[email protected]>
>
> This patch is a separate beneficial step of the work I'm doing towards
> PR119588.  In that bug I describe how some internal workloads are
> hitting performance problems w.r.t. the creation of many parallel
> regions one after another.
>
> I'm attempting to reduce overhead of parallel region creation when
> using a large number of threads.  The two patches earlier in this
> patchset rework the barrier implementation to something that performs
> much better in this particular case.
>
> This patch is built on top of those, but with an appropriate rebase it
> could be independent.  It addresses a different area where I've seen a
> performance impact -- the area I mentioned in the mail below to the GCC
> mailing list:
> https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html
>
> The fundamental point is that the cost of teardown & setup of a parallel
> region for LLVM is very close to the cost of a single barrier.  For us
> there is a significant chunk of extra work in the actual teardown and
> setup.
>
> When re-using a team with no changes in the affinity ICV etc., it would
> seem that the ideal case is similar to LLVM's (parallel region creation
> being about 16% slower than a barrier for LLVM, as opposed to the 4.3x
> slowdown for 144 threads that I'm seeing at the moment for GNU).
> This patch gets nowhere near that ideal.  What I observe is that the
> majority of the performance overhead in this case comes from
> re-initialising the existing threads.  While all existing threads are
> waiting on the primary thread, that primary thread is iterating over
> each secondary thread's data structures and setting them up as needed
> for the next parallel region.
>
> There are a number of stores that could be relatively easily performed
> in each secondary thread after it is released by the primary thread --
> removing this work from the serialised region.  This patch proposes that
> much simpler setup.  This re-initialisation scales with the number of
> threads -- which is why it has such an impact on the problematic case I
> have been looking at.
>
> The only subtlety here is that, in order to ensure memory safety, we
> must avoid performing this initialisation when `thr->fn == NULL`.  If a
> secondary non-nested thread gets "left to exit" when a new team is
> started the primary thread could race ahead, cache the new team, and
> free the team that this secondary thread was using all before the
> secondary thread goes through the new initialisation code.  That could
> lead to a use-after-free.  Hence we have to early-exit when this thread
> is not going to perform any more work.
> - N.b. When running TSAN before and after I realised that this change
>   happens to remove a race between `gomp_finish_task` in the secondary
>   thread and `gomp_init_task` in the primary thread (when re-using a
>   team and hence `nthr->task` is in the same place as the thread's
>   local `task` variable).
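
To make sure I'm reading the ordering right, here is a rough schematic
of the interleaving that the early exit prevents, as I understand the
description above (the steps are descriptive, not actual libgomp calls):

    primary thread                    secondary thread (thr->fn == NULL)
    --------------                    -----------------------------------
    releases threads_dock
    starts the next parallel region
    caches the new team
    frees this thread's old team
                                      writes to thr->ts.team->...
                                        => use-after-free

With the early exit, the secondary thread breaks out of the loop before
that final write.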
>
> All threads that are not going to be used in the next region would
> unnecessarily re-init their `team` data structure etc., but since
> `thr->fn` is NULL they exit the do/while loop before that, so it makes
> little difference; in any case the work is spread across threads, so it
> shouldn't block other useful work.
>
> I imagine there is a good chunk of performance to be gained in adding
> more logic here:
> 1) Something recording whether the affinity setup has changed since the
>    last non-nested team and, if not, skipping the calculation of places
>    and the assignment to the `nthr->ts.place_partition_*` variables.
> 2) I could move more members that are currently assigned once per
>    secondary thread onto `team`, for each secondary thread to take:
>    - `data` pointer.
>    - `num_teams`.
>    - `team_num`.
>    - Honestly, I should have looked into this in the patch -- I noticed
>      it while writing this cover letter and will look into whether it
>      is feasible relatively soon.
> 3) If we can identify that we're re-using a team from the last parallel
>    region (and the affinity ICV has not changed) it seems that we could
>    avoid re-initialising some of its fields:
>    - ordered_release[i] should already point to `nthr`?
>    - ts.team_id should already be set.
> Overall, if we can identify that we're using the same team as was
> cached and we don't need to change anything, we should be able to get
> away with drastically less work in the serial part of the call to
> GOMP_parallel.
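
Purely as a sketch of the shape (3) could take inside gomp_team_start's
start-threads loop -- `team_unchanged' and `affinity_changed' below are
invented names, not existing libgomp state, and the real loop does
considerably more:

  /* Hypothetical sketch only: skip stores whose values are already
     correct when the team cached in the pool is being reused and
     affinity has not changed since the previous region.  */
  bool team_unchanged = (team == pool->last_team && !affinity_changed);
  for (unsigned i = 1; i < nthreads; i++)
    {
      struct gomp_thread *nthr = pool->threads[i];
      if (!team_unchanged)
        {
          /* These keep their values across re-uses of the same team.  */
          nthr->ts.team_id = i;
          team->ordered_release[i] = &nthr->release;
        }
      /* Always needed: the work for the next region.  */
      nthr->fn = fn;
      nthr->data = data;
    }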
>
> I did run the micro-benchmark I'm looking at (mentioned in PR119588) on
> some hacked-up versions of the above and they seemed to provide
> measurable gains.  I haven't tried to put them all in this patch due to
> lack of time and the relatively higher complexity of the bookkeeping.
> I'm hoping this is a straightforward step in the right direction.
>
> -------
> Testing done (note -- much of the testing was done on top of my other
> patches rather than on this patch by itself):
> - Bootstrap & regtest on x86_64 and aarch64
>   - Testsuite ran with OMP_WAIT_POLICY=PASSIVE as well.
>   - Also done with `--enable-linux-futex=no` for the posix/ target.
>   - Also done with `_LIBGOMP_CHECKING_` set for assertions.
> - Cross compiler & regtest on arm (qemu).
> - TSAN testing done on a tree combining all my other upstream patches.
>
> libgomp/ChangeLog:
>
>       PR libgomp/119588
>       * libgomp.h (struct gomp_team): Add fields for communication
>       at team start time.
>       * team.c (gomp_thread_start): Initialise thread-local data using
>       members stored on team.
>       (gomp_team_start): Primary thread stores data on team structure
>       instead of copying that data to each secondary thread's
>       gomp_thread structure.
>
> Signed-off-by: Matthew Malcomson <[email protected]>
> ---
>  libgomp/libgomp.h |  8 ++++++++
>  libgomp/team.c    | 37 ++++++++++++++++++++++++++-----------
>  2 files changed, 34 insertions(+), 11 deletions(-)

I've actually made pretty much this same change for the GCN
configuration of libgomp, resulting in a semantic conflict with this
patch (I've yet to upstream my change, though).

It may make sense to extract the common thread initialization (including
from this cache) into a function so that it can be used in other
libgomp configurations.
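
Concretely, I'd imagine something along these lines (the name and exact
signature are mine, just for illustration; the body simply collects the
stores this patch adds to gomp_thread_start):

  /* Re-initialise the calling thread's state from the cache that the
     primary thread left on TEAM.  */
  static inline void
  gomp_thread_reinit_from_team (struct gomp_thread *thr,
                                struct gomp_team *team)
  {
    thr->ts.work_share = &team->work_shares[0];
    thr->ts.last_work_share = NULL;
  #ifdef HAVE_SYNC_BUILTINS
    thr->ts.single_count = 0;
  #endif
    thr->ts.static_trip = 0;
    thr->task = &team->implicit_task[thr->ts.team_id];
    gomp_init_task (thr->task, team->task_parent_init,
                    &team->task_icv_init);
    thr->task->taskgroup = team->task_taskgroup;
  }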

> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 4269a3f4e60..b188567ca28 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -866,6 +866,14 @@ struct gomp_team
>    /* Number of tasks waiting for their completion event to be fulfilled.  */
>    unsigned int task_detach_count;
>  
> +  /* Cache to help initialise each secondary thread's `task` when re-using a
> +     team very many times.  This allows each secondary thread to perform the
> +     re-initialisation instead of having the primary thread need to perform
> +     that initialisation.  */
> +  struct gomp_task_icv task_icv_init;
> +  struct gomp_task *task_parent_init;
> +  struct gomp_taskgroup *task_taskgroup;
> +
>    /* This array contains structures for implicit tasks.  */
>    struct gomp_task implicit_task[];
>  };
> diff --git a/libgomp/team.c b/libgomp/team.c
> index 30aefcd22b6..0553b8d83a3 100644
> --- a/libgomp/team.c
> +++ b/libgomp/team.c
> @@ -158,6 +158,23 @@ gomp_thread_start (void *xdata)
>         if (!gomp_barrier_can_hold (&team->barrier) || thr->fn == NULL)
>           gomp_simple_barrier_wait (&pool->threads_dock);
>  
> +       /* Early exit because if `thr->fn == NULL` then we can't guarantee
> +          the `thr->ts.team` hasn't been freed by the primary thread racing
> +          ahead.  Hence we don't want to write to it.  */
> +       if (thr->fn == NULL)
> +         break;
> +       team = thr->ts.team;
> +       thr->ts.work_share = &team->work_shares[0];
> +       thr->ts.last_work_share = NULL;
> +#ifdef HAVE_SYNC_BUILTINS
> +       thr->ts.single_count = 0;
> +#endif
> +       thr->ts.static_trip = 0;
> +       thr->task = &team->implicit_task[thr->ts.team_id];
> +       gomp_init_task (thr->task, team->task_parent_init,
> +                       &team->task_icv_init);
> +       thr->task->taskgroup = team->task_taskgroup;
> +
>         local_fn = thr->fn;
>         local_data = thr->data;
>         thr->fn = NULL;
> @@ -437,6 +454,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>    thr->task->taskgroup = taskgroup;
>    team->implicit_task[0].icv.nthreads_var = nthreads_var;
>    team->implicit_task[0].icv.bind_var = bind_var;
> +  /* Copy entirely rather than copying the pointer so that we have an
> +     immutable copy for the secondary threads to take from for the
> +     time-frame during which said secondary threads are executing.
> +     Can't copy from `team->implicit_task[0].icv` because the primary
> +     thread could have adjusted its per-task ICV by the time the
> +     secondary thread gets around to initialising things.  */
> +  team->task_icv_init = team->implicit_task[0].icv;
> +  team->task_parent_init = task;
> +  team->task_taskgroup = taskgroup;
>  
>    if (nthreads == 1)
>      return;
> @@ -725,26 +751,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>         else
>           nthr = pool->threads[i];
>         __atomic_store_n (&nthr->ts.team, team, MEMMODEL_RELEASE);
> -       nthr->ts.work_share = &team->work_shares[0];
> -       nthr->ts.last_work_share = NULL;
>         nthr->ts.team_id = i;
>         nthr->ts.level = team->prev_ts.level + 1;
>         nthr->ts.active_level = thr->ts.active_level;
>         nthr->ts.place_partition_off = place_partition_off;
>         nthr->ts.place_partition_len = place_partition_len;
>         nthr->ts.def_allocator = thr->ts.def_allocator;
> -#ifdef HAVE_SYNC_BUILTINS
> -       nthr->ts.single_count = 0;
> -#endif
> -       nthr->ts.static_trip = 0;
>         nthr->num_teams = thr->num_teams;
>         nthr->team_num = thr->team_num;
> -       nthr->task = &team->implicit_task[i];
>         nthr->place = place;
> -       gomp_init_task (nthr->task, task, icv);
> -       team->implicit_task[i].icv.nthreads_var = nthreads_var;
> -       team->implicit_task[i].icv.bind_var = bind_var;
> -       nthr->task->taskgroup = taskgroup;
>         nthr->fn = fn;
>         nthr->data = data;
>         team->ordered_release[i] = &nthr->release;

-- 
Arsen Arsenović
