> -----Original Message-----
> From: [email protected] <[email protected]>
> Sent: 26 November 2025 18:42
> To: [email protected]; Jakub Jelinek <[email protected]>; Tobias
> Burnus <[email protected]>
> Cc: Julian Brown <[email protected]>; Thomas Schwinge
> <[email protected]>; Andrew Stubbs <[email protected]>; Tom de
> Vries <[email protected]>; Sebastian Huber <sebastian.huber@embedded-
> brains.de>; Matthew Malcomson <[email protected]>
> Subject: [PATCH 5/5] libgomp: Move thread task re-initialisation into
> threads
>
>
> From: Matthew Malcomson <[email protected]>
>
> This patch is a separate, independently beneficial step of the work I'm
> doing towards PR119588.  In that bug I describe how some internal
> workloads are hitting performance problems with the creation of many
> parallel regions one after another.
>
> I'm attempting to reduce the overhead of parallel region creation when
> using a large number of threads.  The two patches earlier in this
> patchset rework the barrier implementation into something that performs
> much better in this particular case.
>
> This patch is built on top of those, but with an appropriate rebase it
> could be applied independently.  It addresses a different area where
> I've seen a performance impact -- the area that I mentioned in the mail
> below to the GCC mailing list:
> https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html
>
> The fundamental point is that, for LLVM, the cost of teardown & setup
> of a parallel region is very close to the cost of a single barrier.
> For us there is a significant chunk of extra work in the actual
> teardown and setup.
>
> When re-using a team with no changes in the affinity ICV etc., it would
> seem that the ideal case is similar to LLVM's (parallel region creation
> being about 16% slower than a barrier for LLVM, as opposed to the 4.3x
> slower for 144 threads that I'm seeing at the moment for GNU).
> This patch gets nowhere near that ideal.  In this patch I observe that
> the majority of the performance overhead in this case comes from
> re-initialising the existing threads.  While all existing threads are
> waiting on the primary thread, that primary thread is iterating over
> each secondary thread's data structures and setting them as needed for
> the next parallel region.
>
> There are a number of stores that could relatively easily be performed
> in each secondary thread after it is released by the primary thread --
> removing this work from the serialised region.  This patch implements
> that much simpler setup.  This re-initialisation scales with the number
> of threads -- which is why it has such an impact on the problematic
> case I have been looking at.
>
> The only subtlety here is that, in order to ensure memory safety, we
> need to avoid performing this initialisation when `thr->fn == NULL`.
> If a secondary non-nested thread gets "left to exit" when a new team is
> started, the primary thread could race ahead, cache the new team, and
> free the team that this secondary thread was using, all before the
> secondary thread goes through the new initialisation code.  That could
> lead to a use-after-free.  Hence we have to exit early when this thread
> is not going to perform any more work.
> - N.b. When running TSAN before and after, I realised that this change
>   happens to remove a race between `gomp_finish_task` in the secondary
>   thread and `gomp_init_task` in the primary thread (when re-using a
>   team, and hence `nthr->task` is in the same place as the thread's
>   local `task` variable).
>
> All threads that are not going to be used in the next region would
> unnecessarily re-initialise their `team` data structure etc., but since
> `thr->fn` is NULL they exit the do/while loop before that, so it makes
> little difference, and the work is spread across threads so it
> shouldn't block other useful work.
>
> I imagine there is a good chunk of performance to be gained by adding
> more logic here:
> 1) Something recording whether the affinity setup has changed since the
>    last non-nested team, and if not, removing the calculation of places
>    and the assignments to the `nthr->ts.place_partition_*` variables.
> 2) I could move more members that are assigned multiple times onto
>    `team` for each secondary thread to take:
>    - `data` pointer.
>    - `num_teams`.
>    - `team_num`.
>    - Honestly I should have looked into this in the patch -- I noticed
>      it while writing this cover letter and will look into whether it
>      is feasible relatively soon.
> 3) If we can identify that we're re-using a team from the last parallel
>    region (and the affinity ICV has not changed) it seems that we could
>    avoid re-initialising some of its fields:
>    - ordered_release[i] should already point to `nthr`?
>    - ts.team_id should already be set.
> Overall, if we can identify that we're using the same team as was
> cached and we don't need to change anything, we should be able to get
> away with drastically less work in the serial part of the call to
> GOMP_parallel.
>
> I did run the micro-benchmark I'm looking at (mentioned in PR119588)
> on some hacked-up versions of the above, and they seemed to provide
> measurable gains.  I haven't tried to put them all in this patch due
> to lack of time and the relatively higher complexity of the
> bookkeeping.  I'm hoping this is a straightforward step in the right
> direction.
>
> -------
> Testing done (note -- much of the testing was done on top of my other
> patches rather than with this patch by itself):
> - Bootstrap & regtest on x86_64 and aarch64.
> - Testsuite also run with OMP_WAIT_POLICY=PASSIVE.
> - Also done with `--enable-linux-futex=no` for the posix/ target.
> - Also done with `_LIBGOMP_CHECKING_` set for assertions.
> - Cross compiler & regtest on arm (qemu).
> - TSAN done combining all my other patches upstream.
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702030.html
Thanks,
Prathamesh
>
> libgomp/ChangeLog:
>
> 	PR libgomp/119588
> 	* libgomp.h (struct gomp_team): Add fields for communication
> 	at team start time.
> 	* team.c (gomp_thread_start): Initialise thread-local data
> 	using members stored on the team.
> 	(gomp_team_start): Primary thread stores data on the team
> 	structure instead of copying that data to each secondary
> 	thread's gomp_thread structure.
>
> Signed-off-by: Matthew Malcomson <[email protected]>
> ---
> libgomp/libgomp.h | 8 ++++++++
> libgomp/team.c | 37 ++++++++++++++++++++++++++-----------
> 2 files changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 4269a3f4e60..b188567ca28 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -866,6 +866,14 @@ struct gomp_team
>    /* Number of tasks waiting for their completion event to be fulfilled.  */
>    unsigned int task_detach_count;
>
> +  /* Cache to help initialise each secondary thread's `task` when re-using
> +     a team very many times.  This allows each secondary thread to perform
> +     the re-initialisation instead of having the primary thread need to
> +     perform that initialisation.  */
> +  struct gomp_task_icv task_icv_init;
> +  struct gomp_task *task_parent_init;
> +  struct gomp_taskgroup *task_taskgroup;
> +
> /* This array contains structures for implicit tasks. */
> struct gomp_task implicit_task[];
> };
> diff --git a/libgomp/team.c b/libgomp/team.c
> index 30aefcd22b6..0553b8d83a3 100644
> --- a/libgomp/team.c
> +++ b/libgomp/team.c
> @@ -158,6 +158,23 @@ gomp_thread_start (void *xdata)
>  	  if (!gomp_barrier_can_hold (&team->barrier) || thr->fn == NULL)
>  	    gomp_simple_barrier_wait (&pool->threads_dock);
>
> +	  /* Early exit because if `thr->fn == NULL` then we can't guarantee
> +	     the `thr->ts.team` hasn't been freed by the primary thread
> +	     racing ahead.  Hence we don't want to write to it.  */
> +	  if (thr->fn == NULL)
> +	    break;
> +	  team = thr->ts.team;
> +	  thr->ts.work_share = &team->work_shares[0];
> +	  thr->ts.last_work_share = NULL;
> +#ifdef HAVE_SYNC_BUILTINS
> +	  thr->ts.single_count = 0;
> +#endif
> +	  thr->ts.static_trip = 0;
> +	  thr->task = &team->implicit_task[thr->ts.team_id];
> +	  gomp_init_task (thr->task, team->task_parent_init,
> +			  &team->task_icv_init);
> +	  thr->task->taskgroup = team->task_taskgroup;
> +
>  	  local_fn = thr->fn;
>  	  local_data = thr->data;
>  	  thr->fn = NULL;
> @@ -437,6 +454,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>    thr->task->taskgroup = taskgroup;
>    team->implicit_task[0].icv.nthreads_var = nthreads_var;
>    team->implicit_task[0].icv.bind_var = bind_var;
> +  /* Copy entirely rather than copy pointer so that we have an immutable
> +     copy for the secondary threads to take from for the time-frame that
> +     said secondary threads are executing on.  Can't copy from
> +     `team->implicit_task[0].icv` because primary thread could have
> +     adjusted its per-task ICV by the time the secondary thread gets
> +     around to initialising things.  */
> +  team->task_icv_init = team->implicit_task[0].icv;
> +  team->task_parent_init = task;
> +  team->task_taskgroup = taskgroup;
>
>    if (nthreads == 1)
>      return;
> @@ -725,26 +751,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>        else
>  	nthr = pool->threads[i];
>        __atomic_store_n (&nthr->ts.team, team, MEMMODEL_RELEASE);
> -      nthr->ts.work_share = &team->work_shares[0];
> -      nthr->ts.last_work_share = NULL;
>        nthr->ts.team_id = i;
>        nthr->ts.level = team->prev_ts.level + 1;
>        nthr->ts.active_level = thr->ts.active_level;
>        nthr->ts.place_partition_off = place_partition_off;
>        nthr->ts.place_partition_len = place_partition_len;
>        nthr->ts.def_allocator = thr->ts.def_allocator;
> -#ifdef HAVE_SYNC_BUILTINS
> -      nthr->ts.single_count = 0;
> -#endif
> -      nthr->ts.static_trip = 0;
>        nthr->num_teams = thr->num_teams;
>        nthr->team_num = thr->team_num;
> -      nthr->task = &team->implicit_task[i];
>        nthr->place = place;
> -      gomp_init_task (nthr->task, task, icv);
> -      team->implicit_task[i].icv.nthreads_var = nthreads_var;
> -      team->implicit_task[i].icv.bind_var = bind_var;
> -      nthr->task->taskgroup = taskgroup;
>        nthr->fn = fn;
>        nthr->data = data;
>        team->ordered_release[i] = &nthr->release;
> --
> 2.43.0