From: Matthew Malcomson <[email protected]>

This patch is a separate beneficial step of the work I'm doing towards
PR119588.  In that bug I describe how some internal workloads are
hitting performance problems with the creation of many parallel regions
one after another.
I'm attempting to reduce the overhead of parallel region creation when
using a large number of threads.  I've posted two patches to rework the
barrier implementation to something that performs much better in this
particular case:
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/695005.html
https://gcc.gnu.org/pipermail/gcc-patches/2025-October/698257.html

This is a separate patch to address a different area where I've seen a
performance impact.  It's the area that I mentioned in the below mail to
the GCC mailing list:
https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html

The fundamental point is that for LLVM the cost of teardown & setup of a
parallel region is very close to the cost of a single barrier.  For us
there is a significant chunk of extra work in the actual teardown and
setup.  When re-using a team with no changes in the affinity ICV etc. it
would seem that the ideal case is similar to LLVM's (parallel region
creation being about 16% slower than a barrier for LLVM, as opposed to
the 4.3x slower for 144 threads that I'm seeing at the moment for GNU).
This patch gets nowhere near that ideal.

In this patch I observe that the majority of the performance overhead in
this case comes from re-initialising the existing threads.  While all
existing threads are waiting on the primary thread, that primary thread
is iterating over each secondary thread's data structures and setting
them up as needed for the next parallel region.  There are a number of
stores that could relatively easily be performed in each secondary
thread after it is released by the primary thread -- removing this work
from the serialised region.  This patch proposes that much simpler
setup.  This re-initialisation scales with the number of threads --
which is why it has such an impact on the problematic case I have been
looking at.

The only subtlety here is that in order to ensure memory safety we need
to avoid performing this initialisation when `thr->fn == NULL`.
If a secondary non-nested thread gets "left to exit" when a new team is
started, the primary thread could race ahead, cache the new team, and
free the team that this secondary thread was using, all before the
secondary thread goes through the new initialisation code.  That could
lead to a use-after-free.  Hence we have to early-exit when this thread
is not going to perform any more work.

N.b. when running TSAN before and after, I realised that this change
happens to remove a race between `gomp_finish_task` in the secondary
thread and `gomp_init_task` in the primary thread (when re-using a team,
and hence `nthr->task` is in the same place as the thread's local `task`
variable).

All threads that are not going to be used in the next region do
unnecessarily re-init their `team` data structure etc., but since
`thr->fn` is NULL they exit the do/while loop soon after, so that makes
little difference, and the work is spread across threads so it
shouldn't block other useful work.

I imagine there is a good chunk of performance to be gained by adding
more logic here:

1) Something recording whether the affinity setup has changed since the
   last non-nested team and, if not, removing the calculation of places
   and the assignments to the `nthr->ts.place_partition_*` variables.

2) I could move more members that are assigned multiple times to `team`
   for each secondary thread to take:
   - `data` pointer.
   - `num_teams`.
   - `team_num`.
   Honestly I should have looked into this in the patch -- I noticed it
   while writing this cover letter and will look into whether it is
   feasible relatively soon.

3) If we can identify that we're re-using a team from the last parallel
   region (and the affinity ICV has not changed) it seems that we could
   avoid re-initialising some of its fields:
   - `ordered_release[i]` should already point to `nthr`?
   - `ts.team_id` should already be set.
Overall, if we can identify that we're using the same team as was cached
and we don't need to change anything, we should be able to get away with
drastically less work in the serial part of the call to GOMP_parallel.

I did run the micro-benchmark I'm looking at (mentioned in PR119588) on
some hacked-up versions of the above and they seemed to provide
measurable gains.  I haven't tried to put them all in this patch due to
lack of time and the relatively higher complexity of the bookkeeping.
I'm hoping this is a straight-forward step in the right direction.

-------
Testing done (note -- much of the testing was done on top of my other
patches rather than on this patch by itself):
- Bootstrap & regtest on x86_64 and aarch64.
- Testsuite also ran with OMP_WAIT_POLICY=PASSIVE.
- Also done with `--enable-linux-futex=no` for the posix/ target.
- Also done with `_LIBGOMP_CHECKING_` set for assertions.
- Cross compiler & regtest on arm (qemu).
- TSAN run combining all my other patches upstream.

libgomp/ChangeLog:

	PR libgomp/119588
	* libgomp.h (struct gomp_team): Add fields for communication at
	team start time.
	* team.c (gomp_thread_start): Initialise thread-local data using
	members stored on team.
	(gomp_team_start): Primary thread stores data on team structure
	instead of copying that data to each secondary thread's
	gomp_thread structure.

Signed-off-by: Matthew Malcomson <[email protected]>
---
 libgomp/libgomp.h |  8 ++++++++
 libgomp/team.c    | 36 +++++++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a43398300a5..0dac5871d29 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -857,6 +857,14 @@ struct gomp_team
   /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
+  /* Cache to help initialise each secondary thread's `task` when re-using a
+     team very many times.  This allows each secondary thread to perform the
+     re-initialisation instead of having the primary thread need to perform
+     that initialisation.  */
+  struct gomp_task_icv task_icv_init;
+  struct gomp_task *task_parent_init;
+  struct gomp_taskgroup *task_taskgroup;
+
   /* This array contains structures for implicit tasks.  */
   struct gomp_task implicit_task[];
 };
diff --git a/libgomp/team.c b/libgomp/team.c
index cb1d3235312..893f4f9faea 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -131,6 +131,22 @@ gomp_thread_start (void *xdata)
 	  gomp_finish_task (task);
 
 	  gomp_simple_barrier_wait (&pool->threads_dock);
+	  /* Early exit because if `thr->fn == NULL` then we can't guarantee
+	     the `thr->ts.team` hasn't been freed by the primary thread racing
+	     ahead.  Hence we don't want to write to it.  */
+	  if (thr->fn == NULL)
+	    break;
+	  team = thr->ts.team;
+	  thr->ts.work_share = &team->work_shares[0];
+	  thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+	  thr->ts.single_count = 0;
+#endif
+	  thr->ts.static_trip = 0;
+	  thr->task = &team->implicit_task[thr->ts.team_id];
+	  gomp_init_task (thr->task, team->task_parent_init,
+			  &team->task_icv_init);
+	  thr->task->taskgroup = team->task_taskgroup;
 
 	  local_fn = thr->fn;
 	  local_data = thr->data;
@@ -383,6 +399,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   thr->task->taskgroup = taskgroup;
   team->implicit_task[0].icv.nthreads_var = nthreads_var;
   team->implicit_task[0].icv.bind_var = bind_var;
+  /* Copy entirely rather than copy pointer so that we have an immutable copy
+     for the secondary threads to take from for the time-frame that said
+     secondary threads are executing on.  Can't copy from
+     `team->implicit_task[0].icv` because primary thread could have adjusted
+     its per-task ICV by the time the secondary thread gets around to
+     initialising things.  */
+  team->task_icv_init = team->implicit_task[0].icv;
+  team->task_parent_init = task;
+  team->task_taskgroup = taskgroup;
 
   if (nthreads == 1)
     return;
@@ -641,26 +666,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
       else
	nthr = pool->threads[i];
       nthr->ts.team = team;
-      nthr->ts.work_share = &team->work_shares[0];
-      nthr->ts.last_work_share = NULL;
       nthr->ts.team_id = i;
       nthr->ts.level = team->prev_ts.level + 1;
       nthr->ts.active_level = thr->ts.active_level;
       nthr->ts.place_partition_off = place_partition_off;
       nthr->ts.place_partition_len = place_partition_len;
       nthr->ts.def_allocator = thr->ts.def_allocator;
-#ifdef HAVE_SYNC_BUILTINS
-      nthr->ts.single_count = 0;
-#endif
-      nthr->ts.static_trip = 0;
       nthr->num_teams = thr->num_teams;
       nthr->team_num = thr->team_num;
-      nthr->task = &team->implicit_task[i];
       nthr->place = place;
-      gomp_init_task (nthr->task, task, icv);
-      team->implicit_task[i].icv.nthreads_var = nthreads_var;
-      team->implicit_task[i].icv.bind_var = bind_var;
-      nthr->task->taskgroup = taskgroup;
       nthr->fn = fn;
       nthr->data = data;
       team->ordered_release[i] = &nthr->release;
-- 
2.43.0
