Hi! Matthew Malcomson <[email protected]> writes:
> From: Matthew Malcomson <[email protected]>
>
> This patch is a separate beneficial step of the work I'm doing towards
> PR119588.  In that bug I describe how some internal workloads are
> hitting performance problems w.r.t. the creation of many parallel
> regions one after another.
>
> I'm attempting to reduce the overhead of parallel region creation when
> using a large number of threads.  The two patches earlier in this
> patchset rework the barrier implementation to something that performs
> much better in this particular case.
>
> This is a patch built on top of those, but with an appropriate rebase
> it could be made independent.  It addresses a different area where
> I've seen a performance impact -- the area that I mentioned in the
> below mail to the GCC mailing list:
> https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html
>
> The fundamental point is that for LLVM the cost of teardown & setup of
> a parallel region is very close to the cost of a single barrier.  For
> us there is a significant chunk of extra work in the actual teardown
> and setup.
>
> When re-using a team with no changes in the affinity ICV etc. it would
> seem that the ideal case is similar to LLVM's -- parallel region
> creation being about 16% slower than a barrier for LLVM, as opposed to
> the 4.3x slower for 144 threads that I'm seeing at the moment for GNU.
> This patch gets nowhere near that ideal.  In this patch I observe that
> the majority of the performance overhead in this case comes from
> re-initialising the existing threads.  While all existing threads are
> waiting on the primary thread, that primary thread is iterating over
> each secondary thread's data structures and setting them up as needed
> for the next parallel region.
>
> There are a number of stores that could be relatively easily performed
> in each secondary thread after it is released by the primary thread,
> removing this work from the serialised region.  This patch proposes
> that much simpler setup.
> This re-initialisation scales with the number of threads, which is why
> it has such an impact on the problematic case I have been looking at.
>
> The only subtlety here is that in order to ensure memory safety we
> need to avoid performing this initialisation when `thr->fn == NULL`.
> If a secondary non-nested thread gets "left to exit" when a new team
> is started, the primary thread could race ahead, cache the new team,
> and free the team that this secondary thread was using, all before the
> secondary thread goes through the new initialisation code.  That could
> lead to a use-after-free.  Hence we have to early-exit when this
> thread is not going to perform any more work.
> - N.b. when running TSAN before and after I realised that this change
>   happens to remove a race between `gomp_finish_task` in the secondary
>   thread and `gomp_init_task` in the primary thread (when re-using a
>   team, and hence `nthr->task` is in the same place as the thread's
>   local `task` variable).
>
> All threads that are not going to be used in the next region do
> unnecessarily re-init their `team` data structure etc., but since
> `thr->fn` is NULL they exit the do/while loop early, so that makes
> little difference, and the work is spread across threads so it
> shouldn't block other useful work.
>
> I imagine there is a good chunk of performance to be gained by adding
> more logic here:
> 1) Something recording whether the affinity setup has changed since
>    the last non-nested team, and if not, removing the calculation of
>    places and the assignment to the `nthr->ts.place_partition_*`
>    variables.
> 2) I could move more members that are assigned multiple times onto
>    `team` for each secondary thread to take:
>    - the `data` pointer.
>    - `num_teams`.
>    - `team_num`.
>    - Honestly I should have looked into this in the patch -- I noticed
>      it while writing this cover letter and will look into whether it
>      is feasible relatively soon.
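[Editorial note: the "left to exit" race described above can be sketched as
a rough timeline.  This is my reading of the description, not taken from
the mail; function and field names are from libgomp's team.c.]

```
secondary thread ("left to exit")      primary thread
---------------------------------      --------------
released from threads_dock             released from threads_dock
about to re-init from thr->ts.team     races ahead: starts the next
                                       parallel region, caches the new
                                       team, and frees the old team that
                                       the secondary thread was using
touches thr->ts.team
  -> use-after-free, unless the thread
     first checks thr->fn == NULL and
     breaks out of the loop
```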
> 3) If we can identify that we're re-using a team from the last
>    parallel region (and the affinity ICV has not changed) it seems
>    that we could avoid re-initialising some of its fields:
>    - `ordered_release[i]` should already point to `nthr`?
>    - `ts.team_id` should already be set.
>    Overall, if we can identify that we're using the same team as was
>    cached and we don't need to change anything, we should be able to
>    get away with drastically less work in the serial part of the call
>    to GOMP_parallel.
>
> I did run the micro-benchmark I'm looking at (mentioned in PR119588)
> on some hacked-up versions of the above and they seemed to provide
> measurable gains.  I haven't tried to put them all in this patch due
> to lack of time and the relatively higher complexity of the
> bookkeeping.  I'm hoping this is a straightforward step in the right
> direction.
>
> -------
> Testing done (note -- much of the testing was done on top of my other
> patches rather than on this patch by itself):
> - Bootstrap & regtest on x86_64 and aarch64.
> - Testsuite also run with OMP_WAIT_POLICY=PASSIVE.
> - Also done with `--enable-linux-futex=no` for the posix/ target.
> - Also done with `_LIBGOMP_CHECKING_` set for assertions.
> - Cross-compiler build & regtest on arm (qemu).
> - TSAN run combining all my other patches upstream.
>
> libgomp/ChangeLog:
>
> 	PR libgomp/119588
> 	* libgomp.h (struct gomp_team): Add fields for communication
> 	at team start time.
> 	* team.c (gomp_thread_start): Initialise thread-local data using
> 	members stored on team.
> 	(gomp_team_start): Primary thread stores data on the team
> 	structure instead of copying that data to each secondary
> 	thread's gomp_thread structure.
>
> Signed-off-by: Matthew Malcomson <[email protected]>
> ---
>  libgomp/libgomp.h |  8 ++++++++
>  libgomp/team.c    | 37 ++++++++++++++++++++++++++-----------
>  2 files changed, 34 insertions(+), 11 deletions(-)

I've actually made pretty much this same change for the GCN configuration
of libgomp, resulting in a semantic conflict with this patch (yet to
upstream it, though).  It may make sense to extract the common thread
initialization (including from this cache) into a function so that it
can be used in other libgomp configurations.

> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 4269a3f4e60..b188567ca28 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -866,6 +866,14 @@ struct gomp_team
>    /* Number of tasks waiting for their completion event to be fulfilled.  */
>    unsigned int task_detach_count;
>  
> +  /* Cache to help initialise each secondary thread's `task` when re-using a
> +     team very many times.  This allows each secondary thread to perform the
> +     re-initialisation instead of having the primary thread need to perform
> +     that initialisation.  */
> +  struct gomp_task_icv task_icv_init;
> +  struct gomp_task *task_parent_init;
> +  struct gomp_taskgroup *task_taskgroup;
> +
>    /* This array contains structures for implicit tasks.  */
>    struct gomp_task implicit_task[];
>  };
> diff --git a/libgomp/team.c b/libgomp/team.c
> index 30aefcd22b6..0553b8d83a3 100644
> --- a/libgomp/team.c
> +++ b/libgomp/team.c
> @@ -158,6 +158,23 @@ gomp_thread_start (void *xdata)
>        if (!gomp_barrier_can_hold (&team->barrier) || thr->fn == NULL)
>          gomp_simple_barrier_wait (&pool->threads_dock);
>  
> +      /* Early exit because if `thr->fn == NULL` then we can't guarantee
> +         that `thr->ts.team` hasn't been freed by the primary thread racing
> +         ahead.  Hence we don't want to write to it.  */
> +      if (thr->fn == NULL)
> +        break;
> +      team = thr->ts.team;
> +      thr->ts.work_share = &team->work_shares[0];
> +      thr->ts.last_work_share = NULL;
> +#ifdef HAVE_SYNC_BUILTINS
> +      thr->ts.single_count = 0;
> +#endif
> +      thr->ts.static_trip = 0;
> +      thr->task = &team->implicit_task[thr->ts.team_id];
> +      gomp_init_task (thr->task, team->task_parent_init,
> +                      &team->task_icv_init);
> +      thr->task->taskgroup = team->task_taskgroup;
> +
>        local_fn = thr->fn;
>        local_data = thr->data;
>        thr->fn = NULL;
> @@ -437,6 +454,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>    thr->task->taskgroup = taskgroup;
>    team->implicit_task[0].icv.nthreads_var = nthreads_var;
>    team->implicit_task[0].icv.bind_var = bind_var;
> +  /* Copy entirely rather than copying the pointer so that we have an
> +     immutable copy for the secondary threads to take from for the
> +     time-frame that said secondary threads are executing.  Can't copy from
> +     `team->implicit_task[0].icv` because the primary thread could have
> +     adjusted its per-task ICV by the time the secondary thread gets around
> +     to initialising things.  */
> +  team->task_icv_init = team->implicit_task[0].icv;
> +  team->task_parent_init = task;
> +  team->task_taskgroup = taskgroup;
>  
>    if (nthreads == 1)
>      return;
> @@ -725,26 +751,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
>        else
>          nthr = pool->threads[i];
>        __atomic_store_n (&nthr->ts.team, team, MEMMODEL_RELEASE);
> -      nthr->ts.work_share = &team->work_shares[0];
> -      nthr->ts.last_work_share = NULL;
>        nthr->ts.team_id = i;
>        nthr->ts.level = team->prev_ts.level + 1;
>        nthr->ts.active_level = thr->ts.active_level;
>        nthr->ts.place_partition_off = place_partition_off;
>        nthr->ts.place_partition_len = place_partition_len;
>        nthr->ts.def_allocator = thr->ts.def_allocator;
> -#ifdef HAVE_SYNC_BUILTINS
> -      nthr->ts.single_count = 0;
> -#endif
> -      nthr->ts.static_trip = 0;
>        nthr->num_teams = thr->num_teams;
>        nthr->team_num = thr->team_num;
> -      nthr->task = &team->implicit_task[i];
>        nthr->place = place;
> -      gomp_init_task (nthr->task, task, icv);
> -      team->implicit_task[i].icv.nthreads_var = nthreads_var;
> -      team->implicit_task[i].icv.bind_var = bind_var;
> -      nthr->task->taskgroup = taskgroup;
>        nthr->fn = fn;
>        nthr->data = data;
>        team->ordered_release[i] = &nthr->release;

-- 
Arsen Arsenović
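[Editorial note: the helper extraction suggested in the reply might look
roughly like the sketch below.  `gomp_thread_reinit_from_team` is a
hypothetical name; the field names are taken from the patch.  This is not
compilable on its own, since it depends on libgomp-internal types, and is
not the actual upstream change.]

```
/* Hypothetical helper collecting the per-thread re-initialisation that
   the patch moves into gomp_thread_start, so that other libgomp
   configurations (e.g. GCN) could share it.  Sketch only.  */
static inline void
gomp_thread_reinit_from_team (struct gomp_thread *thr)
{
  struct gomp_team *team = thr->ts.team;
  thr->ts.work_share = &team->work_shares[0];
  thr->ts.last_work_share = NULL;
#ifdef HAVE_SYNC_BUILTINS
  thr->ts.single_count = 0;
#endif
  thr->ts.static_trip = 0;
  thr->task = &team->implicit_task[thr->ts.team_id];
  gomp_init_task (thr->task, team->task_parent_init, &team->task_icv_init);
  thr->task->taskgroup = team->task_taskgroup;
}
```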