Hello,

I asked about this part way through a large comment on PR119588, but I
doubt it was very obvious, so I am asking the question in its own email
in order to highlight it.  (For reference, the long comment includes
some timings from a very hacky patch I made to identify where the
performance opportunities are:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588#c2).

Background is that we've seen some significant performance benefits of
clang over GCC when performing many small pieces of work with a large
thread count.  In a program which has both some work that can be
handled with high parallelism (so OMP is running with many threads) and
a large number of small pieces of work that need to be performed with
low overhead, the cost of creating multiple `parallel` regions
back-to-back has been seen to accumulate into a significant overhead.
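
A minimal sketch of the pattern described above, for anyone who wants
something concrete to time.  This is not the program the numbers in the
PR came from; the function names `run_small_regions` and
`seconds_per_region` are made up for illustration, and the "work" is
deliberately trivial so that region start-up and tear-down dominate.

```c
/* Build with e.g. `gcc -O2 -fopenmp`.  Without -fopenmp the pragma is
   ignored and every region runs on a single thread.  */
#include <assert.h>
#include <time.h>

/* Hypothetical helper: run `regions` back-to-back `parallel` regions,
   each doing one trivial piece of work per thread.  Returns the total
   number of pieces of work performed.  */
long run_small_regions(long regions)
{
    long total = 0;
    for (long r = 0; r < regions; ++r) {
        /* Every iteration enters and leaves a `parallel` region, so the
           runtime's per-region setup cost is paid `regions` times.  */
        #pragma omp parallel reduction(+:total)
        {
            total += 1;
        }
    }
    return total;
}

/* Hypothetical helper: average wall-clock seconds per region.  */
double seconds_per_region(long regions)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    long pieces = run_small_regions(regions);
    timespec_get(&t1, TIME_UTC);

    /* At least one thread ran in each region.  */
    assert(pieces >= regions);

    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                     + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return elapsed / (double)regions;
}
```

Calling `seconds_per_region` with a large count (say 100000) from
`main` and printing the result with different `OMP_NUM_THREADS`
settings should give a rough per-region cost to compare between
runtimes.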

------------------------------
I originally thought that the majority of the performance overhead could
be attributed to a less-optimised barrier.  While working on optimising
our barrier implementation I found that I had overestimated its impact:
it only accounts for about half of the performance loss that we see vs
LLVM.

One of the remaining contributions to our overhead in this case is the
re-initialisation of a team even when we're re-using a cached team
structure.  In `gomp_team_start`, in the non-nested case, there is a
loop over each existing thread initialising it according to the number
of teams.  I'm hoping that when we're re-using a team and its threads
some of this re-initialisation is unnecessary -- it's precisely the case
where we re-use a team without changing the affinity or omp_proc_bind
that we've been seeing this difference w.r.t. clang.

One of the main contributions to this overhead is the
`gomp_init_task` function.  I am by no means confident (not yet having
fully understood the task.c code) but I have the impression that maybe
this initialisation to clear values is not necessary for a team that is
being re-used?  Alternatively, if it is necessary then maybe I could
make each thread clear its own task data structure in
`gomp_finish_task` so that the initialisation doesn't need to all be
done in the primary thread?
Anything else I can remove from the `for (; i < n; ++i)` loop in
`gomp_team_start` would be great too -- so would appreciate any
pointers towards things that we don't need to initialise for a thread
that is getting re-used.
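
To illustrate the two options above, here is a toy model, not libgomp
code.  `struct toy_task`, `struct toy_team` and all the `toy_*`
functions are hypothetical stand-ins, and the model assumes that
"initialise" just means zeroing the structure; it does not capture
anything else the real `gomp_init_task` may need to set up.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_THREADS 64

/* Hypothetical per-thread task state; NOT libgomp's struct gomp_task.  */
struct toy_task {
    int in_use;
    long scratch[8];
};

/* Hypothetical cached team holding one task slot per thread.  */
struct toy_team {
    int nthreads;
    struct toy_task tasks[TOY_MAX_THREADS];
};

/* Option A: at region start the primary thread walks every slot and
   clears it, so the cost grows with the thread count and is serial.  */
void toy_team_start_primary_clears(struct toy_team *team)
{
    for (int i = 0; i < team->nthreads; ++i)
        memset(&team->tasks[i], 0, sizeof team->tasks[i]);
}

/* Option B: each thread clears only its own slot when it finishes.  */
void toy_finish_task_self_clear(struct toy_task *task)
{
    memset(task, 0, sizeof *task);
}

/* Simulate one region under option B.  Each loop iteration stands in
   for one thread: it dirties its own slot, then cleans up after itself.  */
void toy_run_region(struct toy_team *team)
{
    for (int i = 0; i < team->nthreads; ++i) {
        struct toy_task *task = &team->tasks[i];
        task->in_use = 1;
        task->scratch[0] = i + 1;
        toy_finish_task_self_clear(task);
    }
}

/* Returns 1 if every slot is already in its cleared state.  */
int toy_team_is_clean(const struct toy_team *team)
{
    for (int i = 0; i < team->nthreads; ++i) {
        if (team->tasks[i].in_use != 0)
            return 0;
        for (int j = 0; j < 8; ++j)
            if (team->tasks[i].scratch[j] != 0)
                return 0;
    }
    return 1;
}
```

In this model a team that has been through `toy_run_region` is already
clean when it is re-used, so the next region start could skip the
loop in `toy_team_start_primary_clears`.  Whether the same holds for the
real task structure is exactly the open question.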

What do people think about the feasibility of looking into this?  I'm
looking to ensure I don't waste my time attempting to optimise this if
people know this is not a sensible approach.

Thanks,
Matthew
