Hello,

I asked about this partway through a long comment on PR119588, but I doubt it was very visible there, so I'm asking the question in its own email to highlight it. (For reference, the long comment, which includes some timings from a very hacky patch I made to identify where the performance opportunities are, is at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588#c2 .)

The background is that we've seen significant performance advantages for clang over GCC when performing many small pieces of work with a large thread count. In a program that has both some work that can be handled with high parallelism (so OMP is running with many threads) and a large number of small pieces of work that need to be performed with low overhead, the cost of creating multiple `parallel` regions back-to-back adds up to a significant overhead.

------------------------------

I originally thought that most of the performance overhead could be attributed to a less-optimised barrier. While working on optimising our barrier implementation I found that I had overestimated its impact: it only accounts for about half of the performance loss that we see vs LLVM.

One of the remaining contributions to our overhead in this case is the re-initialisation of a team even when we're re-using a cached team structure. In `gomp_team_start`, in the non-nested case, there is a loop over each existing thread initialising it according to the number of teams. I'm hoping that when we're re-using a team and its threads some of this re-initialisation is unnecessary. It is precisely this case, re-using a team without changing the affinity or omp_proc_bind, where we've been seeing this difference w.r.t. clang.

One of the main contributions to this overhead is the `gomp_init_task` function. I am by no means confident (not yet having fully understood the task.c code), but I have the impression that this initialisation to clear values may not be necessary for a team that is being re-used. Alternatively, if it is necessary, maybe I could make each thread clear its own task data structure in `gomp_finish_task`, so that the initialisation doesn't all need to be done in the primary thread?

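
To make the workload pattern from the background concrete, here is a minimal sketch of the kind of loop I have in mind. It is not taken from the PR or from any real application: the function name `run_small_regions` and its parameters are hypothetical, and the work inside each region is deliberately trivial so that the cost of starting the region dominates.

```c
/* Hypothetical reproducer (illustration only, not from PR119588):
   run `regions` tiny `parallel` regions back to back.  Each region does
   `items` additions, so with a large thread count almost all of the time
   is spent starting and finishing the region rather than in the work.  */
long
run_small_regions (int regions, int items)
{
  long total = 0;

  for (int r = 0; r < regions; ++r)
    {
      long partial = 0;

      /* A new `parallel` region is opened on every iteration of the
	 outer loop.  When built without -fopenmp the pragma is ignored
	 and the loop simply runs serially with the same result.  */
#pragma omp parallel for reduction(+ : partial)
      for (int i = 0; i < items; ++i)
	partial += i;

      total += partial;
    }

  return total;
}
```

Built with `-fopenmp` and called with a large `regions` and a small `items` (for example `run_small_regions (100000, 8)`) under a high `OMP_NUM_THREADS`, the time taken by the call should be a rough proxy for the per-region overhead discussed here. The exact numbers will of course depend on the machine and the runtime.
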
Anything else I can remove from the `for (; i < n; ++i)` loop in `gomp_team_start` would be great too, so I would appreciate any pointers towards things that we don't need to initialise for a thread that is getting re-used.

What do people think about the feasibility of looking into this? I'd like to make sure I don't waste my time attempting to optimise this if people know it is not a sensible approach.

Thanks,
Matthew