Ping x2.

On 2022/10/17 10:29 PM, Chung-Lin Tang wrote:
> Ping.
>
> On 2022/9/21 3:45 PM, Chung-Lin Tang via Gcc-patches wrote:
>> Hi Tom,
>> I had a patch submitted earlier, where I reported that the current way
>> of implementing barriers in libgomp on nvptx created a quite significant
>> performance drop on some SPEChpc2021 benchmarks:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
>>
>> That previous patch wasn't accepted well (admittedly, it was kind of a
>> hack). So in this patch, I tried to (mostly) re-implement team-barriers
>> for NVPTX.
>>
>> Basically, instead of trying to have the GPU do CPU-with-OS-like things
>> that it isn't suited for, barriers are implemented simplistically with
>> bar.* synchronization instructions.
>> Tasks are processed after threads have joined, and only if
>> team->task_count != 0
>>
>> (arguably, there might be a little bit of performance forfeited where
>> earlier arriving threads could've been used to process tasks ahead of
>> other threads. But that again falls into requiring implementing complex
>> futex-wait/wake like behavior. Really, that kind of tasking is not what
>> target offloading is usually used for)
>>
>> Implementation highlight notes:
>> 1. gomp_team_barrier_wake() is now an empty function (threads never
>>    "wake" in the usual manner)
>> 2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
>> 3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
>>
>> 4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
>>    The main synchronization is done using a 'bar.red' instruction. This
>>    reduces across all threads the condition (team->task_count != 0), to
>>    enable the task processing down below if any thread created a task.
>>    (this bar.red usage required the need of the second GCC patch in
>>    this series)
>>
>> This patch has been tested on x86_64/powerpc64le with nvptx offloading,
>> using libgomp, ovo, omptests, and sollve_vv testsuites, all without
>> regressions. Also verified that the SPEChpc 2021 521.miniswp_t and
>> 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle
>> has been restored to devel/omp/gcc-11 (OG11) branch levels. Is this okay
>> for trunk?
>>
>> (also suggest backporting to GCC12 branch, if performance regression can
>> be considered a defect)
>>
>> Thanks,
>> Chung-Lin
>>
>> libgomp/ChangeLog:
>>
>> 2022-09-21 Chung-Lin Tang <clt...@codesourcery.com>
>>
>> 	* config/nvptx/bar.c (generation_to_barrier): Remove.
>> 	(futex_wait,futex_wake,do_spin,do_wait): Remove.
>> 	(GOMP_WAIT_H): Remove.
>> 	(#include "../linux/bar.c"): Remove.
>> 	(gomp_barrier_wait_end): New function.
>> 	(gomp_barrier_wait): Likewise.
>> 	(gomp_barrier_wait_last): Likewise.
>> 	(gomp_team_barrier_wait_end): Likewise.
>> 	(gomp_team_barrier_wait): Likewise.
>> 	(gomp_team_barrier_wait_final): Likewise.
>> 	(gomp_team_barrier_wait_cancel_end): Likewise.
>> 	(gomp_team_barrier_wait_cancel): Likewise.
>> 	(gomp_team_barrier_cancel): Likewise.
>> 	* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
>> 	prototype, add new static inline function.
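
For anyone skimming the thread, the idea in note 4 (join all threads at one barrier, OR-reduce the "are there tasks?" condition, and only then decide whether to process tasks) can be illustrated with a small host-side sketch. This is not the code from the patch and it does not use PTX: it simulates the reduce-at-barrier behavior with pthreads. The names red_barrier, red_barrier_wait_or, worker_arg and run_team are hypothetical, and the per-thread task_count field is a simplified stand-in for team->task_count.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_THREADS 16

/* Hypothetical host-side stand-in for a barrier with an OR reduction.
   It only mimics the "everyone arrives, everyone gets the OR of all
   predicates" behavior described for bar.red; it is not PTX.  */
typedef struct
{
  pthread_mutex_t lock;
  pthread_cond_t cond;
  unsigned nthreads;    /* number of participating threads */
  unsigned arrived;     /* arrivals in the current generation */
  unsigned generation;  /* bumped each time the barrier completes */
  bool acc;             /* OR of predicates seen so far */
  bool result;          /* OR result of the last completed generation */
} red_barrier;

static void
red_barrier_init (red_barrier *b, unsigned nthreads)
{
  pthread_mutex_init (&b->lock, NULL);
  pthread_cond_init (&b->cond, NULL);
  b->nthreads = nthreads;
  b->arrived = 0;
  b->generation = 0;
  b->acc = false;
  b->result = false;
}

/* Block until all threads have arrived; return the OR of every PRED.  */
static bool
red_barrier_wait_or (red_barrier *b, bool pred)
{
  pthread_mutex_lock (&b->lock);
  unsigned gen = b->generation;
  b->acc = b->acc || pred;
  if (++b->arrived == b->nthreads)
    {
      /* Last arrival publishes the result and releases the others.  */
      b->result = b->acc;
      b->acc = false;
      b->arrived = 0;
      b->generation++;
      pthread_cond_broadcast (&b->cond);
    }
  else
    while (gen == b->generation)
      pthread_cond_wait (&b->cond, &b->lock);
  bool result = b->result;
  pthread_mutex_unlock (&b->lock);
  return result;
}

typedef struct
{
  red_barrier *bar;
  int task_count;   /* simplified stand-in for team->task_count */
  bool ran_tasks;   /* set if this thread would process tasks */
} worker_arg;

static void *
worker (void *p)
{
  worker_arg *w = p;
  /* Tasks are only looked at after all threads have joined, and only
     if some thread reported task_count != 0.  */
  w->ran_tasks = red_barrier_wait_or (w->bar, w->task_count != 0);
  return NULL;
}

/* Run NTHREADS workers; thread CREATOR (or none if negative) reports a
   pending task.  Returns how many threads went on to task processing.  */
static int
run_team (int nthreads, int creator)
{
  assert (nthreads > 0 && nthreads <= MAX_THREADS);
  red_barrier bar;
  pthread_t tid[MAX_THREADS];
  worker_arg args[MAX_THREADS];

  red_barrier_init (&bar, (unsigned) nthreads);
  for (int i = 0; i < nthreads; i++)
    {
      args[i].bar = &bar;
      args[i].task_count = (i == creator) ? 1 : 0;
      args[i].ran_tasks = false;
      int err = pthread_create (&tid[i], NULL, worker, &args[i]);
      assert (err == 0);
      (void) err;
    }

  int count = 0;
  for (int i = 0; i < nthreads; i++)
    {
      pthread_join (tid[i], NULL);
      count += args[i].ran_tasks ? 1 : 0;
    }
  pthread_cond_destroy (&bar.cond);
  pthread_mutex_destroy (&bar.lock);
  return count;
}
```

With this sketch, run_team (4, 2) returns 4 (one thread has a task, so all four proceed to task processing after the barrier), while run_team (4, -1) returns 0 (no thread has a task, so none does). Build with -pthread.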