https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85381
--- Comment #7 from Tom de Vries <vries at gcc dot gnu.org> ---
For this example:
...
#define n 1024

int
main (void)
{
#pragma acc parallel vector_length(128)
  {
#pragma acc loop vector
    for (int i = 0; i < n; i++)
      ;

#pragma acc loop vector
    for (int i = 0; i < n; i++)
      ;
  }

  return 0;
}
...
we currently generate:
...
.entry main$_omp_fn$0
{
  .reg.u64 %r24;
  .reg.u64 %r25;
  .reg.u64 %r26;
  .reg.u64 %r27;
  .reg.pred %r28;
  {
    .reg.u32 %x;
    mov.u32 %x, %tid.x;
    setp.ne.u32 %r28, %x, 0;
  }
  bar.sync 0;
  @%r28 bra $L2;
  // join 4;
  // fork 4;
$L2:
  bar.sync 0;
  ret;
}
...
so if we fix the branch-around-nothing problem here, we'll get back-to-back
bar.syncs again, and may run into the JIT bug again.

We may want to insert dummy ops in between (it would be nice if something less
heavy than a membar.cta would work).
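
To make the "back-to-back bar.syncs" condition concrete, here is a minimal sketch of a check over emitted PTX lines. It is not GCC code: the helper names emits_nothing and count_back_to_back_bar_syncs are hypothetical, and it assumes one PTX statement per line with no trailing whitespace. Comments and bare labels are treated as emitting nothing, so two bar.syncs separated only by those count as adjacent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not GCC code): true if LINE emits no instruction,
   i.e. it is blank, a "//" comment, or a bare label such as "$L2:".
   Assumes no trailing whitespace.  */
static bool
emits_nothing (const char *line)
{
  line += strspn (line, " \t");
  size_t len = strlen (line);
  if (len == 0 || strncmp (line, "//", 2) == 0)
    return true;
  return line[len - 1] == ':';
}

/* Count bar.sync instructions that directly follow another bar.sync,
   skipping lines that emit nothing in between.  */
static int
count_back_to_back_bar_syncs (const char *const *lines, size_t n)
{
  bool prev_was_bar = false;
  int count = 0;

  for (size_t i = 0; i < n; i++)
    {
      const char *line = lines[i] + strspn (lines[i], " \t");
      if (emits_nothing (line))
        continue;

      bool is_bar = strncmp (line, "bar.sync", 8) == 0;
      if (is_bar && prev_was_bar)
        count++;
      prev_was_bar = is_bar;
    }

  return count;
}
```

On the body generated today the conditional bra sits between the two barriers, so the count is 0. With the bra removed and nothing else changed, the same sequence would count 1, which is the case where a dummy op would have to be inserted.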