On 03/22/2018 04:59 AM, Cesar Philippidis wrote:
On 03/21/2018 10:10 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn,
have been extended to take a barrier ID and a thread count. The idea
here is to assign one barrier for each logical vector. Worker-single
synchronization is controlled by barrier 0. Therefore, the vector
barrier ID is set to tid.y+1 (because there's one vector unit per
worker) in nvptx_init_oacc_workers and placed into a register stored in
cfun->machine->sync_bar. If no workers are present, then the barrier ID
falls back to 0.
I compiled a worker loop before and after the patch series, and observed
this change:
...
@@ -70,7 +71,7 @@
$L2:
// joining 2;
$L5:
- bar.sync 1;
+ bar.sync 0;
// join 2;
ret;
}
...
AFAICT from your explanation above, that change is intentional.
Changing the code generation scheme for workers is fine, but obviously
that should be a minimal, separate patch that we can bisect back to.
That sounds reasonable. I'll apply this patch to og7 once testing has
completed. While all of the functionality it introduces is unnecessary
In other words, the patch is not minimal.
Thanks,
- Tom
without the vector length changes, at least it can be applied independently.
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b7e3f59..16d846e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3936,13 +3936,13 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
return empty;
}
-/* Emit a CTA-level synchronization barrier. We use different
- markers for before and after synchronizations. */
+/* Emit a CTA-level synchronization barrier (bar.sync). LOCK is the
+ barrier number, which is an integer or a register. */
static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock)
{
- return gen_nvptx_barsync (GEN_INT (after));
+ return gen_nvptx_barsync (lock);
}
#if WORKAROUND_PTXJIT_BUG
@@ -4192,6 +4192,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
/* Includes worker mode, do spill & fill. By construction
we should never have worker mode only. */
broadcast_data_t data;
+ rtx barrier = GEN_INT (0);
data.base = oacc_bcast_sym;
data.ptr = 0;
@@ -4204,14 +4205,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
false),
before);
/* Barrier so other workers can see the write. */
- emit_insn_before (nvptx_cta_sync (false), tail);
+ emit_insn_before (nvptx_cta_sync (barrier), tail);
data.offset = 0;
emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
false), tail);
/* This barrier is needed to avoid worker zero clobbering
the broadcast buffer before all the other workers have
had a chance to read this instance of it. */
- emit_insn_before (nvptx_cta_sync (true), tail);
+ emit_insn_before (nvptx_cta_sync (barrier), tail);
}
extract_insn (tail);
@@ -4328,12 +4329,13 @@ nvptx_process_pars (parallel *par)
bool empty = nvptx_shared_propagate (true, is_call,
par->forked_block, par->fork_insn,
false);
+ rtx barrier = GEN_INT (0);
if (!empty || !is_call)
{
/* Insert begin and end synchronizations. */
- emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
- emit_insn_before (nvptx_cta_sync (true), par->join_insn);
+ emit_insn_before (nvptx_cta_sync (barrier), par->forked_insn);
+ emit_insn_before (nvptx_cta_sync (barrier), par->join_insn);
}
}
else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))