This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading.  Let me try to explain the idea behind it...

In place of SIMD vectorization, NVPTX uses SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the same
instruction, with some threads possibly masked off when under a divergent
branch.  So we map OpenMP threads to such thread groups ("warps"), and
hardware threads are then mapped to OpenMP SIMD lanes.

We need to reach the heads of SIMD regions with all hardware threads active,
because there's no way to "resurrect" them once they are masked off: they need
to follow the same control flow and reach the SIMD region entry with the same
local state (registers, and the stack too for OpenACC).

The OpenACC approach is, outside of "vector" loops, to 1) make threads 1-31
"slaves" that just follow branches without doing any computation -- which
requires extra jumps and broadcasting of branch predicates -- and 2) broadcast
register state and stack state from the master to the slaves when entering
"vector" regions.

I'm taking a different approach.  I want to execute all instructions in all
warp members, while ensuring that the effect (on global and local state) is
the same as if a single thread were executing each instruction.  Most
instructions satisfy that automatically: if threads start with the same state,
then executing an arithmetic instruction, a normal memory load/store, etc.
keeps local state the same in all threads.

The two exceptional instruction categories are atomics and calls.  For calls,
we can recursively demand that they uphold this execution model, until we
reach the runtime-provided "syscalls": malloc/free/vprintf.  Those we can
handle like atomics.

To handle atomics, we
  1) execute the atomic conditionally, in one warp member only -- so its side
  effect happens once;
  2) copy the register that was set from that warp member to the others -- so
  local state is kept synchronized:

    atom.op dest, ...

becomes

    /* pred = (current_lane == 0);  */
    @pred atom.op dest, ...
    shuffle.idx dest, dest, /*srclane=*/0

So the overhead is one shuffle insn following each atomic, plus predicate
setup in the prologue.
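To make the transformation concrete, here is a host-side sketch in plain C
(not the emitted PTX; names and the 4-lane warp size are illustrative only).
The lane loops stand in for lockstep SIMT execution, the `lane == 0` test
plays the role of the @pred guard, and the final loop mimics shuffle.idx
broadcasting from source lane 0:

```c
#define WARP_SIZE 4  /* real warps have 32 lanes; shrunk for illustration */

/* Model of "@pred atom.add dest, [counter], 1; shuffle.idx dest, dest, 0":
   only the master lane (lane 0) performs the atomic fetch-and-add, so the
   side effect on *COUNTER happens exactly once; the fetched value is then
   broadcast so every lane's local copy of DEST stays identical.  */
static void
warp_atomic_fetch_add (int *counter, int dest[WARP_SIZE])
{
  int master_val = 0;

  for (int lane = 0; lane < WARP_SIZE; lane++)
    if (lane == 0)  /* @pred: predicated execution in the master lane only */
      {
        master_val = *counter;
        *counter += 1;  /* the single, warp-wide side effect */
      }

  for (int lane = 0; lane < WARP_SIZE; lane++)
    dest[lane] = master_val;  /* shuffle.idx from srclane 0 */
}
```

After the call, all lanes hold the same fetched value, and the counter has
advanced by one, as if a single thread had executed the atomic.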

OK, so the above handles execution outside of SIMD regions nicely, but we also
need to run code inside SIMD regions, where this synchronizing effect must be
turned off.  It turns out we can keep atomics decorated almost as before:

    @pred atom.op dest, ...
    shuffle.idx dest, dest, master_lane

and compute 'pred' and 'master_lane' accordingly: outside of SIMD regions we
need (master_lane == 0 && pred == (current_lane == 0)), and inside we need
(master_lane == current_lane && pred == true) (so that the shuffle is a no-op
and the predicate is 'true' for all lanes).  Then pred = (current_lane ==
master_lane) works in both cases, and we just need to set up master_lane
accordingly: master_lane = current_lane & mask, where mask is all-zeros
outside of SIMD regions, and all-ones inside.  To store these per-warp masks,
I've introduced another shared memory array, __nvptx_uni.
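The per-thread computation above can be sketched in C like this (a host-side
model of what the prologue emitted by nvptx_init_unisimt_predicate computes;
the function name is made up for illustration):

```c
/* Compute the master lane index and predicate for one hardware thread.
   MASK is this warp's entry of __nvptx_uni: all-zeros outside of SIMD
   regions, all-ones inside.  */
static void
unisimt_predicate (unsigned lane, unsigned mask,
                   unsigned *master_lane, int *pred)
{
  *master_lane = lane & mask;           /* and.b32 %rMM, %rMM, %ustmp0 */
  *pred = (lane == *master_lane);       /* setp.eq.u32 %rPP, %rMM, %ustmp0 */
}
```

With mask == 0, only lane 0 gets a true predicate and every shuffle reads from
lane 0; with mask == ~0, every lane is its own master, so the predicate is
uniformly true and the shuffle is a no-op.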

        * config/nvptx/nvptx.c (need_unisimt_decl): New variable.  Set it...
        (nvptx_init_unisimt_predicate): ...here (new function) and use it...
        (nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
        (nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
        (nvptx_get_unisimt_master): New helper function.
        (nvptx_get_unisimt_predicate): Ditto.
        (nvptx_call_insn_is_syscall_p): Ditto.
        (nvptx_unisimt_handle_set): Ditto.
        (nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
        (nvptx_get_axis_predicate): New helper function, factored out from...
        (nvptx_single): ...here.
        (nvptx_reorg): Call nvptx_reorg_uniform_simt.
        * config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
        __nvptx_unisimt__ when -muniform-simt option is active.
        (struct machine_function): Add unisimt_master, unisimt_predicate
        rtx fields.
        * config/nvptx/nvptx.md (divergent): New attribute.
        (atomic_compare_and_swap<mode>_1): Mark as divergent.
        (atomic_exchange<mode>): Ditto.
        (atomic_fetch_add<mode>): Ditto.
        (atomic_fetch_addsf): Ditto.
        (atomic_fetch_<logic><mode>): Ditto.
        * config/nvptx/nvptx.opt (muniform-simt): New option.
        * doc/invoke.texi (-muniform-simt): Document.
---
 gcc/config/nvptx/nvptx.c   | 138 ++++++++++++++++++++++++++++++++++++++++++---
 gcc/config/nvptx/nvptx.h   |   4 ++
 gcc/config/nvptx/nvptx.md  |  18 ++++--
 gcc/config/nvptx/nvptx.opt |   4 ++
 gcc/doc/invoke.texi        |  14 +++++
 5 files changed, 165 insertions(+), 13 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2dad3e2..9209b47 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -117,6 +117,9 @@ static GTY(()) rtx worker_red_sym;
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 
+/* True if any function references __nvptx_uni.  */
+static bool need_unisimt_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -599,6 +602,33 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize predicate and master lane index registers for
+   -muniform-simt code generation variant.  */
+
+static void
+nvptx_init_unisimt_predicate (FILE *file)
+{
+  int bits = BITS_PER_WORD;
+  int master = REGNO (cfun->machine->unisimt_master);
+  int pred = REGNO (cfun->machine->unisimt_predicate);
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
+  fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
+  fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
+  fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
+          bits == 64 ? ".wide" : "");
+  fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
+  fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
+  fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
+  fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
+  /* rNN = tid.x & __nvptx_uni[tid.y];  */
+  fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
+  fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
+  fprintf (file, "\t}\n");
+  need_unisimt_decl = true;
+}
+
 /* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
 
    extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
@@ -811,6 +841,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->axis_predicate[1])
     nvptx_init_axis_predicate (file,
                               REGNO (cfun->machine->axis_predicate[1]), "x");
+  if (cfun->machine->unisimt_predicate)
+    nvptx_init_unisimt_predicate (file);
 }
 
 /* Output a return instruction.  Also copy the return value to its outgoing
@@ -2394,6 +2426,86 @@ nvptx_reorg_subreg (void)
     }
 }
 
+/* Return a SImode "master lane index" register for uniform-simt, allocating on
+   first use.  */
+
+static rtx
+nvptx_get_unisimt_master ()
+{
+  rtx &master = cfun->machine->unisimt_master;
+  return master ? master : master = gen_reg_rtx (SImode);
+}
+
+/* Return a BImode "predicate" register for uniform-simt, similar to above.  */
+
+static rtx
+nvptx_get_unisimt_predicate ()
+{
+  rtx &pred = cfun->machine->unisimt_predicate;
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
+/* Return true if given call insn references one of the functions provided by
+   the CUDA runtime: malloc, free, vprintf.  */
+
+static bool
+nvptx_call_insn_is_syscall_p (rtx_insn *insn)
+{
+  rtx pat = PATTERN (insn);
+  gcc_checking_assert (GET_CODE (pat) == PARALLEL);
+  pat = XVECEXP (pat, 0, 0);
+  if (GET_CODE (pat) == SET)
+    pat = SET_SRC (pat);
+  gcc_checking_assert (GET_CODE (pat) == CALL
+                      && GET_CODE (XEXP (pat, 0)) == MEM);
+  rtx addr = XEXP (XEXP (pat, 0), 0);
+  if (GET_CODE (addr) != SYMBOL_REF)
+    return false;
+  const char *name = XSTR (addr, 0);
+  return (!strcmp (name, "vprintf")
+         || !strcmp (name, "__nvptx_real_malloc")
+         || !strcmp (name, "__nvptx_real_free"));
+}
+
+/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
+   propagate its value from lane MASTER to current lane.  */
+
+static void
+nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
+{
+  rtx reg;
+  if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
+    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+}
+
+/* Adjust code for uniform-simt code generation variant by making atomics and
+   "syscalls" conditionally executed, and inserting shuffle-based propagation
+   for registers being set.  */
+
+static void
+nvptx_reorg_uniform_simt ()
+{
+  rtx_insn *insn, *next;
+
+  for (insn = get_insns (); insn; insn = next)
+    {
+      next = NEXT_INSN (insn);
+      if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
+         && !(NONJUMP_INSN_P (insn)
+              && GET_CODE (PATTERN (insn)) == PARALLEL
+              && get_attr_divergent (insn)))
+       continue;
+      rtx pat = PATTERN (insn);
+      rtx master = nvptx_get_unisimt_master ();
+      for (int i = 0; i < XVECLEN (pat, 0); i++)
+       nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      rtx pred = nvptx_get_unisimt_predicate ();
+      pred = gen_rtx_NE (BImode, pred, const0_rtx);
+      pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
+      validate_change (insn, &PATTERN (insn), pat, false);
+    }
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -2872,6 +2984,15 @@ nvptx_wsync (bool after)
   return gen_nvptx_barsync (GEN_INT (after));
 }
 
+/* Return a BImode "axis predicate" register, allocating on first use.  */
+
+static rtx
+nvptx_get_axis_predicate (int axis)
+{
+  rtx &pred = cfun->machine->axis_predicate[axis];
+  return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
 /* Single neutering according to MASK.  FROM is the incoming block and
    TO is the outgoing block.  These may be the same block. Insert at
    start of FROM:
@@ -2956,14 +3077,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     if (GOMP_DIM_MASK (mode) & skip_mask)
       {
        rtx_code_label *label = gen_label_rtx ();
-       rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
-
-       if (!pred)
-         {
-           pred = gen_reg_rtx (BImode);
-           cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
-         }
-       
+       rtx pred = nvptx_get_axis_predicate (mode - GOMP_DIM_WORKER);
        rtx br;
        if (mode == GOMP_DIM_VECTOR)
          br = gen_br_true (pred, label);
@@ -3202,6 +3316,9 @@ nvptx_reorg (void)
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
+  if (TARGET_UNIFORM_SIMT)
+    nvptx_reorg_uniform_simt ();
+
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -3379,6 +3496,11 @@ nvptx_file_end (void)
       fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n;",
               BITS_PER_WORD);
     }
+  if (need_unisimt_decl)
+    {
+      fprintf (asm_out_file, "// BEGIN GLOBAL VAR DECL: __nvptx_uni\n");
+      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n;");
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index db8e201..1c605df 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -33,6 +33,8 @@
       builtin_define ("__nvptx__");            \
       if (TARGET_SOFT_STACK)                   \
         builtin_define ("__nvptx_softstack__");        \
+      if (TARGET_UNIFORM_SIMT)                 \
+        builtin_define ("__nvptx_unisimt__");  \
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -234,6 +236,8 @@ struct GTY(()) machine_function
   int ret_reg_mode; /* machine_mode not defined yet. */
   int punning_buffer_size;
   rtx axis_predicate[2];
+  rtx unisimt_master;
+  rtx unisimt_predicate;
 };
 #endif
 
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 5ce7a89..f0fc02c 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -75,6 +75,9 @@ (define_c_enum "unspecv" [
 (define_attr "subregs_ok" "false,true"
   (const_string "false"))
 
+(define_attr "divergent" "false,true"
+  (const_string "false"))
+
 (define_predicate "nvptx_register_operand"
   (match_code "reg,subreg")
 {
@@ -1519,7 +1522,8 @@ (define_insn "atomic_compare_and_swap<mode>_1"
    (set (match_dup 1)
        (unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
   ""
-  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
+  "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_exchange<mode>"
   [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")   ;; output
@@ -1530,7 +1534,8 @@ (define_insn "atomic_exchange<mode>"
    (set (match_dup 1)
        (match_operand:SDIM 2 "nvptx_register_operand" "R"))]   ;; input
   ""
-  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
+  "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_add<mode>"
   [(set (match_operand:SDIM 1 "memory_operand" "+m")
@@ -1542,7 +1547,8 @@ (define_insn "atomic_fetch_add<mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
        (match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "atomic_fetch_addsf"
   [(set (match_operand:SF 1 "memory_operand" "+m")
@@ -1554,7 +1560,8 @@ (define_insn "atomic_fetch_addsf"
    (set (match_operand:SF 0 "nvptx_register_operand" "=R")
        (match_dup 1))]
   ""
-  "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+  "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_code_iterator any_logic [and ior xor])
 (define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@@ -1570,7 +1577,8 @@ (define_insn "atomic_fetch_<logic><mode>"
    (set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
        (match_dup 1))]
   "0"
-  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+  "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
+  [(set_attr "divergent" "true")])
 
 (define_insn "nvptx_barsync"
   [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 7ab09b9..47e811e 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@ Link in code for a __main kernel.
 msoft-stack
 Target Report Mask(SOFT_STACK)
 Use custom stacks instead of local memory for automatic storage.
+
+muniform-simt
+Target Report Mask(UNIFORM_SIMT)
+Generate code that executes all threads in a warp as if one was active.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 6e45fb6..46cd2e9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18942,6 +18942,20 @@ in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y}
 as the stack pointer.  This is for placing automatic variables into storage
 that can be accessed from other threads, or modified with atomic instructions.
 
+@item -muniform-simt
+@opindex muniform-simt
+Switch to code generation variant that allows to execute all threads in each
+warp, while maintaining memory state and side effects as if only one thread
+in each warp was active outside of OpenMP SIMD regions.  All atomic operations
+and calls to runtime (malloc, free, vprintf) are conditionally executed (iff
+current lane index equals the master lane index), and the register being
+assigned is copied via a shuffle instruction from the master lane.  Outside of
+SIMD regions lane 0 is the master; inside, each thread sees itself as the
+master.  Shared memory array @code{int __nvptx_uni[]} stores all-zeros or
+all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
+regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
+with current lane index to compute the master lane index.
+
 @end table
 
 @node PDP-11 Options
