[RFC] Enabling SVE with offloading to nvptx

Prathamesh Kulkarni via Gcc Tue, 15 Oct 2024 08:37:13 -0700

Hi,
Testing libgomp with SVE enabled (-mcpu=generic+sve2), results in ~60 
UNRESOLVED errors with following error message:


lto1: fatal error: degree of 'poly_int' exceeds 'NUM_POLY_INT_COEFFS'
compilation terminated.
nvptx mkoffload: fatal error: 
../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 
exit status
compilation terminated. 

This behaviour can be reproduced with the following simple test-case with 
-fopenmp -foffload=nvptx-none -mcpu=generic+sve2:

#define N 1000
int main ()
{
  int i;
  int A[N] = {0}, B[N] = {0};

  #pragma omp target map(i), map(tofrom: A), map(from: B)
  #pragma omp simd
  for (i = 0; i < N; i++)
    A[i] = A[i] + B[i];
  return A[0];
}

omplower pass lowers the above loop to the following:

                D.4576 = .GOMP_USE_SIMT ();
                if (D.4576 != 0) goto <D.4577>; else goto <D.4578>;
                <D.4577>:
                {
                  unsigned int D.4586;
                  unsigned int D.4587;
                  int D.4588;
                  void * simduid.5;
                  void * .omp_simt.6;
                  int D.4596;
                  _Bool D.4597;
                  int D.4598;
                  unsigned int D.4599;
                  int D.4600;
                  int D.4601;
                  int * D.4602;
                  int i [value-expr: D.4588];
                  int i.0;

                  simduid.5 = .GOMP_SIMT_ENTER (simduid.5, &D.4588);
                  .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC (simduid.5);
                  D.4587 = 0;
                  i.0 = 0;
                  #pragma omp simd safelen(32) _simduid_(simduid.5) _simt_ 
linear(i.0:1) linear(i:1)
                  for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1)
                  ...
                }
                goto <D.4579>;
                <D.4578>:
                {
                  unsigned int D.4603;
                  unsigned int D.4604;
                  int D.4605[0:POLY_INT_CST [15, 16]];
                  void * simduid.7;
                  unsigned int D.4612;
                  int * D.4613;
                  int D.4614;
                  int i [value-expr: D.4605[D.4604]];
                  int i.0;

                  D.4604 = 0;
                  i.0 = 0;
                  #pragma omp simd safelen(POLY_INT_CST [16, 16]) 
_simduid_(simduid.7) linear(i.0:1) linear(i:1)
                  ...
                 }
                 <D.4579>:
                 ...

For offloading to SIMT based device like nvptx, scan_omp_simd duplicates 
lowering of simd pragma into if-else where the if-part contains simt code-path,
and else-part contains simd code-path. In lower_rec_simd_input_clauses, max_vf 
is set to 16+16x for the above case as determined by omp_max_vf,
and that becomes length of the omp simd array:
int D.4605[0:POLY_INT_CST [15, 16]];

The issue here is that, the function containing above if-else condition gets 
streamed out to LTO bytecode including the simd code-path and the omp simd 
array, 
whose domain is [0:POLY_INT_CST[15, 16]], and thus we get the above error while 
streaming-in POLY_INT_CST in lto_input_ts_poly_tree_pointers on device side.

Note that, the simd code-path is essentially dead-code on nvptx, since 
.GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and later 
optimization passes (ccp2)
remove the dead-code path and unused omp simd arrays while compiling to device. 
So in this case, we aren't really mapping POLY_INT_CST from host to device,
but it gets streamed out to device as an artefact of omp simd lowering.

I suppose a proper fix here would be to (somehow) defer lowering of omp pragma 
simd after streaming out to device, so the device only sees simt code-path,
and the host only sees simd code path ? Or perhaps clone each function in 
offload region, one for host and one for SIMT device, and only stream the 
device versions
to avoid streaming out host-specific IR changes ?

I thought of following approaches as workarounds:

[1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in 
lower_rec_simd_input_clauses, if the function is going to be offloaded
and omp_max_vf returns non-constant poly_int. For above case, it sets max_vf to 
16 instead of 16+16x which seems to resolve the issue,
but it'd use suboptimal max VF for host ? This is done in patch p-283-2.txt.

However, with clean trunk it still seems to use max_vf = 16 after disabling the 
above error.
vect dump shows:

(compute_affine_dependence
  ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51];
  ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27;
) -> dependence analysis failed
foo.c:10:13: note:   dependence distance  = 0.
foo.c:10:13: note:   dependence distance == 0 between (*_23)[i.0_51] and 
(*_23)[i.0_51]
foo.c:10:13: missed:  bad data dependence.
foo.c:10:13: note:  ***** Analysis failed with vector mode VNx4SI

This seems to happen because, loop->safelen is set to 16 by taking
MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd:

      if (!poly_int_tree_p (safelen, &val))
        safelen_int = 0;
      else
        safelen_int = MIN (constant_lower_bound (val), INT_MAX); 

and fails to vectorize with VLA vectors, because max_vf == 16 and min_vf == 
4+4x resulting in bad data dependence due to:

  if (max_vf != MAX_VECTORIZATION_FACTOR
      && maybe_lt (max_vf, min_vf))
    return opt_result::failure_at (vect_location, "bad data dependence.\n");

If safelen was (somehow) set to 16+16x, I guess it could have used VF=4+4x and 
vectorized with VLA vectors.
but I suppose that's a separate issue ?

[2] Since the issue seems to be only with streaming out length of omp simd 
array when it's POLY_INT_CST, could we perhaps use a place holder length
during omp lowering and compute the correct length after streaming out, so 
POLY_INT_CST doesn't get leaked into bytecode ? The attached patch
p-283-3.txt follows this approach by using bogus length INT_MAX in 
lower_rec_simd_input_clauses if offloading to SIMT device and max_vf is 
non-constant
poly_int, and later computing the correct length in beginning of vect pass by 
setting it to omp_max_vf (), but I am not sure if this is entirely correct.
I am assuming that creating omp simd array of bogus length will not be an issue 
for nvptx since it will never get referenced and eventually be removed
by remove_unused_locals ? If it'd not be a good idea to rely on the pass 
pipeline to eliminate simd code-path and omp simd array while compiling to 
device,
it could be possibly done during omp_lower_device pass itself ?

[3] While streaming-in POLY_INT_CST, avoid emitting error immediately if degree 
of POLY_INT_CST exceeds accel's NUM_POLY_INT_COEFFS to ignore POLY_INT_CSTs that
may potentially occur on dead-code path, and instead mark it as 
error_mark_node. For the above case, since POLY_INT_CST appears on dead-code 
path, streaming
POLY_INT_CST with higher degree than accel's NUM_POLY_INT_COEFFS would be 
"harmless". And detect invalid POLY_INT_CST's in expand pass (if it survives 
till this
point), and emit above error, but not sure if that'd be the right place ?
This is done in p-283-4.txt.

All the three patches fix UNRESOLVED tests due to POLY_INT_CST streaming error 
in libgomp testsuite with -mcpu=generic+sve2.
(Altho it introduces a strange FAIL for data-5.f90, which I am investigating).
I would be grateful for suggestions on how to proceed.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

Thanks,
Prathamesh

Set max_vf to constant lower bound if max_vf is poly_int and offloading to 
device. 

gcc/
        * omp-low.cc (lower_rec_simd_input_clauses): Set max_vf to
        constant_lower_bound (omp_max_vf()) if offloading is enabled and
        max_vf is POLY_INT_CST.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index da2051b0279..780ea396b7f 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4588,7 +4588,20 @@ lower_rec_simd_input_clauses (tree new_var, omp_context 
*ctx,
 {
   if (known_eq (sctx->max_vf, 0U))
     {
-      sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf ();
+      if (sctx->is_simt)
+       sctx->max_vf = omp_max_simt_vf ();
+      else
+       {
+         poly_uint64 max_vf = omp_max_vf ();
+         /* FIXME: If the function is going to be offloaded,
+            and max_vf is poly_int, use constant_lower_bound as safelen,
+            to avoid streaming out omp simd arrays having poly_int_cst
+            size.  */
+         if (omp_maybe_offloaded_ctx (ctx)
+             && !max_vf.is_constant ())
+           max_vf = constant_lower_bound (max_vf);
+         sctx->max_vf = max_vf;
+       }
       if (maybe_gt (sctx->max_vf, 1U))
        {
          tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),

Use a bogus length for omp simd array and fix it up later when offloading to 
SIMT device.

gcc/ChangeLog:

        * omp-low.cc (lower_rec_simd_input_clauses): Create a bogus length
        INT_MAX for omp simd array if max_vf is POLY_INT_CST and offloading to
        SIMT based device.
        * tree-vectorizer.cc: Include omp-general.h.
        (fixup_omp_simd_array_len): New function.
        (pass_vectorize::execute): Call fixup_omp_simd_array if function is
        offloaded.

Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com>

diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index da2051b0279..1a8bf0b215c 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4658,7 +4658,21 @@ lower_rec_simd_input_clauses (tree new_var, omp_context 
*ctx,
     }
   else
     {
-      tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf);
+      /* FIXME: When offloading to SIMT based device, choose a bogus length
+        for omp simd array to avoid streaming out max_vf if it's
+        POLY_INT_CST. Later in vect pass, the length is set to max_vf
+        in fixup_omp_simd_array_len.
+
+        Creating a omp simd array with bogus len seems fine on SIMT device
+        since simd code-path will be dead-code on device, and the array
+        will never actually be referenced.  */
+      poly_uint64 nelts = sctx->max_vf;
+      if (omp_maybe_offloaded_ctx (ctx)
+         && !nelts.is_constant ()
+         && omp_max_simt_vf () > 0)
+       nelts = INT_MAX;
+
+      tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts);
       tree avar = create_tmp_var_raw (atype);
       if (TREE_ADDRESSABLE (new_var))
        TREE_ADDRESSABLE (avar) = 1;
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index d4ab47349a3..0635d22074b 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "internal-fn.h"
 #include "tree-ssa-sccvn.h"
 #include "tree-into-ssa.h"
+#include "omp-general.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -457,6 +458,33 @@ shrink_simd_arrays
 
   delete simd_array_to_simduid_htab;
 }
+
+/* Compute correct length for omp simd array.  */
+
+static void
+fixup_omp_simd_array_len (function *fun)
+{
+  /* Look for omp simd arrays whose length is set to bogus INT_MAX value
+     during omp lowering, and set it to max_vf.  */
+
+  poly_uint64 max_vf = omp_max_vf ();
+  if (max_vf.is_constant ())
+    return;
+
+  for (auto decl: fun->local_decls)
+    if (TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE
+       && lookup_attribute ("omp simd array", DECL_ATTRIBUTES (decl)))
+      {
+       tree& max = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (decl)));
+       if (TREE_CODE (max) == INTEGER_CST
+           && wi::eq_p (wi::to_widest (max), INT_MAX - 1))
+         {
+           max = size_int (max_vf - 1);
+           relayout_decl (decl);
+         }
+      }
+}
+
 
 /* Initialize the vec_info with kind KIND_IN and target cost data
    TARGET_COST_DATA_IN.  */
@@ -1247,7 +1275,11 @@ pass_vectorize::execute (function *fun)
   vect_slp_init ();
 
   if (fun->has_simduid_loops)
-    note_simd_array_uses (&simd_array_to_simduid_htab, fun);
+    {
+      if (offloading_function_p (fun->decl))
+       fixup_omp_simd_array_len (fun);
+      note_simd_array_uses (&simd_array_to_simduid_htab, fun);
+    }
 
   /*  ----------- Analyze loops. -----------  */

Defer emitting error for streaming POLY_INT_CST if it's degree exceeds accel's 
NUM_POLY_INT_COEFFS.

gcc/ChangeLog:

        * tree-streamer-in.cc (lto_input_ts_poly_tree_pointers): Avoid emitting
        fatal_error and instead mark TREE_TYPE (expr) as error_mark_node.
        * cfgexpand.cc (expand_debug_expr): If exp is POLY_INT_CST and marked
        as error operand, emit fatal error.  

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 6c1096363af..1d29b36b0ac 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -4595,6 +4595,12 @@ expand_debug_expr (tree exp)
       return op0;
 
     case POLY_INT_CST:
+#ifdef ACCEL_COMPILER
+      if (error_operand_p (exp))
+       fatal_error (input_location,
+                    "degree of %<poly_int%> exceeds "
+                    "%<NUM_POLY_INT_COEFFS%>");
+#endif
       return immed_wide_int_const (poly_int_cst_value (exp), mode);
 
     case COMPLEX_CST:
diff --git a/gcc/tree-streamer-in.cc b/gcc/tree-streamer-in.cc
index 329d218e7d4..7f450a9e91e 100644
--- a/gcc/tree-streamer-in.cc
+++ b/gcc/tree-streamer-in.cc
@@ -708,10 +708,11 @@ lto_input_ts_poly_tree_pointers (class lto_input_block 
*ib,
       for (; i < num_poly_int_coeffs; i++)
        {
          tree val = stream_read_tree_ref (ib, data_in);
+         /* FIXME: Defer emitting error immediately if degree of poly_int
+            exceeds accel's NUM_POLY_INT_COEFFS to ignore POLY_INT_CST's
+            that occur on dead-code path.  */
          if (!integer_zerop (val))
-           fatal_error (input_location,
-                        "degree of %<poly_int%> exceeds "
-                        "%<NUM_POLY_INT_COEFFS%>");
+           TREE_TYPE (expr) = error_mark_node;
        }
     }
 }

[RFC] Enabling SVE with offloading to nvptx

Reply via email to