Hi, Testing libgomp with SVE enabled (-mcpu=generic+sve2), results in ~60 UNRESOLVED errors with following error message:
lto1: fatal error: degree of 'poly_int' exceeds 'NUM_POLY_INT_COEFFS' compilation terminated. nvptx mkoffload: fatal error: ../../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status compilation terminated. This behaviour can be reproduced with the following simple test-case with -fopenmp -foffload=nvptx-none -mcpu=generic+sve2: #define N 1000 int main () { int i; int A[N] = {0}, B[N] = {0}; #pragma omp target map(i), map(tofrom: A), map(from: B) #pragma omp simd for (i = 0; i < N; i++) A[i] = A[i] + B[i]; return A[0]; } omplower pass lowers the above loop to the following: D.4576 = .GOMP_USE_SIMT (); if (D.4576 != 0) goto <D.4577>; else goto <D.4578>; <D.4577>: { unsigned int D.4586; unsigned int D.4587; int D.4588; void * simduid.5; void * .omp_simt.6; int D.4596; _Bool D.4597; int D.4598; unsigned int D.4599; int D.4600; int D.4601; int * D.4602; int i [value-expr: D.4588]; int i.0; simduid.5 = .GOMP_SIMT_ENTER (simduid.5, &D.4588); .omp_simt.6 = .GOMP_SIMT_ENTER_ALLOC (simduid.5); D.4587 = 0; i.0 = 0; #pragma omp simd safelen(32) _simduid_(simduid.5) _simt_ linear(i.0:1) linear(i:1) for (i.0 = 0; i.0 < 1000; i.0 = i.0 + 1) ... } goto <D.4579>; <D.4578>: { unsigned int D.4603; unsigned int D.4604; int D.4605[0:POLY_INT_CST [15, 16]]; void * simduid.7; unsigned int D.4612; int * D.4613; int D.4614; int i [value-expr: D.4605[D.4604]]; int i.0; D.4604 = 0; i.0 = 0; #pragma omp simd safelen(POLY_INT_CST [16, 16]) _simduid_(simduid.7) linear(i.0:1) linear(i:1) ... } <D.4579>: ... For offloading to SIMT based device like nvptx, scan_omp_simd duplicates lowering of simd pragma into if-else where the if-part contains simt code-path, and else-part contains simd code-path. In lower_rec_simd_input_clauses, max_vf is set to 16+16x for the above case as determined by omp_max_vf, and that becomes length of the omp simd array: int D.4605[0:POLY_INT_CST [15, 16]]; The issue here is that, the function containing above if-else condition gets streamed out to LTO bytecode including the simd code-path and the omp simd array, whose domain is [0:POLY_INT_CST[15, 16]], and thus we get the above error while streaming-in POLY_INT_CST in lto_input_ts_poly_tree_pointers on device side. Note that, the simd code-path is essentially dead-code on nvptx, since .GOMP_USE_SIMT() resolves to 1 during omp_device_lower pass, and later optimization passes (ccp2) remove the dead-code path and unused omp simd arrays while compiling to device. So in this case, we aren't really mapping POLY_INT_CST from host to device, but it gets streamed out to device as an artefact of omp simd lowering. I suppose a proper fix here would be to (somehow) defer lowering of omp pragma simd after streaming out to device, so the device only sees simt code-path, and the host only sees simd code path ? Or perhaps clone each function in offload region, one for host and one for SIMT device, and only stream the device versions to avoid streaming out host-specific IR changes ? I thought of following approaches as workarounds: [1] Set sctx.max_vf to constant_lower_bound(omp_max_vf ()) in lower_rec_simd_input_clauses, if the function is going to be offloaded and omp_max_vf returns non-constant poly_int. For above case, it sets max_vf to 16 instead of 16+16x which seems to resolve the issue, but it'd use suboptimal max VF for host ? This is done in patch p-283-2.txt. However, with clean trunk it still seems to use max_vf = 16 after disabling the above error. vect dump shows: (compute_affine_dependence ref_a: (*_25)[i.0_51], stmt_a: _26 = (*_25)[i.0_51]; ref_b: (*_23)[i.0_51], stmt_b: (*_23)[i.0_51] = _27; ) -> dependence analysis failed foo.c:10:13: note: dependence distance = 0. foo.c:10:13: note: dependence distance == 0 between (*_23)[i.0_51] and (*_23)[i.0_51] foo.c:10:13: missed: bad data dependence. foo.c:10:13: note: ***** Analysis failed with vector mode VNx4SI This seems to happen because, loop->safelen is set to 16 by taking MIN(constant_lower_bound(16+16x), INT_MAX) in expand_omp_simd: if (!poly_int_tree_p (safelen, &val)) safelen_int = 0; else safelen_int = MIN (constant_lower_bound (val), INT_MAX); and fails to vectorize with VLA vectors, because max_vf == 16 and min_vf == 4+4x resulting in bad data dependence due to: if (max_vf != MAX_VECTORIZATION_FACTOR && maybe_lt (max_vf, min_vf)) return opt_result::failure_at (vect_location, "bad data dependence.\n"); If safelen was (somehow) set to 16+16x, I guess it could have used VF=4+4x and vectorized with VLA vectors. but I suppose that's a separate issue ? [2] Since the issue seems to be only with streaming out length of omp simd array when it's POLY_INT_CST, could we perhaps use a place holder length during omp lowering and compute the correct length after streaming out, so POLY_INT_CST doesn't get leaked into bytecode ? The attached patch p-283-3.txt follows this approach by using bogus length INT_MAX in lower_rec_simd_input_clauses if offloading to SIMT device and max_vf is non-constant poly_int, and later computing the correct length in beginning of vect pass by setting it to omp_max_vf (), but I am not sure if this is entirely correct. I am assuming that creating omp simd array of bogus length will not be an issue for nvptx since it will never get referenced and eventually be removed by remove_unused_locals ? If it'd not be a good idea to rely on the pass pipeline to eliminate simd code-path and omp simd array while compiling to device, it could be possibly done during omp_lower_device pass itself ? [3] While streaming-in POLY_INT_CST, avoid emitting error immediately if degree of POLY_INT_CST exceeds accel's NUM_POLY_INT_COEFFS to ignore POLY_INT_CSTs that may potentially occur on dead-code path, and instead mark it as error_mark_node. For the above case, since POLY_INT_CST appears on dead-code path, streaming POLY_INT_CST with higher degree than accel's NUM_POLY_INT_COEFFS would be "harmless". And detect invalid POLY_INT_CST's in expand pass (if it survives till this point), and emit above error, but not sure if that'd be the right place ? This is done in p-283-4.txt. All the three patches fix UNRESOLVED tests due to POLY_INT_CST streaming error in libgomp testsuite with -mcpu=generic+sve2. (Altho it introduces a strange FAIL for data-5.f90, which I am investigating). I would be grateful for suggestions on how to proceed. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> Thanks, Prathamesh
Set max_vf to constant lower bound if max_vf is poly_int and offloading to device. gcc/ * omp-low.cc (lower_rec_simd_input_clauses): Set max_vf to constant_lower_bound (omp_max_vf()) if offloading is enabled and max_vf is POLY_INT_CST. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index da2051b0279..780ea396b7f 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -4588,7 +4588,20 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, { if (known_eq (sctx->max_vf, 0U)) { - sctx->max_vf = sctx->is_simt ? omp_max_simt_vf () : omp_max_vf (); + if (sctx->is_simt) + sctx->max_vf = omp_max_simt_vf (); + else + { + poly_uint64 max_vf = omp_max_vf (); + /* FIXME: If the function is going to be offloaded, + and max_vf is poly_int, use constant_lower_bound as safelen, + to avoid streaming out omp simd arrays having poly_int_cst + size. */ + if (omp_maybe_offloaded_ctx (ctx) + && !max_vf.is_constant ()) + max_vf = constant_lower_bound (max_vf); + sctx->max_vf = max_vf; + } if (maybe_gt (sctx->max_vf, 1U)) { tree c = omp_find_clause (gimple_omp_for_clauses (ctx->stmt),
Use a bogus length for omp simd array and fix it up later when offloading to SIMT device. gcc/ChangeLog: * omp-low.cc (lower_rec_simd_input_clauses): Create a bogus length INT_MAX for omp simd array if max_vf is POLY_INT_CST and offloading to SIMT based device. * tree-vectorizer.cc: Include omp-general.h. (fixup_omp_simd_array_len): New function. (pass_vectorize::execute): Call fixup_omp_simd_array if function is offloaded. Signed-off-by: Prathamesh Kulkarni <prathame...@nvidia.com> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc index da2051b0279..1a8bf0b215c 100644 --- a/gcc/omp-low.cc +++ b/gcc/omp-low.cc @@ -4658,7 +4658,21 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, } else { - tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf); + /* FIXME: When offloading to SIMT based device, choose a bogus length + for omp simd array to avoid streaming out max_vf if it's + POLY_INT_CST. Later in vect pass, the length is set to max_vf + in fixup_omp_simd_array_len. + + Creating a omp simd array with bogus len seems fine on SIMT device + since simd code-path will be dead-code on device, and the array + will never actually be referenced. */ + poly_uint64 nelts = sctx->max_vf; + if (omp_maybe_offloaded_ctx (ctx) + && !nelts.is_constant () + && omp_max_simt_vf () > 0) + nelts = INT_MAX; + + tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts); tree avar = create_tmp_var_raw (atype); if (TREE_ADDRESSABLE (new_var)) TREE_ADDRESSABLE (avar) = 1; diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc index d4ab47349a3..0635d22074b 100644 --- a/gcc/tree-vectorizer.cc +++ b/gcc/tree-vectorizer.cc @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3. If not see #include "internal-fn.h" #include "tree-ssa-sccvn.h" #include "tree-into-ssa.h" +#include "omp-general.h" /* Loop or bb location, with hotness information. */ dump_user_location_t vect_location; @@ -457,6 +458,33 @@ shrink_simd_arrays delete simd_array_to_simduid_htab; } + +/* Compute correct length for omp simd array. */ + +static void +fixup_omp_simd_array_len (function *fun) +{ + /* Look for omp simd arrays whose length is set to bogus INT_MAX value + during omp lowering, and set it to max_vf. */ + + poly_uint64 max_vf = omp_max_vf (); + if (max_vf.is_constant ()) + return; + + for (auto decl: fun->local_decls) + if (TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE + && lookup_attribute ("omp simd array", DECL_ATTRIBUTES (decl))) + { + tree& max = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (decl))); + if (TREE_CODE (max) == INTEGER_CST + && wi::eq_p (wi::to_widest (max), INT_MAX - 1)) + { + max = size_int (max_vf - 1); + relayout_decl (decl); + } + } +} + /* Initialize the vec_info with kind KIND_IN and target cost data TARGET_COST_DATA_IN. */ @@ -1247,7 +1275,11 @@ pass_vectorize::execute (function *fun) vect_slp_init (); if (fun->has_simduid_loops) - note_simd_array_uses (&simd_array_to_simduid_htab, fun); + { + if (offloading_function_p (fun->decl)) + fixup_omp_simd_array_len (fun); + note_simd_array_uses (&simd_array_to_simduid_htab, fun); + } /* ----------- Analyze loops. ----------- */
Defer emitting error for streaming POLY_INT_CST if it's degree exceeds accel's NUM_POLY_INT_COEFFS. gcc/ChangeLog: * tree-streamer-in.cc (lto_input_ts_poly_tree_pointers): Avoid emitting fatal_error and instead mark TREE_TYPE (expr) as error_mark_node. * cfgexpand.cc (expand_debug_expr): If exp is POLY_INT_CST and marked as error operand, emit fatal error. diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index 6c1096363af..1d29b36b0ac 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -4595,6 +4595,12 @@ expand_debug_expr (tree exp) return op0; case POLY_INT_CST: +#ifdef ACCEL_COMPILER + if (error_operand_p (exp)) + fatal_error (input_location, + "degree of %<poly_int%> exceeds " + "%<NUM_POLY_INT_COEFFS%>"); +#endif return immed_wide_int_const (poly_int_cst_value (exp), mode); case COMPLEX_CST: diff --git a/gcc/tree-streamer-in.cc b/gcc/tree-streamer-in.cc index 329d218e7d4..7f450a9e91e 100644 --- a/gcc/tree-streamer-in.cc +++ b/gcc/tree-streamer-in.cc @@ -708,10 +708,11 @@ lto_input_ts_poly_tree_pointers (class lto_input_block *ib, for (; i < num_poly_int_coeffs; i++) { tree val = stream_read_tree_ref (ib, data_in); + /* FIXME: Defer emitting error immediately if degree of poly_int + exceeds accel's NUM_POLY_INT_COEFFS to ignore POLY_INT_CST's + that occur on dead-code path. */ if (!integer_zerop (val)) - fatal_error (input_location, - "degree of %<poly_int%> exceeds " - "%<NUM_POLY_INT_COEFFS%>"); + TREE_TYPE (expr) = error_mark_node; } } }