> -----Original Message-----
> From: Jakub Jelinek <[email protected]>
> Sent: 04 November 2024 21:44
> To: Prathamesh Kulkarni <[email protected]>
> Cc: Richard Biener <[email protected]>; Richard Biener
> <[email protected]>; [email protected]; Thomas Schwinge
> <[email protected]>
> Subject: Re: [RFC] Enabling SVE with offloading to nvptx
>
> On Sat, Nov 02, 2024 at 03:53:34PM +0000, Prathamesh Kulkarni wrote:
> > The attached patch adds a new bitfield needs_max_vf_lowering to loop,
> > and sets that in expand_omp_simd for loops that need delayed lowering
> > of safelen and omp simd arrays. The patch defines a new macro
> > OMP_COMMON_MAX_VF (arbitrarily set to 16), as a placeholder value for
> > max_vf (instead of INT_MAX), and is later replaced by appropriate
> > max_vf during omp_adjust_max_vf pass. Does that look OK ?
>
> No.
> The thing is, if user doesn't specify safelen, it defaults to infinity
> (which we represent as INT_MAX), if user specifies it, then that is
> the maximum for it (currently in OpenMP specification it is just an
> integral value, so can't be a poly int).
> And then the lowering uses the max_vf as another limit, what the hw
> can do at most and sizes the magic arrays with it. So, one needs to
> use minimum of what user specified and what the hw can handle.
> So using 16 as some magic value is just wrong, safelen(16) can be
> specified in the source as well, or safelen(8), or safelen(32) or
> safelen(123).
>
> Thus, the fact that the hw minimum hasn't been determined yet needs to
> be represented in some other flag, not in loop->safelen value, and
> before that is determined, loop->safelen should then represent what
> the user wrote (or was implied) and the later pass should use minimum
> from loop->safelen and the picked hw maximum. Of course if the picked
> hw maximum is POLY_INT-ish, the big question is how to compare that
> against the user supplied integer value, either one can just handle
> the INT_MAX (aka
> infinity) special case, or say query the backend on what is the
> maximum value of the POLY_INT at runtime and only use the POLY_INT if
> it is always known to be smaller or equal to the user supplied
> safelen.
>
> Another thing (already mentioned in the thread Andrew referenced) is
> that max_vf is used in two separate places. One is just to size of
> the magic arrays and one of the operands of the minimum (the other is
> user specified safelen). In this case, it is generally just fine to
> pick later value than strictly necessary (as long as it is never
> larger than user supplied safelen).
> The other case is simd modifier on schedule clause. That value should
> better be the right one or slightly larger, but not too much.
> I think currently we just use the INTEGER_CST we pick as the maximum,
> if this sizing is deferred, maybe it needs to be another internal
> function that asks the value (though, it can refer to a loop vf in
> another function, which complicates stuff).
>
> Regarding Richi's question, I'm afraid the OpenMP simd loop lowering
> can't be delayed until some later pass.
Hi Jakub,
Thanks for the suggestions! The attached patch makes the following changes:
(1) Delays setting of safelen for offloading by introducing a new bitfield
needs_max_vf_lowering in loop, which is set when offloading is enabled;
safelen is then set to min (safelen, max_vf) for the target later in the
omp_device_lower pass.
Comparing a user-specified safelen with a poly_int max_vf is not always
possible at compile time (say 32 vs. 16+16x), and even if we determined the
runtime VL based on -mcpu flags, I guess relying on that would not be portable.
The patch works around this by comparing constant_lower_bound (max_vf) with
safelen instead (see the sketch after this list), with the downside that
constant_lower_bound (max_vf) will not be the optimal max_vf for an SVE
target that implements a SIMD width > 128 bits.
(2) Since max_vf is used as the length of the omp simd arrays, it gets
streamed out to the device, and device compilation fails during stream-in if
max_vf is a poly_int (16+16x) and the device's NUM_POLY_INT_COEFFS < 2
(which is what motivated my original patch). The patch addresses this by
setting the length to a placeholder value (INT_MAX?) in
lower_rec_simd_input_clauses when offloading is enabled; the correct length
is then set in the omp_device_lower pass.
(3) Andrew's patches seem to already handle adjusting chunk_size for the
schedule clause with the simd modifier by introducing a new internal
function .GOMP_MAX_VF, which is then replaced by the target's max_vf. To
keep this consistent with safelen, the patch here uses
constant_lower_bound (max_vf) too.
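
To illustrate the comparison problem in (1), here is a small standalone
sketch (not GCC code: poly_val, known_le, known_ge and constant_lower_bound
below only model the poly-int.h semantics for a degree-1 value a + b*x with
runtime x >= 0):

#include <cstdint>
#include <cstdio>

/* Models a poly_int-style value a + b*x, where x >= 0 is only known at
   runtime (e.g. the SVE vector-length multiple).  */
struct poly_val { uint64_t a, b; };

/* a + b*x <= c holds for every x >= 0 only if the value is constant.  */
static bool known_le (poly_val p, uint64_t c) { return p.b == 0 && p.a <= c; }

/* a + b*x >= c holds for every x >= 0 iff its minimum (at x == 0) is >= c.  */
static bool known_ge (poly_val p, uint64_t c) { return p.a >= c; }

/* Smallest value the poly can take, i.e. its value at x == 0.  */
static uint64_t constant_lower_bound (poly_val p) { return p.a; }

int main ()
{
  poly_val max_vf = { 16, 16 };   /* SVE-style 16+16x.  */
  uint64_t safelen = 32;          /* user wrote safelen(32).  */

  if (known_le (max_vf, safelen))
    printf ("clamp safelen to max_vf\n");
  else if (known_ge (max_vf, safelen))
    printf ("safelen is already the limit\n");
  else
    /* 16+16x vs 32 is unordered: <= 32 for x < 2 but > 32 for x >= 2,
       so fall back to the lower bound, which can never exceed max_vf.  */
    printf ("use constant_lower_bound (max_vf) = %llu\n",
            (unsigned long long) constant_lower_bound (max_vf));
  return 0;
}

The point is just that, without knowing x, 16 is the only value guaranteed
not to exceed either limit, which is why the patch settles for
constant_lower_bound (max_vf).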
The patch passes libgomp testing for AArch64/nvptx offloading (with and
without GPU).
Does it look OK?
Thanks,
Prathamesh
>
> Jakub
Delay lowering safelen if offloading is enabled.
gcc/ChangeLog:
* cfgloop.h (loop): New bitfield needs_max_vf_lowering.
* omp-expand.cc (expand_omp_simd): Set loop->needs_max_vf_lowering
to result of is_in_offload_region.
* omp-low.cc (lower_rec_simd_input_clauses): Set a placeholder value
for length of omp simd array if offloading is enabled and max_vf is
non-constant poly_int.
* omp-offload.cc (adjust_max_vf): New function.
(execute_omp_device_lower): Call adjust_max_vf, and use
constant_lower_bound on result of omp_max_vf while replacing call
to .GOMP_MAX_VF.
Signed-off-by: Prathamesh Kulkarni <[email protected]>
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 30b5e40d0d9..f7b57e594fd 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
flag_finite_loops or similar pragmas state. */
unsigned finite_p : 1;
+ /* True if a SIMD loop needs delayed lowering of artefacts like
+ safelen and the length of omp simd arrays that depend on the target's
+ max_vf. This is the case for offloading, where max_vf is computed
+ only after streaming out to the device. */
+ unsigned needs_max_vf_lowering : 1;
+
/* The number of times to unroll the loop. 0 means no information given,
just do what we always do. A value of 1 means do not unroll the loop.
A value of USHRT_MAX means unroll with no specific unrolling factor.
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index 80fb1843445..5946713ac3f 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7171,6 +7171,8 @@ expand_omp_simd (struct omp_region *region, struct omp_for_data *fd)
loop->latch = cont_bb;
add_loop (loop, l1_bb->loop_father);
loop->safelen = safelen_int;
+ loop->needs_max_vf_lowering = is_in_offload_region (region);
+
if (simduid)
{
loop->simduid = OMP_CLAUSE__SIMDUID__DECL (simduid);
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index 70a2c108fbc..0f68876a2bc 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4660,7 +4660,16 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
}
else
{
- tree atype = build_array_type_nelts (TREE_TYPE (new_var), sctx->max_vf);
+ poly_uint64 nelts;
+ /* FIXME: If offloading is enabled and max_vf is a poly_int, avoid
+ using it as the length of the omp simd array, and use a placeholder
+ value instead to avoid streaming out a poly_int to the device. The
+ arrays will be set to the correct length later in the
+ omp_device_lower pass. */
+ if (omp_maybe_offloaded_ctx (ctx) && !sctx->max_vf.is_constant ())
+ nelts = INT_MAX;
+ else
+ nelts = sctx->max_vf;
+ tree atype = build_array_type_nelts (TREE_TYPE (new_var), nelts);
tree avar = create_tmp_var_raw (atype);
if (TREE_ADDRESSABLE (new_var))
TREE_ADDRESSABLE (avar) = 1;
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index 372b019f9d6..9d6467cc6a6 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -2618,6 +2618,75 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
return NULL_TREE;
}
+/* Compute max_vf for the target, and accordingly set loop->safelen and
+ the length of the omp simd arrays. */
+
+static void
+adjust_max_vf (function *fun)
+{
+ if (!fun->has_simduid_loops)
+ return;
+
+ poly_uint64 max_vf = omp_max_vf (false);
+
+ /* FIXME: Since loop->safelen has to be an integer, it's not always possible
+ to compare it against a poly_int. For example, 32 and 16+16x are not
+ comparable at compile-time because 16+16x <= 32 for x < 2, but
+ 16+16x > 32 for x >= 2. Even if we could get the runtime VL based on
+ -mcpu/-march, that would not be portable across other SVE archs.
+
+ For now, use constant_lower_bound (max_vf) as a "safer approximation" to
+ max_vf that avoids these issues, with the downside that it will be a
+ suboptimal max_vf for SVE archs implementing a SIMD width > 128 bits. */
+
+ uint64_t max_vf_int;
+ if (!max_vf.is_constant (&max_vf_int))
+ max_vf_int = constant_lower_bound (max_vf);
+
+ calculate_dominance_info (CDI_DOMINATORS);
+ for (auto loop: loops_list (fun, 0))
+ {
+ if (!loop->needs_max_vf_lowering)
+ continue;
+
+ if (loop->safelen > max_vf_int)
+ loop->safelen = max_vf_int;
+
+ basic_block *bbs = get_loop_body (loop);
+ for (unsigned i = 0; i < loop->num_nodes; i++)
+ for (auto gsi = gsi_start_bb (bbs[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+ {
+ gcall *stmt = dyn_cast<gcall *> (gsi_stmt (gsi));
+ tree lhs = NULL_TREE;
+
+ if (stmt
+ && gimple_call_internal_p (stmt)
+ && (gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
+ && ((lhs = gimple_call_lhs (stmt)) != NULL_TREE))
+ {
+ imm_use_iterator use_iter;
+ gimple *use_stmt;
+
+ FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, lhs)
+ {
+ gassign *assign_stmt = dyn_cast<gassign *> (use_stmt);
+ if (!assign_stmt)
+ continue;
+
+ tree assign_stmt_lhs = gimple_assign_lhs (assign_stmt);
+ if (TREE_CODE (assign_stmt_lhs) == ARRAY_REF)
+ {
+ tree array = TREE_OPERAND (assign_stmt_lhs, 0);
+ tree& max = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (array)));
+ max = size_int (loop->safelen - 1);
+ relayout_decl (array);
+ }
+ }
+ }
+ }
+ }
+}
+
/* Cleanup uses of SIMT placeholder internal functions: on non-SIMT targets,
VF is 1 and LANE is 0; on SIMT targets, VF is folded to a constant, and
LANE is kept to be expanded to RTL later on. Also cleanup all other SIMT
@@ -2627,6 +2696,8 @@ find_simtpriv_var_op (tree *tp, int *walk_subtrees, void *)
static unsigned int
execute_omp_device_lower ()
{
+ adjust_max_vf (cfun);
+
int vf = targetm.simt.vf ? targetm.simt.vf () : 1;
bool regimplify = false;
basic_block bb;
@@ -2755,7 +2826,10 @@ execute_omp_device_lower ()
rhs = build_int_cst (type, vf);
break;
case IFN_GOMP_MAX_VF:
- rhs = build_int_cst (type, omp_max_vf (false));
+ /* Use constant_lower_bound to keep max_vf consistent with
+ adjust_max_vf above. */
+ rhs = build_int_cst (type,
+ constant_lower_bound (omp_max_vf (false)));
break;
case IFN_GOMP_SIMT_ORDERED_PRED:
rhs = vf == 1 ? integer_zero_node : NULL_TREE;