Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> Just wanted to see -fdump-tree-ompexp dump say from the testcase I've
> posted.  Does your patchset have any dependencies that aren't on the trunk?
> If not, I guess I just could apply the patchset and look at the results, but
> if there are, it would need applying more.

Hm, the testcase has a reduction, which would cause the loop have a _SIMDUID
clause, which would in turn make my patch give up, setting do_simt_transform
to false.  So I'm using presence of SIMDUID to see whether the loop has any
reduction/lastprivate data, which I'm not handling for SIMT yet.

(I should really start a branch)

Alexander


Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets

2015-12-02 Thread Jakub Jelinek
On Tue, Dec 01, 2015 at 06:28:27PM +0300, Alexander Monakov wrote:
> @@ -10218,12 +10218,37 @@ expand_omp_simd (struct omp_region *region, struct 
> omp_for_data *fd)
>  
>n1 = fd->loop.n1;
>n2 = fd->loop.n2;
> +  step = fd->loop.step;
> +  bool do_simt_transform
> += (cgraph_node::get (current_function_decl)->offloadable
> +   && !broken_loop
> +   && !safelen
> +   && !simduid
> +   && !(fd->collapse > 1));

expand_omp is depth-first expansion, so for the case where the simd
region is in lexically (directly or indirectly) nested inside of a
target region, the above will not trigger.  You'd need to
use cgraph_node::get (current_function_decl)->offloadable or
just walk through outer fields of region up and see if this isn't in
a target region.

Also, please consider privatized variables in the simd loops.
int
foo (int *p)
{
  int r = 0, i;
  #pragma omp simd reduction(+:r)
  for (i = 0; i < 32; i++)
{
  p[i] += i;
  r += i;
}
  return r;
}
#pragma omp declare target to (foo)

int
main ()
{
  int p[32], err, i;
  for (i = 0; i < 32; i++)
p[i] = i;
  #pragma omp target map(tofrom:p) map(from:err)
  {
int r = 0;
#pragma omp simd reduction(+:r)
for (i = 0; i < 32; i++)
{
  p[i] += i;
  r += i;
}
err = r != 31 * 32 / 2;
err |= foo (p) != 31 * 32 / 2;
  }
  if (err)
__builtin_abort ();
  for (i = 0; i < 32; i++)
if (p[i] != 3 * i)
  __builtin_abort ();
  return 0;
}

Here, it would be nice to extend omp_max_vf in the host compiler,
such that if PTX offloading is enabled, and optimize && !optimize_debug
(and vectorizer on the host not disabled, otherwise it won't be cleaned up
on the host), it returns MIN (32, whatever it would return otherwise).
And then arrange for the stores to and other operations on the "omp simd array"
attributed arrays before/after the simd loop to be handled specially for
SIMT, basically you want those to be .local, if non-addressable handled as
any other scalars, the loop up to GOMP_SIMD_LANES run exactly once, and for
the various reductions or lastprivate selection reduce it the SIMT way or
pick value from the thread in warp that had the last SIMT lane, etc.

> +  if (do_simt_transform)
> +{
> +  tree simt_lane
> + = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
> + integer_type_node, 0);
> +  simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
> +  simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
> +  cfun->curr_properties &= ~PROP_gimple_lomp_dev;

How does this even compile?  simt_lane is a local var in the if
(do_simt_transform) body.
> +}
> +
>if (gimple_omp_for_combined_into_p (fd->for_stmt))
>  {
>tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
>OMP_CLAUSE__LOOPTEMP_);
>gcc_assert (innerc);
>n1 = OMP_CLAUSE_DECL (innerc);
> +  if (do_simt_transform)
> + {
> +   n1 = fold_convert (type, n1);
> +   if (POINTER_TYPE_P (type))
> + n1 = fold_build_pointer_plus (n1, simt_lane);

And then you use it here, outside of its scope.

BTW, again, it would help if you post a simple *.ompexp dump on what exactly
you want to look it up.

Jakub


Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> expand_omp is depth-first expansion, so for the case where the simd
> region is in lexically (directly or indirectly) nested inside of a
> target region, the above will not trigger.  You'd need to
> use cgraph_node::get (current_function_decl)->offloadable or
> just walk through outer fields of region up and see if this isn't in
> a target region.

I've addressed this in my follow-up response to this patch.  Again, sorry for
the mishap, I was overconfident when adjusting the patch just before sending.

> Here, it would be nice to extend omp_max_vf in the host compiler,
> such that if PTX offloading is enabled, and optimize && !optimize_debug
> (and vectorizer on the host not disabled, otherwise it won't be cleaned up
> on the host), it returns MIN (32, whatever it would return otherwise).

Did you mean MAX (32, host_vf), not MIN?

> How does this even compile?  simt_lane is a local var in the if
> (do_simt_transform) body.

I addressed in this in the reposted patch too, a few hours after posting this
broken code.

> BTW, again, it would help if you post a simple *.ompexp dump on what exactly
> you want to look it up.

Sorry, I'm not following you here -- can you rephrase what I should post?

Thanks.
Alexander


Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 04:54:39PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > expand_omp is depth-first expansion, so for the case where the simd
> > region is in lexically (directly or indirectly) nested inside of a
> > target region, the above will not trigger.  You'd need to
> > use cgraph_node::get (current_function_decl)->offloadable or
> > just walk through outer fields of region up and see if this isn't in
> > a target region.
> 
> I've addressed this in my follow-up response to this patch.  Again, sorry for
> the mishap, I was overconfident when adjusting the patch just before sending.
> 
> > Here, it would be nice to extend omp_max_vf in the host compiler,
> > such that if PTX offloading is enabled, and optimize && !optimize_debug
> > (and vectorizer on the host not disabled, otherwise it won't be cleaned up
> > on the host), it returns MIN (32, whatever it would return otherwise).
> 
> Did you mean MAX (32, host_vf), not MIN?

Sure, MAX.  Though, if the SIMTification treats "omp simd array" arrays
specially, it probably only cares whether it is > 1 (because 1 disables the
"omp simd array" handling).  If all we want to achieve is that those arrays
in PTX ACCEL_COMPILER become again scalars (or aggregates or whatever they
were before) with each thread in warp writing their own, it doesn't really
care about their size that much.

> > How does this even compile?  simt_lane is a local var in the if
> > (do_simt_transform) body.
> 
> I addressed in this in the reposted patch too, a few hours after posting this
> broken code.
> 
> > BTW, again, it would help if you post a simple *.ompexp dump on what exactly
> > you want to look it up.
> 
> Sorry, I'm not following you here -- can you rephrase what I should post?

Just wanted to see -fdump-tree-ompexp dump say from the testcase I've
posted.  Does your patchset have any dependencies that aren't on the trunk?
If not, I guess I just could apply the patchset and look at the results, but
if there are, it would need applying more.

Jakub


Re: [gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets

2015-12-01 Thread Alexander Monakov
Apologies -- last-minute attempt to cleanup and enhance broke this patch;
fixed version below.  The main difference is checking whether we're
transforming a loop that might be executed on the target: checking
decl->offloadable isn't enough, because target region outlining might not have
happened yet; in that case, we need to walk the region tree upwards to check
if any containing region is a target region.

Alexander

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index a3c4a90..3189e96 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -142,6 +142,28 @@ expand_ANNOTATE (gcall *)
   gcc_unreachable ();
 }
 
+/* Lane index on SIMT targets: thread index in the warp on NVPTX.  On targets
+   without SIMT execution this should be expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_LANE (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  /* FIXME: use a separate pattern for OpenMP?  */
+  gcc_assert (targetm.have_oacc_dim_pos ());
+  emit_insn (targetm.gen_oacc_dim_pos (target, const2_rtx));
+}
+
+/* This should get expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_VF (gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* This should get expanded in adjust_simduid_builtins.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1cb14a8..66c7422 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
 
 DEF_INTERNAL_FN (LOAD_LANES, ECF_CONST | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_VF, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index cc0435e..0478b2a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10173,7 +10173,7 @@ expand_omp_simd (struct omp_region *region, struct 
omp_for_data *fd)
  OMP_CLAUSE_SAFELEN);
   tree simduid = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
  OMP_CLAUSE__SIMDUID_);
-  tree n1, n2;
+  tree n1, n2, step, simt_lane;
 
   type = TREE_TYPE (fd->loop.v);
   entry_bb = region->entry;
@@ -10218,12 +10218,36 @@ expand_omp_simd (struct omp_region *region, struct 
omp_for_data *fd)
 
   n1 = fd->loop.n1;
   n2 = fd->loop.n2;
+  step = fd->loop.step;
+  bool offloaded = cgraph_node::get (current_function_decl)->offloadable;
+  for (struct omp_region *reg = region; !offloaded && reg; reg = reg->outer)
+offloaded = reg->type == GIMPLE_OMP_TARGET;
+  bool do_simt_transform
+= offloaded && !broken_loop && !safelen && !simduid && !(fd->collapse > 1);
+  if (do_simt_transform)
+{
+  simt_lane
+   = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
+   integer_type_node, 0);
+  simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
+  simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
+  cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+}
+
   if (gimple_omp_for_combined_into_p (fd->for_stmt))
 {
   tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 OMP_CLAUSE__LOOPTEMP_);
   gcc_assert (innerc);
   n1 = OMP_CLAUSE_DECL (innerc);
+  if (do_simt_transform)
+   {
+ n1 = fold_convert (type, n1);
+ if (POINTER_TYPE_P (type))
+   n1 = fold_build_pointer_plus (n1, simt_lane);
+ else
+   n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, 
simt_lane));
+   }
   innerc = find_omp_clause (OMP_CLAUSE_CHAIN (innerc),
OMP_CLAUSE__LOOPTEMP_);
   gcc_assert (innerc);
@@ -10239,8 +10263,15 @@ expand_omp_simd (struct omp_region *region, struct 
omp_for_data *fd)
 }
   else
 {
-  expand_omp_build_assign (, fd->loop.v,
-  fold_convert (type, fd->loop.n1));
+  if (do_simt_transform)
+   {
+ n1 = fold_convert (type, n1);
+ if (POINTER_TYPE_P (type))
+   n1 = fold_build_pointer_plus (n1, simt_lane);
+ else
+   n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, 
simt_lane));
+   }
+  expand_omp_build_assign (, fd->loop.v, fold_convert (type, n1));
   if (fd->collapse > 1)
for (i = 0; i < fd->collapse; i++)
  {
@@ -10262,10 +10293,18 @@ expand_omp_simd (struct omp_region *region, struct 
omp_for_data *fd)
   stmt = gsi_stmt (gsi);
   gcc_assert (gimple_code (stmt) == GIMPLE_OMP_CONTINUE);
 
+  if (do_simt_transform)

[gomp-nvptx 9/9] adjust SIMD loop lowering for SIMT targets

2015-12-01 Thread Alexander Monakov
This is incomplete.

This handles OpenMP SIMD for NVPTX in simple cases, partly by punting on
anything unusual such as simduid loops, partly by getting lucky, as testcases
do not expose the missing bits.

What it currently does is transform SIMD loop

  for (V = N1; V cmp N2; V + STEP) BODY;

into

  for (V = N1 + (STEP * LANE); V cmp N2; V + (STEP * VF)) BODY;

and then folding LANE/VF to 0/1 on non-NVPTX post-ipa.

To make it proper, I'll need to handle SIMDUID loops (still thinking how to
best approach that), and SAFELEN (but that simply need a condition jump around
the loop, "if (LANE >= SAFELEN)").  Handling collapsed loops eventually should
be nice too.

Also, it needs something like __nvptx_{enter/exit}_simd() calls around the
loop, to switch from uniform to non-uniform SIMT execution (set bitmask in
__nvptx_uni from 0 to -1, and back on exit), and to switch from per-warp
soft-stacks to per-hwthread hard-stacks (by reserving a small area in .local
memory, and setting __nvptx_stacks[] pointer to top of that area).

Also, since SIMD regions should run on per-hwthread stacks, I'm thinking I'll
have to outline the loop into its own function.  Can I do that post-ipa
easily?
---
 gcc/internal-fn.c   |  22 +
 gcc/internal-fn.def |   2 +
 gcc/omp-low.c   | 138 +---
 gcc/passes.def  |   1 +
 gcc/tree-pass.h |   2 +
 5 files changed, 158 insertions(+), 7 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index a3c4a90..3189e96 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -142,6 +142,28 @@ expand_ANNOTATE (gcall *)
   gcc_unreachable ();
 }
 
+/* Lane index on SIMT targets: thread index in the warp on NVPTX.  On targets
+   without SIMT execution this should be expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_LANE (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  /* FIXME: use a separate pattern for OpenMP?  */
+  gcc_assert (targetm.have_oacc_dim_pos ());
+  emit_insn (targetm.gen_oacc_dim_pos (target, const2_rtx));
+}
+
+/* This should get expanded in omp_device_lower pass.  */
+
+static void
+expand_GOMP_SIMT_VF (gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* This should get expanded in adjust_simduid_builtins.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 1cb14a8..66c7422 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
 
 DEF_INTERNAL_FN (LOAD_LANES, ECF_CONST | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOMP_SIMT_VF, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index cc0435e..51ac0e5 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10173,7 +10173,7 @@ expand_omp_simd (struct omp_region *region, struct 
omp_for_data *fd)
  OMP_CLAUSE_SAFELEN);
   tree simduid = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
  OMP_CLAUSE__SIMDUID_);
-  tree n1, n2;
+  tree n1, n2, step;
 
   type = TREE_TYPE (fd->loop.v);
   entry_bb = region->entry;
@@ -10218,12 +10218,37 @@ expand_omp_simd (struct omp_region *region, struct 
omp_for_data *fd)
 
   n1 = fd->loop.n1;
   n2 = fd->loop.n2;
+  step = fd->loop.step;
+  bool do_simt_transform
+= (cgraph_node::get (current_function_decl)->offloadable
+   && !broken_loop
+   && !safelen
+   && !simduid
+   && !(fd->collapse > 1));
+  if (do_simt_transform)
+{
+  tree simt_lane
+   = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_SIMT_LANE,
+   integer_type_node, 0);
+  simt_lane = fold_convert (TREE_TYPE (step), simt_lane);
+  simt_lane = fold_build2 (MULT_EXPR, TREE_TYPE (step), step, simt_lane);
+  cfun->curr_properties &= ~PROP_gimple_lomp_dev;
+}
+
   if (gimple_omp_for_combined_into_p (fd->for_stmt))
 {
   tree innerc = find_omp_clause (gimple_omp_for_clauses (fd->for_stmt),
 OMP_CLAUSE__LOOPTEMP_);
   gcc_assert (innerc);
   n1 = OMP_CLAUSE_DECL (innerc);
+  if (do_simt_transform)
+   {
+ n1 = fold_convert (type, n1);
+ if (POINTER_TYPE_P (type))
+   n1 = fold_build_pointer_plus (n1, simt_lane);
+ else
+   n1 = fold_build2 (PLUS_EXPR, type, n1, fold_convert (type, 
simt_lane));
+   }
   innerc = find_omp_clause (OMP_CLAUSE_CHAIN (innerc),
OMP_CLAUSE__LOOPTEMP_);