Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-24 Thread Bernd Schmidt

On 06/19/2015 03:45 PM, Jakub Jelinek wrote:


If the loop remains in the IL (isn't optimized away as unreachable or
isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
the flags on struct loop should still be there.  For the loop clauses
(reduction always, and private/lastprivate if addressable etc.) for
OpenMP simd / Cilk+ simd we use special arrays indexed by internal
functions, which during vectorization are then shrunk (but in theory could
be expanded too) to the right vectorization factor if the loop is vectorized
(with the accesses within the loop of course vectorized using SIMD), and
shrunk to 1 element if it is not.


I'd appreciate if you could describe that mechanism in more detail. As 
far as I can tell it is very poorly commented and documented in the 
code. I mean, it doesn't even follow the minimal coding standards of 
describing function inputs:


/* Helper function of lower_rec_input_clauses, used for #pragma omp simd
   privatization.  */

static bool
lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int max_vf,
  tree idx, tree lane, tree ivar, tree lvar)


Bernd



Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-24 Thread Jakub Jelinek
On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote:
 On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
 
 If the loop remains in the IL (isn't optimized away as unreachable or
 isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
 the flags on struct loop should still be there.  For the loop clauses
 (reduction always, and private/lastprivate if addressable etc.) for
 OpenMP simd / Cilk+ simd we use special arrays indexed by internal
 functions, which during vectorization are then shrunk (but in theory could
 be expanded too) to the right vectorization factor if the loop is vectorized
 (with the accesses within the loop of course vectorized using SIMD), and
 shrunk to 1 element if it is not.
 
 I'd appreciate if you could describe that mechanism in more detail. As far
 as I can tell it is very poorly commented and documented in the code. I
 mean, it doesn't even follow the minimal coding standards of describing
 function inputs:
 
 /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
privatization.  */
 
 static bool
 lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int max_vf,
 tree idx, tree lane, tree ivar, tree lvar)

Here is the theory behind it:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html
In the end it is using internal functions instead of uglified builtins.
I'd suggest you look at some of the libgomp.c/simd*.c tests, say
with -O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized}
to see how it is lowered and expanded.  I assume #pragma omp simd roughly
corresponds to #pragma acc loop vector, maxvf for PTX vectorization is
supposedly 32 (warp size).  For SIMD vectorization, if the vectorization
fails, the arrays are shrunk to 1 element, otherwise they are shrunk to the
vectorization factor, and later optimizations if they aren't really
addressable optimized using FRE and other memory optimizations so that they
don't touch memory unless really needed.
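
To make that concrete, here is a minimal source-level sketch (hand-written
C, not a GCC dump; the per-lane array names in the comments are purely
illustrative) of the kind of loop the scheme applies to:

/* The privatized X and the reduction accumulator SUM are the kinds of
   variables that omplower conceptually turns into per-lane arrays indexed
   by an internal simd-lane function; the vectorizer later shrinks those
   arrays to the real vectorization factor, or to a single element if the
   loop is not vectorized.  */
float
simd_array_example (const float *a, int n, float *last)
{
  float sum = 0.0f;
  float x = 0.0f;

#pragma omp simd lastprivate(x) reduction(+:sum)
  for (int i = 0; i < n; i++)
    {
      x = a[i] * 2.0f;   /* conceptually: x_simd_array[lane] = ...    */
      sum += x;          /* conceptually: sum_simd_array[lane] += ... */
    }

  *last = x;             /* lastprivate: value from the last iteration */
  return sum;
}
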
For the PTX style vectorization (parallelization between threads in a warp),
I'd say you would always shrink to 1 element again, but such variables would
be local to each of the threads in the warp (or another possibility is
shared arrays of size 32 indexed by %tid.x & 31), while addressable variables
without such magic type would be shared among all threads; non-addressable
variables (SSA_NAMEs) depending on where they are used.
You'd need to transform reductions (which are right now represented as
another loop, from 0 to an internal function, so easily recognizable) into
the PTX reductions.  Also, lastprivate is now an access to the array using
the last lane internal function, dunno what that corresponds to in PTX
(perhaps also a reduction, where all but the thread executing the last
iteration or in, say, 0 and the remaining thread ors in the lastprivate value).
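
To model that lastprivate-as-reduction idea in plain C (purely illustrative,
neither PTX nor GCC output):

/* Every thread ors in 0 except the one that executed the last iteration,
   so an OR-reduction across the warp yields the lastprivate value.  */
static unsigned int
lastprivate_via_or (const unsigned int *value_of_thread, int num_threads,
                    int last_iter_thread)
{
  unsigned int result = 0;
  for (int t = 0; t < num_threads; t++)
    result |= (t == last_iter_thread ? value_of_thread[t] : 0u);
  return result;
}
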

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Nathan Sidwell

On 06/22/15 11:18, Bernd Schmidt wrote:


You can have a hint that it is desirable, but not a hint that it is correct
(because passes in between may invalidate that). The OpenACC directives
guarantee to the compiler that the program can be transformed into a parallel
form. If we lose them early we must then rely on our analysis which may not be
strong enough to prove that the loop can be parallelized. If we make these
transformations early enough, while we still have the OpenACC directives, we can
guarantee that we do exactly what the programmer specified.


How does this differ from openmp's needs to preserve parallelism on a parallel 
loop?  Is it more than the reconvergence issue?


nathan

--
Nathan Sidwell


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Nathan Sidwell

On 06/22/15 12:20, Jakub Jelinek wrote:


OpenMP worksharing loop is just coordination between the threads in the
team, which thread takes which subset of the loop's iterations, and
optionally followed by a barrier.  OpenMP simd loop is a loop that has
certain properties guaranteed by the user and can be vectorized.
In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then
idles on some of them until there is work for them.


correct.  I expressed my question poorly.  What I mean is that in OpenMP, a loop
that is parallelizable (by user decree, I guess[*]) should not be transformed
such that it is not parallelizable.


This seems to me to be a common requirement of both languages.  How one gets 
parallel threads of execution to the body of the loop is a different question.


nathan

[*] For ones where the compiler needs to detect parallelizability, it's
preferable that it doesn't do something earlier to force serializability.


--
Nathan Sidwell


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Julian Brown
On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek ja...@redhat.com wrote:

 On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
  One problem is that (at least on the GPU hardware we've considered
  so far) we're somewhat constrained in how much control we have over
  how the underlying hardware executes code: it's possible to draw up
  a scheme where OpenACC source-level control-flow semantics are
  reflected directly in the PTX assembly output (e.g. to say all
  threads in a CTA/warp will be coherent after such-and-such a
  loop), and lowering OpenACC directives quite early seems to make
  that relatively tractable. (Even if the resulting code is
  relatively un-optimisable due to the abnormal edges inserted to
  make sure that the CFG doesn't become ill-formed.)
  
  If arbitrary optimisations are done between OMP-lowering time and
  somewhere around vectorisation (say), it's less clear if that
  correspondence can be maintained. Say if the code executed by half
  the threads in a warp becomes physically separated from the code
  executed by the other half of the threads in a warp due to some loop
  optimisation, we can no longer easily determine where that warp will
  reconverge, and certain other operations (relying on coherent warps
  -- e.g. CTA synchronisation) become impossible. A similar issue
  exists for warps within a CTA.
  
  So, essentially -- I don't know how late loop lowering would
  interact with:
  
  (a) Maintaining a CFG that will work with PTX.
  
  (b) Predication for worker-single and/or vector-single modes
  (actually all currently-proposed schemes have problems with proper
  representation of data-dependencies for variables and
  compiler-generated temporaries between predicated regions.)
 
 I don't understand why lowering the way you suggest helps here at all.
 In the proposed scheme, you essentially have whole function
 in e.g. worker-single or vector-single mode, which you need to be
 able to handle properly in any case, because users can write such
 routines themselves.  And then you can have a loop in such a function
 that has some special attribute, a hint that it is desirable to
 vectorize it (for PTX the PTX way) or use vector-single mode for it
 in a worker-single function.  So, the special pass then of course
 needs to handle all the needed broadcasting and reduction required to
 change the mode from e.g. worker-single to vector-single, but the
 convergence points still would be either on the boundary of such
 loops to be vectorized or parallelized, or wherever else they appear
 in normal vector-single or worker-single functions (around the calls
 to certainly calls?).

I think most of my concerns are centred around loops (with the markings
you suggest) that might be split into parts: if that cannot happen for
loops that are annotated as you describe, maybe things will work out OK.

(Apologies for my ignorance here, this isn't a part of the compiler
that I know anything about.)

Julian


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Bernd Schmidt

On 06/22/2015 04:24 PM, Jakub Jelinek wrote:

I don't understand why lowering the way you suggest helps here at all.
In the proposed scheme, you essentially have whole function
in e.g. worker-single or vector-single mode, which you need to be able to
handle properly in any case, because users can write such routines
themselves.  And then you can have a loop in such a function that
has some special attribute, a hint that it is desirable to vectorize it
(for PTX the PTX way) or use vector-single mode for it in a worker-single
function.


You can have a hint that it is desirable, but not a hint that it is 
correct (because passes in between may invalidate that). The OpenACC 
directives guarantee to the compiler that the program can be transformed 
into a parallel form. If we lose them early we must then rely on our 
analysis which may not be strong enough to prove that the loop can be 
parallelized. If we make these transformations early enough, while we 
still have the OpenACC directives, we can guarantee that we do exactly 
what the programmer specified.



Bernd



Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Jakub Jelinek
On Mon, Jun 22, 2015 at 12:08:36PM -0400, Nathan Sidwell wrote:
 On 06/22/15 11:18, Bernd Schmidt wrote:
 
 You can have a hint that it is desirable, but not a hint that it is correct
 (because passes in between may invalidate that). The OpenACC directives
 guarantee to the compiler that the program can be transformed into a parallel
 form. If we lose them early we must then rely on our analysis which may not 
 be
 strong enough to prove that the loop can be parallelized. If we make these
 transformations early enough, while we still have the OpenACC directives, we 
 can
 guarantee that we do exactly what the programmer specified.
 
 How does this differ from openmp's needs to preserve parallelism on a
 parallel loop?  Is it more than the reconvergence issue?

OpenMP has a significantly different execution model: a parallel block in
OpenMP is run by a certain number of threads (the initial thread (the one
encountering that region) and then, depending on clauses and library
decisions, perhaps others), with a barrier at the end of the region, and
afterwards only the initial thread continues again.
So, an OpenMP parallel is implemented as a library call, taking the function
outlined from the parallel's body as one of its arguments; the body is
executed by the initial thread and perhaps others.
OpenMP worksharing loop is just coordination between the threads in the
team, which thread takes which subset of the loop's iterations, and
optionally followed by a barrier.  OpenMP simd loop is a loop that has
certain properties guaranteed by the user and can be vectorized.
In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then
idles on some of them until there is work for them.
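
As a rough illustration of that execution model (ordinary OpenMP source,
nothing GCC-internal; the comments just restate the semantics described
above):

#include <omp.h>

void
example (int *a, int n)
{
  /* The parallel body is outlined into its own function and run by a team
     of threads (the encountering thread plus, perhaps, others).  */
#pragma omp parallel
  {
    /* Worksharing: the iterations are divided among the team's threads,
       optionally followed by a barrier.  */
#pragma omp for
    for (int i = 0; i < n; i++)
      a[i] += omp_get_thread_num ();
  }
  /* Implicit barrier at the end of the parallel region; afterwards only
     the initial thread continues.  */
}
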

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Julian Brown
On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek ja...@redhat.com wrote:

 On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
  One problem is that (at least on the GPU hardware we've considered
  so far) we're somewhat constrained in how much control we have over
  how the underlying hardware executes code: it's possible to draw up
  a scheme where OpenACC source-level control-flow semantics are
  reflected directly in the PTX assembly output (e.g. to say all
  threads in a CTA/warp will be coherent after such-and-such a
  loop), and lowering OpenACC directives quite early seems to make
  that relatively tractable. (Even if the resulting code is
  relatively un-optimisable due to the abnormal edges inserted to
  make sure that the CFG doesn't become ill-formed.)
  
  If arbitrary optimisations are done between OMP-lowering time and
  somewhere around vectorisation (say), it's less clear if that
  correspondence can be maintained. Say if the code executed by half
  the threads in a warp becomes physically separated from the code
  executed by the other half of the threads in a warp due to some loop
  optimisation, we can no longer easily determine where that warp will
  reconverge, and certain other operations (relying on coherent warps
  -- e.g. CTA synchronisation) become impossible. A similar issue
  exists for warps within a CTA.
  
  So, essentially -- I don't know how late loop lowering would
  interact with:
  
  (a) Maintaining a CFG that will work with PTX.
  
  (b) Predication for worker-single and/or vector-single modes
  (actually all currently-proposed schemes have problems with proper
  representation of data-dependencies for variables and
  compiler-generated temporaries between predicated regions.)
 
 I don't understand why lowering the way you suggest helps here at all.
 In the proposed scheme, you essentially have whole function
 in e.g. worker-single or vector-single mode, which you need to be
 able to handle properly in any case, because users can write such
 routines themselves.

In vector-single or worker-single mode, divergence of threads within a
warp or a CTA is controlled by broadcasting the controlling expression
of conditional branches to the set of inactive threads, so each of
those follows along with the active thread. So you only get
potentially-problematic thread divergence when workers or vectors are
operating in partitioned mode.

So, for instance, a made-up example:

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
  {
    #pragma acc loop worker
    for (j = 0; j < M; j++)
    {
      if (j < M / 2)
        /* stmt 1 */
      else
        /* stmt 2 */
    }

    /* reconvergence point: thread barrier */

    [...]
  }
}

Here stmt 1 and stmt 2 execute in worker-partitioned, vector-single
mode. With early lowering, the reconvergence point can be
inserted at the end of the loop, and abnormal edges (etc.) can be used
to ensure that the CFG does not get changed in such a way that there is
no longer a unique point at which the loop threads reconverge.

With late lowering, it's no longer obvious to me if that can still be
done.

Julian


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Jakub Jelinek
On Mon, Jun 22, 2015 at 06:48:10PM +0100, Julian Brown wrote:
 In vector-single or worker-single mode, divergence of threads within a
 warp or a CTA is controlled by broadcasting the controlling expression
 of conditional branches to the set of inactive threads, so each of
 those follows along with the active thread. So you only get
 potentially-problematic thread divergence when workers or vectors are
 operating in partitioned mode.
 
 So, for instance, a made-up example:
 
 #pragma acc parallel
 {
   #pragma acc loop gang
   for (i = 0; i < N; i++)
   {
     #pragma acc loop worker
     for (j = 0; j < M; j++)
     {
       if (j < M / 2)
         /* stmt 1 */
       else
         /* stmt 2 */
     }
 
     /* reconvergence point: thread barrier */
 
     [...]
   }
 }
 
 Here stmt 1 and stmt 2 execute in worker-partitioned, vector-single
 mode. With early lowering, the reconvergence point can be
 inserted at the end of the loop, and abnormal edges (etc.) can be used
 to ensure that the CFG does not get changed in such a way that there is
 no longer a unique point at which the loop threads reconverge.
 
 With late lowering, it's no longer obvious to me if that can still be
 done.

Why?  The loop still has an exit edge (if there is no break/return/throw out of
the loop which I bet is not allowed), so you just insert the reconvergence
point at the exit edge from the loop.
For the late lowering, I said it is up for benchmarking/investigation
where it would be best placed, it doesn't have to be after the loop passes,
there are plenty of optimization passes even before those.  But once you turn
many of the SSA_NAMEs in a function into (ab) ssa vars, many optimizations
just give up.
And, if you really want to avoid certain loop optimizations, you have always
the possibility to e.g. wrap certain statement in the loop in internal
function (e.g. the loop condition) or something similar to make the passes
more careful about those loops and make it easier to lower it later.

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Bernd Schmidt

On 06/19/2015 03:45 PM, Jakub Jelinek wrote:

I actually believe having some optimization passes in between the ompexp
and the lowering of the IR into the form PTX wants is highly desirable,
the form with the worker-single or vector-single mode lowered will contain
too complex CFG for many optimizations to be really effective, especially
if it uses abnormal edges.  E.g. inlining supposedly would have harder job
etc.  What exact unpredictable effects do you fear?


Mostly the ones I can't predict. But let's take one example, LICM: let's 
say you pull some assignment out of a loop, then you find yourself in 
one of two possible situations: either it's become not actually 
available inside the loop (because the data and control flow is not 
described correctly and the compiler doesn't know what's going on), or, 
to avoid that, you introduce additional broadcasting operations when 
entering the loop, which might be quite expensive.
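
To spell that out with a tiny made-up example (plain C, all names
hypothetical; the comments only restate the concern above):

/* Before LICM: the invariant product is computed by every vector lane
   inside the (conceptually "#pragma acc loop vector") loop.  */
static void
before_licm (float *a, int n, float c0, float c1)
{
  for (int i = 0; i < n; i++)
    {
      float t = c0 * c1;   /* loop-invariant */
      a[i] += t;
    }
}

/* After LICM: the product is computed once, in vector-single code, so on
   PTX its value would have to be broadcast to the other threads of the
   warp before the partitioned loop starts (or rematerialized inside the
   loop), which is the extra cost being discussed here.  */
static void
after_licm (float *a, int n, float c0, float c1)
{
  float t = c0 * c1;
  for (int i = 0; i < n; i++)
    a[i] += t;
}
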



Bernd



Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Jakub Jelinek
On Mon, Jun 22, 2015 at 03:59:57PM +0200, Bernd Schmidt wrote:
 On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
 I actually believe having some optimization passes in between the ompexp
 and the lowering of the IR into the form PTX wants is highly desirable,
 the form with the worker-single or vector-single mode lowered will contain
 too complex CFG for many optimizations to be really effective, especially
 if it uses abnormal edges.  E.g. inlining supposedly would have harder job
 etc.  What exact unpredictable effects do you fear?
 
 Mostly the ones I can't predict. But let's take one example, LICM: let's say
 you pull some assignment out of a loop, then you find yourself in one of two
 possible situations: either it's become not actually available inside the
 loop (because the data and control flow is not described correctly and the
 compiler doesn't know what's going on), or, to avoid that, you introduce

Why do you think that would happen?  E.g. for non-addressable gimple types you'd
most likely just have a PHI for it on the loop.

 additional broadcasting operations when entering the loop, which might be
 quite expensive.

If the PHI has a cheap initialization, it is not a problem to emit it as an
initialization in the loop instead of a broadcast (kind of like RA
rematerialization).  And by actually adding such an optimization, you help
even code that has a computation in vector-single code and uses it in a
vector acc loop.

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Jakub Jelinek
On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
 One problem is that (at least on the GPU hardware we've considered so
 far) we're somewhat constrained in how much control we have over how the
 underlying hardware executes code: it's possible to draw up a scheme
 where OpenACC source-level control-flow semantics are reflected directly
 in the PTX assembly output (e.g. to say all threads in a CTA/warp will
 be coherent after such-and-such a loop), and lowering OpenACC
 directives quite early seems to make that relatively tractable. (Even
 if the resulting code is relatively un-optimisable due to the abnormal
 edges inserted to make sure that the CFG doesn't become ill-formed.)
 
 If arbitrary optimisations are done between OMP-lowering time and
 somewhere around vectorisation (say), it's less clear if that
 correspondence can be maintained. Say if the code executed by half the
 threads in a warp becomes physically separated from the code executed
 by the other half of the threads in a warp due to some loop
 optimisation, we can no longer easily determine where that warp will
 reconverge, and certain other operations (relying on coherent warps --
 e.g. CTA synchronisation) become impossible. A similar issue exists for
 warps within a CTA.
 
 So, essentially -- I don't know how late loop lowering would interact
 with:
 
 (a) Maintaining a CFG that will work with PTX.
 
 (b) Predication for worker-single and/or vector-single modes
 (actually all currently-proposed schemes have problems with proper
 representation of data-dependencies for variables and
 compiler-generated temporaries between predicated regions.)

I don't understand why lowering the way you suggest helps here at all.
In the proposed scheme, you essentially have whole function
in e.g. worker-single or vector-single mode, which you need to be able to
handle properly in any case, because users can write such routines
themselves.  And then you can have a loop in such a function that
has some special attribute, a hint that it is desirable to vectorize it
(for PTX the PTX way) or use vector-single mode for it in a worker-single
function.  So, the special pass then of course needs to handle all the
needed broadcasting and reduction required to change the mode from e.g.
worker-single to vector-single, but the convergence points still would be
either on the boundary of such loops to be vectorized or parallelized, or
wherever else they appear in normal vector-single or worker-single functions
(around the calls to certainly calls?).

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-22 Thread Julian Brown
On Fri, 19 Jun 2015 14:25:57 +0200
Jakub Jelinek ja...@redhat.com wrote:

 On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
  On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
  
  I understand it is more work, I'd just like to ask that when
  designing stuff for the OpenACC offloading you (plural) try to
  take the other offloading devices and host fallback into account.
  
  The problem is that many of the transformations we need to do are
  really GPU specific, and with the current structure of
  omplow/ompexp they are being done in the host compiler. The
  offloading scheme we decided on does not give us the means to write
  out multiple versions of an offloaded function where each target
  gets a different one. For that reason I think we should postpone
  these lowering decisions until we're in the accel compiler, where
  they could be controlled by target hooks, and over the last two
  weeks I've been doing some experiments to see how that could be
  achieved.

 I wonder why struct loop flags and other info together with function
 attributes and/or cgraph flags and other info aren't sufficient for
 the OpenACC needs.
 Have you or Thomas looked what we're doing for OpenMP simd / Cilk+
 simd?
 
 Why can't the execution model (normal, vector-single and
 worker-single) be simply attributes on functions or cgraph node flags
 and the kind of #acc loop simply be flags on struct loop, like
 already OpenMP simd / Cilk+ simd is?

One problem is that (at least on the GPU hardware we've considered so
far) we're somewhat constrained in how much control we have over how the
underlying hardware executes code: it's possible to draw up a scheme
where OpenACC source-level control-flow semantics are reflected directly
in the PTX assembly output (e.g. to say all threads in a CTA/warp will
be coherent after such-and-such a loop), and lowering OpenACC
directives quite early seems to make that relatively tractable. (Even
if the resulting code is relatively un-optimisable due to the abnormal
edges inserted to make sure that the CFG doesn't become ill-formed.)

If arbitrary optimisations are done between OMP-lowering time and
somewhere around vectorisation (say), it's less clear if that
correspondence can be maintained. Say if the code executed by half the
threads in a warp becomes physically separated from the code executed
by the other half of the threads in a warp due to some loop
optimisation, we can no longer easily determine where that warp will
reconverge, and certain other operations (relying on coherent warps --
e.g. CTA synchronisation) become impossible. A similar issue exists for
warps within a CTA.

So, essentially -- I don't know how late loop lowering would interact
with:

(a) Maintaining a CFG that will work with PTX.

(b) Predication for worker-single and/or vector-single modes
(actually all currently-proposed schemes have problems with proper
representation of data-dependencies for variables and
compiler-generated temporaries between predicated regions.)

Julian


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-19 Thread Bernd Schmidt

On 06/19/2015 02:25 PM, Jakub Jelinek wrote:

Emitting PTX specific code from current ompexp is highly undesirable of
course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
around for too long either, they've never meant to be used in low gimple,
and even all the early optimization passes could screw them up badly,


The idea is not to keep them around for very long, but I think there's 
no reason why they couldn't survive a while longer. Between ompexpand 
and the end of build_ssa_passes, we have (ignoring things like chkp and 
ubsan which can just be turned off for offloaded functions if necessary):

  NEXT_PASS (pass_ipa_free_lang_data);
  NEXT_PASS (pass_ipa_function_and_variable_visibility);
  NEXT_PASS (pass_fixup_cfg);
  NEXT_PASS (pass_init_datastructures);
  NEXT_PASS (pass_build_ssa);
  NEXT_PASS (pass_early_warn_uninitialized);
  NEXT_PASS (pass_nothrow);

Nothing in there strikes me as particularly problematic if we can make 
things like GIMPLE_OMP_FOR survive into-ssa - which I think I did in my 
patch. Besides, the OpenACC kernels path generates them in SSA form 
anyway during parloops so one could make the argument that this is a 
step towards better consistency.



they are also very much OpenMP or OpenACC specific, rather than representing
language neutral behavior, so there is a problem that you'd need M x N
different expansions of those constructs, which is not really maintainable
(M being number of supported offloading standards, right now 2, and N
number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).


Well, that's a problem we have anyway, independent on how we implement 
all these devices and standards. I don't see how that's relevant to the 
discussion.



I wonder why struct loop flags and other info together with function
attributes and/or cgraph flags and other info aren't sufficient for the
OpenACC needs.
Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?



Why can't the execution model (normal, vector-single and worker-single)
be simply attributes on functions or cgraph node flags and the kind of
#acc loop simply be flags on struct loop, like already OpenMP simd
/ Cilk+ simd is?


We haven't looked at Cilk+ or anything like that. You suggest using 
attributes and flags, but at what point do you intend to actually lower 
the IR to actually represent what's going on?



The vector level parallelism is something where on the host/host_noshm/XeonPhi
(dunno about HSA) you want vectorization to happen, and that is already
implemented in the vectorizer pass, implementing it again elsewhere is
highly undesirable.  For PTX the implementation is of course different,
and the vectorizer is likely not the right pass to handle them, but why
can't the same struct loop flags be used by the pass that handles the
conditionalization of execution for the 2 of the 3 above modes?


Agreed on wanting the vectorizer to handle things for normal machines, 
that is one of the motivations for pushing the lowering past the offload 
LTO writeout stage. The problem with OpenACC on GPUs is that the 
predication really changes the CFG and the data flow - I fear 
unpredictable effects if we let any optimizers run before lowering 
OpenACC to the point where we actually represent what's going on in the 
function.



Bernd



Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-19 Thread Jakub Jelinek
On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
 they are also very much OpenMP or OpenACC specific, rather than representing
 language neutral behavior, so there is a problem that you'd need M x N
 different expansions of those constructs, which is not really maintainable
 (M being number of supported offloading standards, right now 2, and N
 number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).
 
 Well, that's a problem we have anyway, independent on how we implement all
 these devices and standards. I don't see how that's relevant to the
 discussion.

It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

 I wonder why struct loop flags and other info together with function
 attributes and/or cgraph flags and other info aren't sufficient for the
 OpenACC needs.
 Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?
 
 Why can't the execution model (normal, vector-single and worker-single)
 be simply attributes on functions or cgraph node flags and the kind of
 #acc loop simply be flags on struct loop, like already OpenMP simd
 / Cilk+ simd is?
 
 We haven't looked at Cilk+ or anything like that. You suggest using
 attributes and flags, but at what point do you intend to actually lower the
 IR to actually represent what's going on?

I think around where the vectorizer is, perhaps before the loop optimization
pass queue (or after it, some investigation is needed).

 The vector level parallelism is something where on the 
 host/host_noshm/XeonPhi
 (dunno about HSA) you want vectorization to happen, and that is already
 implemented in the vectorizer pass, implementing it again elsewhere is
 highly undesirable.  For PTX the implementation is of course different,
 and the vectorizer is likely not the right pass to handle them, but why
 can't the same struct loop flags be used by the pass that handles the
 conditionalization of execution for the 2 of the 3 above modes?
 
 Agreed on wanting the vectorizer to handle things for normal machines,
 that is one of the motivations for pushing the lowering past the offload LTO
 writeout stage. The problem with OpenACC on GPUs is that the predication
 really changes the CFG and the data flow - I fear unpredictable effects if
 we let any optimizers run before lowering OpenACC to the point where we
 actually represent what's going on in the function.

I actually believe having some optimization passes in between the ompexp
and the lowering of the IR into the form PTX wants is highly desirable,
the form with the worker-single or vector-single mode lowered will contain
too complex CFG for many optimizations to be really effective, especially
if it uses abnormal edges.  E.g. inlining supposedly would have harder job
etc.  What exact unpredictable effects do you fear?
If the loop remains in the IL (isn't optimized away as unreachable or
isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
the flags on struct loop should still be there.  For the loop clauses
(reduction always, and private/lastprivate if addressable etc.) for
OpenMP simd / Cilk+ simd we use special arrays indexed by internal
functions, which during vectorization are then shrunk (but in theory could
be expanded too) to the right vectorization factor if the loop is vectorized
(with the accesses within the loop of course vectorized using SIMD), and
shrunk to 1 element if it is not.  So the PTX IL lowering pass could use the
same arrays (the "omp simd array" attribute) to transform the decls into
thread-local vars as opposed to vars shared by the whole CTA.

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-19 Thread Jakub Jelinek
On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
 On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
 
 I understand it is more work, I'd just like to ask that when designing stuff
 for the OpenACC offloading you (plural) try to take the other offloading
 devices and host fallback into account.
 
 The problem is that many of the transformations we need to do are really GPU
 specific, and with the current structure of omplow/ompexp they are being
 done in the host compiler. The offloading scheme we decided on does not give
 us the means to write out multiple versions of an offloaded function where
 each target gets a different one. For that reason I think we should postpone
 these lowering decisions until we're in the accel compiler, where they could
 be controlled by target hooks, and over the last two weeks I've been doing
 some experiments to see how that could be achieved.

Emitting PTX specific code from current ompexp is highly undesirable of
course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
around for too long either, they've never meant to be used in low gimple,
and even all the early optimization passes could screw them up badly,
they are also very much OpenMP or OpenACC specific, rather than representing
language neutral behavior, so there is a problem that you'd need M x N
different expansions of those constructs, which is not really maintainable
(M being number of supported offloading standards, right now 2, and N
number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

I wonder why struct loop flags and other info together with function
attributes and/or cgraph flags and other info aren't sufficient for the
OpenACC needs.
Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?

Why can't the execution model (normal, vector-single and worker-single)
be simply attributes on functions or cgraph node flags and the kind of
#acc loop simply be flags on struct loop, like already OpenMP simd
/ Cilk+ simd is?

I mean, you need to implement the PTX broadcasting etc. for the 3 different
modes (one where each thread executes everything, another one where
only first thread in a warp executes everything, other threads only
call functions with the same mode, or specially marked loops), another one
where only a single thread (in the CTA) executes everything, other threads
only call functions with the same mode or specially marked loops, because
if you have #acc routine (something) ... that is just an attribute of a
function, not really some construct in the body of it.

The vector level parallelism is something where on the host/host_noshm/XeonPhi
(dunno about HSA) you want vectorization to happen, and that is already
implemented in the vectorizer pass, implementing it again elsewhere is
highly undesirable.  For PTX the implementation is of course different,
and the vectorizer is likely not the right pass to handle them, but why
can't the same struct loop flags be used by the pass that handles the
conditionalization of execution for the 2 of the 3 above modes?

Then there is the worker level parallelism, but I'd hope it can be handled
similarly, and supposedly the pass that handles vector-single and
worker-single lowering for PTX could do the same for non-PTX targets
- if the OpenACC execution model is that all the (e.g. pthread based)
threads are started immediately and you skip in worker-single mode work on
other than the first thread, then it needs to behave similarly to PTX,
just probably needs to use library calls rather than PTX builtins to query
the thread number.

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-19 Thread Bernd Schmidt

On 05/28/2015 05:08 PM, Jakub Jelinek wrote:


I understand it is more work, I'd just like to ask that when designing stuff
for the OpenACC offloading you (plural) try to take the other offloading
devices and host fallback into account.


The problem is that many of the transformations we need to do are really 
GPU specific, and with the current structure of omplow/ompexp they are 
being done in the host compiler. The offloading scheme we decided on 
does not give us the means to write out multiple versions of an 
offloaded function where each target gets a different one. For that 
reason I think we should postpone these lowering decisions until we're 
in the accel compiler, where they could be controlled by target hooks, 
and over the last two weeks I've been doing some experiments to see how 
that could be achieved.


The basic idea is to delay expanding the inner regions of an OpenACC 
target region during ompexp, write out offload LTO (almost) immediately 
afterwards, and then have another ompexp phase which runs on the accel 
compiler to take the offloaded function to its final form. The first 
attempt really did write LTO immediately after, before moving to SSA 
phase. It seems that this could be made to work, but the pass manager 
and LTO code rather expects that what is being read in is in SSA form 
already. Also, some offloaded code is produced by OpenACC kernels 
expansion much later in the compilation, so with this approach we have 
an inconsistency where functions we get back from LTO are at very 
different levels of lowering.


The next attempt was to run the into-ssa passes after ompexpand, and 
only then write things out. I've changed the gimple representation of 
some OMP statements (primarily gimple_omp_for) so that they are 
relatively normal statements with operands that can be transformed into 
SSA form. As far as what's easier to work with - I believe some of the 
transformations we have to do could benefit from being in SSA, but on 
the other hand the OpenACC predication code has given me some trouble. 
I've still not completely convinced myself that the update_ssa call I've 
added will actually do the right thing after we've mucked up the CFG.


I'm appending a proof-of-concept patch. This is intended to show the 
general outline of what I have in mind, rather than pass the testsuite. 
It's good enough to compile some of the OpenACC testcases (let's say 
worker-single-3 if you need one). Let me know what you think.



Bernd

Index: gcc/cgraphunit.c
===
--- gcc/cgraphunit.c	(revision 224547)
+++ gcc/cgraphunit.c	(working copy)
@@ -2171,6 +2171,23 @@ ipa_passes (void)
   execute_ipa_pass_list (passes->all_small_ipa_passes);
   if (seen_error ())
 	return;
+
+  if (g->have_offload)
+	{
+	  extern void write_offload_lto ();
+	  section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX;
+	  write_offload_lto ();
+	}
+}
+  bool do_local_opts = !in_lto_p;
+#ifdef ACCEL_COMPILER
+  do_local_opts = true;
+#endif
+  if (do_local_opts)
+{
+  execute_ipa_pass_list (passes->all_local_opt_passes);
+  if (seen_error ())
+	return;
 }
 
   /* This extra symtab_remove_unreachable_nodes pass tends to catch some
@@ -2182,7 +2199,7 @@ ipa_passes (void)
   if (symtab->state < IPA_SSA)
     symtab->state = IPA_SSA;
 
-  if (!in_lto_p)
+  if (do_local_opts)
 {
   /* Generate coverage variables and constructors.  */
   coverage_finish ();
@@ -2285,6 +2302,14 @@ symbol_table::compile (void)
   if (seen_error ())
 return;
 
+#ifdef ACCEL_COMPILER
+  {
+cgraph_node *node;
+FOR_EACH_DEFINED_FUNCTION (node)
+  node->get_untransformed_body ();
+  }
+#endif
+
 #ifdef ENABLE_CHECKING
   symtab_node::verify_symtab_nodes ();
 #endif
Index: gcc/config/nvptx/nvptx.c
===
--- gcc/config/nvptx/nvptx.c	(revision 224547)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -1171,18 +1171,42 @@ nvptx_section_from_addr_space (addr_spac
 }
 }
 
-/* Determine whether DECL goes into .const or .global.  */
+/* Determine the address space DECL lives in.  */
 
-const char *
-nvptx_section_for_decl (const_tree decl)
+static addr_space_t
+nvptx_addr_space_for_decl (const_tree decl)
 {
+  if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL)
+return ADDR_SPACE_GENERIC;
+
+  if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)) != NULL_TREE)
+return ADDR_SPACE_SHARED;
+
   bool is_const = (CONSTANT_CLASS_P (decl)
 		   || TREE_CODE (decl) == CONST_DECL
 		   || TREE_READONLY (decl));
   if (is_const)
-return ".const";
+return ADDR_SPACE_CONST;
 
-  return ".global";
+  return ADDR_SPACE_GLOBAL;
+}
+
+/* Return a ptx string representing the address space for a variable DECL.  */
+
+const char *
+nvptx_section_for_decl (const_tree decl)
+{
+  switch (nvptx_addr_space_for_decl (decl))
+{
+case ADDR_SPACE_CONST:
+  return ".const";
+   

Re: [gomp4] Preserve NVPTX reconvergence points

2015-06-03 Thread Julian Brown
On Thu, 28 May 2015 16:37:04 +0200
Richard Biener richard.guent...@gmail.com wrote:

 On Thu, May 28, 2015 at 4:06 PM, Julian Brown
 jul...@codesourcery.com wrote:
  For NVPTX, it is vitally important that the divergence of threads
  within a warp can be controlled: in particular we must be able to
  generate code that we know reconverges at a particular point.
  Unfortunately GCC's middle-end optimisers can cause this property to
  be violated, which causes problems for the OpenACC execution model
  we're planning to use for NVPTX.
 
 Hmm, I don't think adding a new edge flag is good nor necessary.  It
 seems to me that instead the broadcast operation should have abnormal
 control flow and thus basic-blocks should be split either before or
 after it (so either incoming or outgoing edge(s) should be
 abnormal).  I suppose splitting before the broadcast would be best
 (thus handle it similar to setjmp ()).

Here's a version of the patch that uses abnormal edges with semantics
unchanged, splitting the false/non-execution edge using a dummy block
to avoid the prohibited case of both EDGE_TRUE/EDGE_FALSE and
EDGE_ABNORMAL on the outgoing edges of a GIMPLE_COND.

So for a fragment like this:

  if (threadIdx.x == 0) /* cond_bb */
  {
    /* work */
    p0 = ...; /* assign */
  }
  pN = broadcast(p0);
  if (pN) goto T; else goto F;

Incoming edges to a broadcast operation have EDGE_ABNORMAL set:

  +---------+
  | cond_bb |------------------+
  +---------+                  |
       | (true edge)           | (false edge)
       v                       v
  +---------+              +-------+
  | (work)  |              | dummy |
  +---------+              +-------+
  | assign  |                  |
  +---------+                  |
ABNORM |                       | ABNORM
       v                       |
  +---------+<-----------------+
  |  bcast  |
  +---------+
  |  cond   |
  +---------+
      / \
     T   F

The abnormal edges actually serve two purposes, I think: as well as
ensuring the broadcast operation takes place when a warp is
non-diverged/coherent, they ensure that p0 is not seen as uninitialised
along the false path from cond_bb, possibly leading to the broadcast
operation being optimised away as partially redundant. This feels
somewhat fragile though! We'll have to continue to think about
warp divergence in subsequent patches.

The patch passes libgomp testing (with Bernd's recent worker-single
patch also). OK for gomp4 branch (together with the
previously-mentioned inline thread builtin patch)?

Thanks,

Julian

ChangeLog

gcc/
* omp-low.c (make_predication_test): Split false block out of
cond_bb, making latter edge abnormal.
(predicate_bb): Set EDGE_ABNORMAL on edges before broadcast
operations.

commit 38056ae4a29f93ce54715dfad843a233f3b0fd2a
Author: Julian Brown jul...@codesourcery.com
Date:   Mon Jun 1 11:12:41 2015 -0700

Use abnormal edges before broadcast ops

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 7048f9f..310eb72 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10555,7 +10555,16 @@ make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
   gsi_insert_after (tmp_gsi, cond_stmt, GSI_NEW_STMT);
 
   true_edge->flags = EDGE_TRUE_VALUE;
-  make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+
+  /* Force an abnormal edge before a broadcast operation that might be present
+ in SKIP_DEST_BB.  This is only done for the non-execution edge (with
+ respect to the predication done by this function) -- the opposite
+ (execution) edge that reaches the broadcast operation must be made
+ abnormal also, e.g. in this function's caller.  */
+  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+  basic_block false_abnorm_bb = split_edge (e);
+  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
+  abnorm_edge->flags |= EDGE_ABNORMAL;
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10605,6 +10614,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
 		   mask);
 
   edge e = split_block (bb, splitpoint);
+  e->flags = EDGE_ABNORMAL;
   skip_dest_bb = e->dest;
 
   gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
@@ -10624,6 +10634,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
 		   gsi_asgn, mask);
 
   edge e = split_block (bb, splitpoint);
+  e->flags = EDGE_ABNORMAL;
   skip_dest_bb = e->dest;
 
   gimple_switch_set_index (sstmt, new_var);


Re: [gomp4] Preserve NVPTX reconvergence points

2015-05-28 Thread Richard Biener
On Thu, May 28, 2015 at 4:06 PM, Julian Brown jul...@codesourcery.com wrote:
 For NVPTX, it is vitally important that the divergence of threads
 within a warp can be controlled: in particular we must be able to
 generate code that we know reconverges at a particular point.
 Unfortunately GCC's middle-end optimisers can cause this property to
 be violated, which causes problems for the OpenACC execution model
 we're planning to use for NVPTX.

 As a brief example: code running in vector-single mode runs on a
 single thread of a warp, and must broadcast condition results to other
 threads of the warp so that they can follow along and be ready for
 vector-partitioned execution when necessary.

 #pragma acc parallel
 {
   #pragma acc loop gang
   for (i = 0; i < N; i++)
   {
     /* This is vector-single mode.  */
     n = ...;
     switch (n)
     {
     case 1:
       #pragma acc loop vector
       for (...)
       {
         /* This is vector-partitioned mode.  */
       }
       ...
     }
   }
 }

 Here, the calculation n = ... takes place on a single thread (of
 each partitioned gang of the outer loop), but the switch statement
 (terminating the BB) must be executed by all threads in the warp. The
 vector-single statements will be translated using a branch around for
 the idle threads:

 if (threadIdx.x == 0)
 {
   n_0 = ...;
 }
 n_x = broadcast (n_0)
 switch (n_x)
 ...

 Where broadcast is an operation that transfers values from some
 other thread of a warp (i.e., the zeroth) to the current thread
 (implemented as a shfl instruction for NVPTX).

 I observed a similar example to this cloning the broadcast and switch
 instructions (in the .dom1 dump), along the lines of:

 if (threadIdx.x == 0)
 {
   n_0 = ...;
   n_x = broadcast (n_0)
   switch (n_x)
   ...
 }
 else
 {
   n_x = broadcast (n_0)
   switch (n_x)
   ...
 }

 This doesn't work because the broadcast operation has to be run with
 non-diverged warps for correct operation, and here there is divergence
 due to the if (threadIdx.x == 0) condition.

 So, the way I have tried to handle this is by attempting to inhibit
 optimisation along edges which have a reconvergence point as their
 destination. The essential idea is to make such edges abnormal,
 although the existing EDGE_ABNORMAL flag is not used because that has
 implicit meaning built into it already, and the new edge type may need
 to be handled differently in some areas. One example is that at
 present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL
 set on their EDGE_TRUE or EDGE_FALSE outgoing edges.

 The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE),
 for the GIMPLE CFG only. In principle there's nothing to stop the flag
 being propagated to the RTL CFG also, in which case it'd probably be
 set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g.
 EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which
 inspect the RTL CFG can continue to only check the ABNORMAL flag. But
 so far (in rather limited testing!), that has not been observed to be
 necessary. (We can control RTL CFG manipulation indirectly by using the
 CANNOT_COPY_INSN_P target hook, sensitive e.g. to the broadcast
 instruction.)

 For the GIMPLE CFG (i.e. in passes operating on GIMPLE form),
 EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e.,
 inhibiting certain optimisations), and so has been added to relevant
 conditionals largely mechanically. Places where it is treated specially
 are:

 * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on
   outgoing edges of a block concluding with a GIMPLE_COND statement.
   But, we allow EDGE_TO_RECONVERGENCE there.

 * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND
   edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass
   tries to split the edge later).

 There are probably other optimisations that will be tripped up by the
 new flag along the same lines as the VRP tweak above, which we will no
 doubt discover in due course.

 Together with the patch,

   https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html

 This shows no regressions for the libgomp tests.

 OK for gomp4 branch?

Hmm, I don't think adding a new edge flag is good nor necessary.  It seems to
me that instead the broadcast operation should have abnormal control flow
and thus basic-blocks should be split either before or after it (so either
incoming or outgoing edge(s) should be abnormal).  I suppose splitting
before the broadcast would be best (thus handle it similar to setjmp ()).

Richard.

 Thanks,

 Julian

 ChangeLog

 gcc/
 * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag.
 (bb_hash_abnorm_or_reconv_pred): New function.
 (hash_abnormal_or_eh_outgoing_edge_p): Consider
 EDGE_TO_RECONVERGENCE also.
 * cfg-flags.def (TO_RECONVERGENCE): Add flag.
 * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges
 leading to a reconvergence 

[gomp4] Preserve NVPTX reconvergence points

2015-05-28 Thread Julian Brown
For NVPTX, it is vitally important that the divergence of threads
within a warp can be controlled: in particular we must be able to
generate code that we know reconverges at a particular point.
Unfortunately GCC's middle-end optimisers can cause this property to
be violated, which causes problems for the OpenACC execution model
we're planning to use for NVPTX.

As a brief example: code running in vector-single mode runs on a
single thread of a warp, and must broadcast condition results to other
threads of the warp so that they can follow along and be ready for
vector-partitioned execution when necessary.

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
  {
    /* This is vector-single mode.  */
    n = ...;
    switch (n)
    {
    case 1:
      #pragma acc loop vector
      for (...)
      {
        /* This is vector-partitioned mode.  */
      }
      ...
    }
  }
}

Here, the calculation n = ... takes place on a single thread (of
each partitioned gang of the outer loop), but the switch statement
(terminating the BB) must be executed by all threads in the warp. The
vector-single statements will be translated using a branch around for
the idle threads:

if (threadIdx.x == 0)
{
  n_0 = ...;
}
n_x = broadcast (n_0)
switch (n_x)
...

Where broadcast is an operation that transfers values from some
other thread of a warp (i.e., the zeroth) to the current thread
(implemented as a shfl instruction for NVPTX).

I observed a similar example to this cloning the broadcast and switch
instructions (in the .dom1 dump), along the lines of:

if (threadIdx.x == 0)
{
  n_0 = ...;
  n_x = broadcast (n_0)
  switch (n_x)
  ...
}
else
{
  n_x = broadcast (n_0)
  switch (n_x)
  ...
}

This doesn't work because the broadcast operation has to be run with
non-diverged warps for correct operation, and here there is divergence
due to the if (threadIdx.x == 0) condition.

So, the way I have tried to handle this is by attempting to inhibit
optimisation along edges which have a reconvergence point as their
destination. The essential idea is to make such edges abnormal,
although the existing EDGE_ABNORMAL flag is not used because that has
implicit meaning built into it already, and the new edge type may need
to be handled differently in some areas. One example is that at
present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL
set on their EDGE_TRUE or EDGE_FALSE outgoing edges.

The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE),
for the GIMPLE CFG only. In principle there's nothing to stop the flag
being propagated to the RTL CFG also, in which case it'd probably be
set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g.
EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which
inspect the RTL CFG can continue to only check the ABNORMAL flag. But
so far (in rather limited testing!), that has not been observed to be
necessary. (We can control RTL CFG manipulation indirectly by using the
CANNOT_COPY_INSN_P target hook, sensitive e.g. to the broadcast
instruction.)

For the GIMPLE CFG (i.e. in passes operating on GIMPLE form),
EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e.,
inhibiting certain optimisations), and so has been added to relevant
conditionals largely mechanically. Places where it is treated specially
are:

* tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on
  outgoing edges of a block concluding with a GIMPLE_COND statement.
  But, we allow EDGE_TO_RECONVERGENCE there.

* tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND
  edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass
  tries to split the edge later).

There are probably other optimisations that will be tripped up by the
new flag along the same lines as the VRP tweak above, which we will no
doubt discover in due course.

Together with the patch,

  https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html

This shows no regressions for the libgomp tests.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

gcc/
* basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag.
(bb_hash_abnorm_or_reconv_pred): New function.
(hash_abnormal_or_eh_outgoing_edge_p): Consider
EDGE_TO_RECONVERGENCE also.
* cfg-flags.def (TO_RECONVERGENCE): Add flag.
* omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges
leading to a reconvergence point.
* cfgbuild.c (purge_dead_tablejump_edges): Consider
EDGE_TO_RECONVERGENCE.
* cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise.
* cfgexpand.c (expand_gimple_tailcall, construct_exit_block)
(pass_expand::execute): Likewise.
* cfghooks.c (can_copy_bbs_p): Likewise.
* cfgloop.c (bb_loop_header_p): Likewise.
* cfgloopmanip.c (scale_loop_profile): Likewise.
* gimple-iterator.c (gimple_find_edge_insert_loc): Likewise.
* graph.c (draw_cfg_node_succ_edges): Likewise.
* graphite-scop-detection.c 

Re: [gomp4] Preserve NVPTX reconvergence points

2015-05-28 Thread Jakub Jelinek
On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
 For NVPTX, it is vitally important that the divergence of threads
 within a warp can be controlled: in particular we must be able to
 generate code that we know reconverges at a particular point.
 Unfortunately GCC's middle-end optimisers can cause this property to
 be violated, which causes problems for the OpenACC execution model
 we're planning to use for NVPTX.
 
 As a brief example: code running in vector-single mode runs on a
 single thread of a warp, and must broadcast condition results to other
 threads of the warp so that they can follow along and be ready for
 vector-partitioned execution when necessary.

I think the lowering of this already at ompexp time is premature,
I think much better would be to have a function attribute (or cgraph
flag) that would be set for functions you want to compile this way
(plus a targetm flag that the targets want to support it that way),
plus a flag in loop structure for the acc loop vector loops
(perhaps the current OpenMP simd loop flags are good enough for that),
and lower it somewhere around the vectorization pass or so.

Or, what exactly do you emit for the fallback code, or for other GPGPUs
or XeonPhi?  To me e.g. for XeonPhi or HSA this sounds like you
want to implement the acc loop gang as a work-sharing loop among
threads (like #pragma omp for) and #pragma acc loop vector like
a loop that should be vectorized if at all possible (like #pragma omp simd).
I really think it is important that OpenACC GCC support is not so strongly
tied to one specific GPGPU, and similarly OpenMP should be usable for
all offloading targets GCC supports.

That way, it is possible to auto-vectorize the code too, decision how
to expand the code of offloaded function is done already separately for each
offloading target, there is a space for optimizations on much simpler
cfg, etc.

Jakub


Re: [gomp4] Preserve NVPTX reconvergence points

2015-05-28 Thread Thomas Schwinge
Hi!

On Thu, 28 May 2015 16:20:11 +0200, Jakub Jelinek ja...@redhat.com wrote:
 On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
  [...]

 I think the lowering of this already at ompexp time is premature

Yes, we're aware of this wart.  :-|

 I think much better would be to have a function attribute (or cgraph
 flag) that would be set for functions you want to compile this way
 (plus a targetm flag that the targets want to support it that way),
 plus a flag in loop structure for the acc loop vector loops
 (perhaps the current OpenMP simd loop flags are good enough for that),
 and lower it somewhere around the vectorization pass or so.

Moving the loop lowering/expansion later is along the same lines as we've
been thinking.  Figuring out how the OpenMP simd implementation works, is
another thing I wanted to look into.

 Or, what exactly do you emit for the fallback code, or for other GPGPUs
 or XeonPhi?  To me e.g. for XeonPhi or HSA this sounds like you
 want to implement the acc loop gang as a work-sharing loop among
 threads (like #pragma omp for) and #pragma acc loop vector like
 a loop that should be vectorized if at all possible (like #pragma omp simd).
 I really think it is important that OpenACC GCC support is not so strongly
 tied to one specific GPGPU

Not disagreeing, but: we have to start somewhere.  GPU offloading and all
its peculiarities is still entering unknown territory in GCC; we're still
learning, and shall try to converge the emerging different
implementations in the future.  Doing the completely generic (agnostic of
specific offloading device) implementation right now is a challenging
task, hence the work on a nvptx-specific prototype first, to put it
this way.

That said, we of course very much welcome your continued review of our
work, and your suggestions!

 and similarly OpenMP should be usable for
 all offloading targets GCC supports.
 
 That way, it is possible to auto-vectorize the code too, decision how
 to expand the code of offloaded function is done already separately for each
 offloading target, there is a space for optimizations on much simpler
 cfg, etc.


Grüße,
 Thomas




Re: [gomp4] Preserve NVPTX reconvergence points

2015-05-28 Thread Jakub Jelinek
On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
  I think much better would be to have a function attribute (or cgraph
  flag) that would be set for functions you want to compile this way
  (plus a targetm flag that the targets want to support it that way),
  plus a flag in loop structure for the acc loop vector loops
  (perhaps the current OpenMP simd loop flags are good enough for that),
  and lower it somewhere around the vectorization pass or so.
 
 Moving the loop lowering/expansion later is along the same lines as we've
 been thinking.  Figuring out how the OpenMP simd implementation works, is
 another thing I wanted to look into.

The OpenMP simd expansion is actually quite a simple thing.
Basically, the simd loop is in ompexp expanded as a normal loop with some
flags in the loop structure (which are pretty much optimization hints).
There is a flag that the user would really like to vectorize it, and another
field that says (from what user told) what vectorization factor is safe to
use regardless of the compiler's analysis.  There are some complications with
privatization clauses, so some variables are represented in GIMPLE as arrays
with maximum vf elements and indexed by an internal function (the simd lane), which
the vectorizer then either turns into a scalar again (if the loop isn't
vectorized), or vectorizes it and for addressables keeps in arrays with
actual vf elements.
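
In source terms the mapping is roughly the following (the struct loop field
names mentioned in the comment are quoted from memory and may not match the
current sources exactly):

/* A simd loop like this ...  */
void
saxpy_simd (float *x, const float *y, float a, int n)
{
#pragma omp simd safelen(8)
  for (int i = 0; i < n; i++)
    x[i] += a * y[i];
}

/* ... is expanded by ompexp as a plain loop whose struct loop carries the
   hints: the simd directive itself sets the "please vectorize" flag
   (force_vectorize), safelen(8) records that a vectorization factor up to 8
   is safe regardless of the compiler's own analysis (safelen), and the
   per-lane privatization arrays are tied to the loop by an id (simduid).  */
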

I admit I don't know too much about OpenACC, but I'd think doing something
similar (i.e. some loop structure hint or request that a particular loop is
vectorized and perhaps something about lexical forward/backward dependencies
in the loop) could work.  Then for XeonPhi or host fallback, you'd just use
the normal vectorizer.  And for PTX, at about the same point, you could
instead of vectorization lower the code to a single working thread doing the
work, except for simd-marked loops, which would be lowered to run on all
threads in the warp.

 Not disagreeing, but: we have to start somewhere.  GPU offloading and all
 its peculiarities is still entering unknown territory in GCC; we're still
 learning, and shall try to converge the emerging different
 implementations in the future.  Doing the completely generic (agnostic of
 specific offloading device) implementation right now is a challenging
 task, hence the work on a nvptx-specific prototype first, to put it
 this way.

I understand it is more work, I'd just like to ask that when designing stuff
for the OpenACC offloading you (plural) try to take the other offloading
devices and host fallback into account.  E.g. the XeonPhi is not hard to
understand: it is pretty much just a many-core x86_64 chip where the
offloading is essentially a way of running something on the other device,
and the emulation mode emulates that well by running it in a
different process.  This stuff is already about what happens in offloaded
code, so considerations for it are similar to those for host code
(especially hosts that can vectorize).

As far as OpenMP / PTX goes, I'll try to find time for it again soon
(busy with OpenMP 4.1 work so far), but e.g. the above stuff (having
a single thread in a warp do most of the non-vectorized work, and only
use other threads in the warp for vectorization) is definitely what
OpenMP will benefit from too.

Jakub