On Fri, 19 Jun 2015 14:25:57 +0200
Jakub Jelinek <ja...@redhat.com> wrote:

> On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> > On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> > 
> > > I understand it is more work; I'd just like to ask that, when
> > > designing stuff for OpenACC offloading, you (plural) try to take
> > > the other offloading devices and host fallback into account.
> > 
> > The problem is that many of the transformations we need to do are
> > really GPU-specific, and with the current structure of
> > omplow/ompexp they are being done in the host compiler. The
> > offloading scheme we decided on gives us no way to write out
> > multiple versions of an offloaded function such that each target
> > gets a different one. For that reason I think we should postpone
> > these lowering decisions until we're in the accel compiler, where
> > they could be controlled by target hooks; over the last two weeks
> > I've been doing some experiments to see how that could be
> > achieved.

> I wonder why struct loop flags and other info, together with
> function attributes and/or cgraph flags, aren't sufficient for the
> OpenACC needs.
> Have you or Thomas looked at what we're doing for OpenMP simd /
> Cilk+ simd?
> 
> Why can't the execution model (normal, vector-single and
> worker-single) simply be attributes on functions or cgraph node
> flags, and the kind of #acc loop simply be flags on struct loop, as
> is already done for OpenMP simd / Cilk+ simd?

One problem is that (at least on the GPU hardware we've considered so
far) we have only limited control over how the underlying hardware
executes code: it's possible to draw up a scheme in which OpenACC
source-level control-flow semantics are reflected directly in the PTX
assembly output (e.g. to say "all threads in a CTA/warp will be
coherent after such-and-such a loop"), and lowering OpenACC directives
quite early seems to make that relatively tractable. (Even if the
resulting code is relatively un-optimisable, due to the abnormal edges
inserted to make sure that the CFG doesn't become "ill-formed".)
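
To make that concrete, here's a minimal hand-written CUDA kernel (my
own sketch, purely illustrative -- not compiler output):

__global__ void
scale (float *data, int n, float f)
{
  /* Threads can diverge inside the loop, since their trip counts may
     differ...  */
  for (int i = threadIdx.x; i < n; i += blockDim.x)
    data[i] *= f;

  /* ...but every thread eventually exits the loop, so the CTA is
     coherent again at this point and the barrier (PTX bar.sync) is
     well-defined.  Lowering early keeps that guarantee visible in
     the CFG.  */
  __syncthreads ();
}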

If arbitrary optimisations are done between OMP-lowering time and
somewhere around vectorisation (say), it's less clear whether that
correspondence can be maintained. If, for example, the code executed
by half the threads in a warp becomes physically separated from the
code executed by the other half due to some loop optimisation, we can
no longer easily determine where that warp will reconverge, and
certain other operations that rely on coherent warps (e.g. CTA
synchronisation) become impossible. A similar issue exists for warps
within a CTA.
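
For instance (again a hand-written sketch, not the output of any real
pass), suppose a transformation leaves the two halves of a warp on
physically separate paths:

__global__ void
bad_barrier (int *data)
{
  if (threadIdx.x < warpSize / 2)
    {
      data[threadIdx.x] += 1;
      __syncthreads ();  /* Reached by only half of each warp:
                            undefined behaviour.  */
    }
  else
    {
      data[threadIdx.x] -= 1;
      __syncthreads ();  /* The "matching" barrier on the other path
                            doesn't help; the warp never reconverges
                            at either of them.  */
    }
}

Contrast that with the loop above, where the barrier sits at a point
at which coherence is guaranteed; here no such point can be
identified.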

So, essentially -- I don't know how "late" loop lowering would interact
with:

(a) Maintaining a CFG that will work with PTX.

(b) Predication for worker-single and/or vector-single modes.
(Actually, all currently-proposed schemes have trouble properly
representing data dependencies, for both variables and
compiler-generated temporaries, across predicated regions; see the
sketch after this list.)
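
To illustrate the problem in (b) concretely, here's roughly what a
vector-single region looks like when written out by hand (my own
sketch, not any proposed implementation):

__global__ void
vector_single (int *out, const int *in)
{
  int t = 0;

  /* Predicated region: in vector-single mode only lane 0 of each
     warp executes the statement, so 't' is defined in one lane
     only.  */
  if (threadIdx.x % warpSize == 0)
    t = in[blockIdx.x] * 2;

  /* The other lanes need 't' afterwards, so an explicit broadcast
     from lane 0 has to be inserted -- and one is needed for every
     variable or temporary that is live across the region boundary,
     which is what makes the predicated representation awkward.  */
  t = __shfl (t, 0);

  out[blockIdx.x * blockDim.x + threadIdx.x] = t;
}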

Julian
