On September 16, 2014 5:34:15 PM CEST, Tom de Vries <tom_devr...@mentor.com> 
wrote:
>On 09-09-14 12:56, Richard Biener wrote:
>> On Tue, 9 Sep 2014, Tom de Vries wrote:
>>
>>> On 18-08-14 14:16, Tom de Vries wrote:
>>>> On 06-08-14 17:10, Tom de Vries wrote:
>>>>> We could insert a pass-group here that only deals with functions
>>>>> that have the kernels directive, and do the auto-par thing in a
>>>>> pass_oacc_kernels (which should share the majority of the
>>>>> infrastructure with the parloops pass):
>>>>> ...
>>>>>             NEXT_PASS (pass_build_ealias);
>>>>>             INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
>>>>>                NEXT_PASS (pass_ch);
>>>>>                NEXT_PASS (pass_ccp);
>>>>>                NEXT_PASS (pass_lim_aux);
>>>>>                NEXT_PASS (pass_oacc_par);
>>>>>             POP_INSERT_PASSES ()
>>>>> ...
>>>>>
>>>>> Any comments, ideas or suggestions ?
>>>>
>>>> I've experimented with implementing this on top of gomp-4_0-branch,
>>>> and I ran into PR46032.
>>>>
>>>> PR46032 is about vectorization failure on a function split off by
>>>> omp parallelization. The vectorization fails due to aliasing
>>>> constraints in the split off function, which are not present in the
>>>> original code.
>>
>> Heh.  At least the omp-low.c parts from comment #1 should be pushed
>> to trunk...
>>
>
>Hi Richard,
>
>Right, but the intra_create_variable_infos part does not apply cleanly,
>and I 
>don't know yet how to resolve that.

That part is no longer necessary.

I'll follow up with the rest of the mail after I return from vacation.

Richard.

>>>> In the gomp-4_0-branch, the code marked by the openacc kernels
>>>> directive is split off during omp_expand. The generated code has the
>>>> same additional aliasing constraints, and in pass_oacc_par the
>>>> parallelization fails.
>>>>
>>>> PR46032 contains a tentative patch by Richard Biener, which applies
>>>> cleanly on top of 4.6 (I haven't yet reached a level of understanding
>>>> of tree-ssa-structalias.c to be able to resolve the conflict in
>>>> intra_create_variable_infos when applying on 4.7). The tentative
>>>> patch involves running ipa-pta, which is also a pass run after the
>>>> point where we write out the lto stream. I'm not sure whether it
>>>> makes sense to run the ipa-pta pass as part of the pass_oacc_kernels
>>>> pass list.
>>
>> No, that's not even possible I think.
>>
>
>OK, thanks for confirming that.
>
>>>> I see three ways of continuing from here:
>>>> - take the tentative patch and make it work, including running
>>>>     ipa-pta during passes_oacc_kernels
>>>> - same, but try somehow to manage without running ipa-pta.
>>>> - try to postpone splitting of the function until the end of
>>>>     pass_oacc_par.
>>
>> I don't understand the last option?  What is the actual issue you run
>> into?  You split oacc kernels off and _then_ run "autopar" on the
>> split-off function (and get additional kernels)?
>>
>
>Let me try to reiterate the problem in more detail.
>
>We're trying to implement the auto-parallelization part of the oacc
>kernels 
>directive using the existing parloops pass. The source starting point
>is the 
>gomp-4_0-branch.  The gomp-4_0-branch has a dummy implementation of the
>oacc 
>kernels directive, analogous to the oacc parallel directive.
>
>So the current gomp-4_0-branch does the following steps for oacc 
>parallel/kernels directives:
>1. pass_lower_omp/scan_omp:
>    - create record type with rewrite vars (.omp_data_t).
>    - declare function with arg with type pointer to .omp_data_t.
>2. pass_lower_omp/lower_omp:
>    - rewrite region in terms of rewrite vars
>    - add omp_return at end
>3. pass_expand_omp:
>    - split off the region into a separate function
>    - replace region with call to GOACC_parallel/GOACC_kernels, with
>      function pointer as argument
>
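
The three steps above can be sketched by hand in plain C. This is only an
illustration of the shape of the transformation, not what the compiler
actually emits: struct omp_data_t stands in for the generated .omp_data_t
record, and launch_region is a hypothetical host-only stand-in for
GOACC_parallel/GOACC_kernels (the real runtime entry points take more
arguments and may offload).

```c
#include <assert.h>
#include <stddef.h>

/* Step 1 (scan_omp): record type holding the variables used in the region.
   Hand-written stand-in for the compiler-generated .omp_data_t.  */
struct omp_data_t {
    const int *a;
    const int *b;
    int *c;
    size_t n;
};

/* Steps 2 and 3: the region body, rewritten in terms of the record's
   fields and split off into its own function taking a pointer to it.  */
static void kernels_region_fn(void *data)
{
    struct omp_data_t *d = data;
    for (size_t i = 0; i < d->n; i++)
        d->c[i] = d->a[i] + d->b[i];
}

/* Hypothetical stand-in for GOACC_parallel/GOACC_kernels: it simply calls
   the split-off function on the host.  */
static void launch_region(void (*fn)(void *), void *data)
{
    assert(fn != NULL);
    fn(data);
}

/* The original function after expansion: the region is replaced by a call
   that passes the split-off function as a function pointer.  */
void vector_add(const int *a, const int *b, int *c, size_t n)
{
    struct omp_data_t d = { a, b, c, n };
    launch_region(kernels_region_fn, &d);
}
```

Calling vector_add on three arrays of length n gives the same result as the
original loop; only the place where the loop body lives has changed.
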
>I wrote an example with a single oacc kernels region containing a
>simple vector 
>addition loop, and tried to make auto-parallelization work.
>
>The first problem I ran into was that the parloops pass failed to
>analyze the 
>dependencies in a vector addition example, due to the fact that the
>region was 
>already split off into a separate function, similar to PR46032.
>
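
A rough illustration of why the split-off form is harder to analyze, assuming
a hand-written record like the .omp_data_t one: inside the split-off
function the array pointers are just values loaded from a struct, so nothing
in that function shows that they are distinct. If they overlap, the order of
the iterations matters, which is exactly what a parallelizer has to rule
out. The function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-in for the generated record type.  */
struct omp_data_t {
    int *a;
    int *b;
    int *c;
    size_t n;
};

/* Iterations in increasing order, as in the sequential loop.  */
void add_forward(struct omp_data_t *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < d->n; i++)
        d->c[i] = d->a[i] + d->b[i];
}

/* Same body, iterations in decreasing order.  For a loop that is safe to
   parallelize, the result must not depend on the order.  */
void add_backward(struct omp_data_t *d)
{
    assert(d != NULL);
    for (size_t i = d->n; i-- > 0;)
        d->c[i] = d->a[i] + d->b[i];
}
```

With disjoint arrays both orders agree. With c pointing one element past a,
each iteration reads what the previous one wrote, and the two orders give
different results. In the original function the arrays are visibly separate
objects; in the split-off function that information has to be recovered by
alias analysis.
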
>I looked briefly into the patch set in PR46032, but I realized that
>even if I 
>fix it, the next problem I run into will be that the parloops pass is
>run after 
>the lto stream read/write point. So any changes the parloops pass makes
>at that 
>point are in the accelerator compile flow, in other words we're talking
>about 
>launching an accelerator kernel from the accelerator. While that is
>possible 
>with recent cuda accelerators, I guess in general we should not expect
>that to 
>be possible.
>[ I also thought of a fancy scheme where we don't split off a new
>function, but 
>manipulate the body of the already split off function, and emit a c
>file from 
>the accelerator compiler containing the parameters that the host
>compiler should 
>use to launch the accelerator kernel... but I guess that would be a
>last resort. ]
>
>So in order to solve the lto stream read/write point problem, I moved
>the 
>parloops pass (well, a copy called pass_oacc_par or similar) up in the
>pass 
>list, to before lto stream read/write point. That precludes solving the
>alias 
>problem with the PR46032 patch set, since we need ipa for that.
>
>I solved (well, rather prevented) the alias problem by disabling
>pass_expand_omp for GIMPLE_OACC_KERNELS, in other words disabling the
>function-split-off in pass_expand_omp and letting pass_oacc_par take
>care of that (this is what I meant by 'postpone splitting of the
>function until the end of pass_oacc_par').
>Doing so required me to write a patch to handle omp-lowered code
>conservatively in ccp and forwprop, otherwise the 'rewrite region in
>terms of rewrite vars' would be undone by the time we arrive at
>pass_oacc_par.
>
>>>> Some advice on how to continue from here would be *highly*
>>>> appreciated. My hunch atm is to investigate the last option.
>>>>
>>>
>>> Jakub,
>>> Richard,
>>>
>>> I've investigated the last option, and published the current state
>>> in git-only branch vries/oacc-kernels (
>>> https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels
>>> ).
>>>
>>> The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35
>>> is that:
>>> - a simple loop marked with the oacc kernels directive is analyzed
>>>     for parallelization,
>>> - the loop is then rewritten using oacc parallel and oacc loop
>>>     directives
>>> - these oacc directives are expanded using omp_expand_local
>>> - this results in the loop being split off into a separate function,
>>>     while the loop is replaced with a GOACC_parallel call
>>> - all this is done before writing out the lto stream
>>> - no support yet for reductions, nested loops, more than one loop
>>>     nest in kernels region
>>>
>>> At toplevel, the added pass list looks like this:
>>> ...
>>>            NEXT_PASS (pass_build_ealias);
>>>            /* Pass group that runs when there are oacc kernels in the
>>>               function.  */
>>
>> Not sure why pass_oacc_kernels runs before all the other local
>> cleanups?  I would have put it after pass_cd_dce at least.
>>
>
>My focus was on running pass_oacc_kernels ASAP, in order not to have to
>adapt 
>more passes to leave the omp-lowered code alone. I'll give your
>suggestion a try.
>
>>>            NEXT_PASS (pass_oacc_kernels);
>>>            PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>>>                NEXT_PASS (pass_ch_oacc_kernels);
>>>                NEXT_PASS (pass_tree_loop_init);
>>>                NEXT_PASS (pass_lim);
>>>                NEXT_PASS (pass_ccp);
>>>                NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>>>                NEXT_PASS (pass_tree_loop_done);
>>>            POP_INSERT_PASSES ()
>>>   ...
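
A toy model of such a gated pass group, in plain C. This is not GCC's pass
manager API: struct pass, run_pass_list, the gate function and the
has_oacc_kernels flag are all hypothetical stand-ins. It only shows the idea
that when the gate of the group pass is closed, the group and everything
nested inside it are skipped, so functions without a kernels region pay
nothing for the extra passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-function state; passes_run just counts executed passes.  */
struct function_info {
    bool has_oacc_kernels;
    int passes_run;
};

struct pass {
    const char *name;
    bool (*gate)(const struct function_info *); /* NULL: always run */
    void (*execute)(struct function_info *);
    const struct pass *sub;   /* first pass nested in this one, or NULL */
    const struct pass *next;  /* next pass at the same level, or NULL */
};

static bool gate_oacc_kernels(const struct function_info *fn)
{
    return fn->has_oacc_kernels;
}

static void count_pass(struct function_info *fn)
{
    fn->passes_run++;
}

/* A shortened version of the pass list above: one ungated pass followed by
   a gated group containing two sub-passes.  */
static const struct pass pass_parloops =
    { "parallelize_loops_oacc_kernels", NULL, count_pass, NULL, NULL };
static const struct pass pass_ch =
    { "ch_oacc_kernels", NULL, count_pass, NULL, &pass_parloops };
static const struct pass pass_group =
    { "oacc_kernels", gate_oacc_kernels, count_pass, &pass_ch, NULL };
static const struct pass pass_ealias =
    { "build_ealias", NULL, count_pass, NULL, &pass_group };

void run_pass_list(const struct pass *p, struct function_info *fn)
{
    assert(fn != NULL);
    for (; p != NULL; p = p->next) {
        if (p->gate != NULL && !p->gate(fn))
            continue;               /* gate closed: skip pass and sub-passes */
        if (p->execute != NULL)
            p->execute(fn);
        run_pass_list(p->sub, fn);  /* nested passes run only if the gate was open */
    }
}

const struct pass *toy_pipeline(void)
{
    return &pass_ealias;
}
```
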
>>>
>>> The main question I'm currently facing is the following: when to do
>>> lowering (in other words, rewriting of variable access in terms of
>>> .omp_data) of the kernels region. There are basically 2 passes that
>>> contain code to do this:
>>> - pass_lower_omp (on pre-ssa code)
>>> - pass_parallelize_loops (on ssa code)
>>
>> Both use the same utilities.
>>
>
>I think you mean that both passes use the same utilities to do
>omp-expand (in 
>other words, pass_parallelize_loops uses omp_expand_local).
>But AFAIU, the omp-lowering in pass_parallelize_loops (in particular,
>the 
>rewrite of the region in terms of rewrite vars) shares no code with the
>omp pass.
>
>>> Atm I'm using pass_lower_omp, and I've added a patch that handles
>>> omp-lowered code conservatively in ccp and forwprop in order for the
>>> lowering to remain until arriving at
>>> pass_parallelize_loops_oacc_kernels.
>>
>> You mean omp-_un_-lowered code?
>>
>
>No, I mean pass_lower_omp lowers the code into omp-lowered code, and
>the patch in question prevents ccp and forwprop from undoing the
>lowering before arriving at the point where we split off the function.
>
>>> But it might turn out to be easier/necessary to handle this in
>>> pass_parallelize_loops_oacc_kernels instead.
>>
>> I'd do it similar to how autopar does it
>
>OK, I'll then try to do the lowering for the kernels region in
>pass_parallelize_loops_oacc_kernels, not in pass_lower_omp.
>
>FWIW, I'm looking now into reductions, and started thinking in the same
>direction.
>
>> (not that autopar is a great
>> example for a GCC pass these days...).
>>
>
>For my understanding, could you briefly elaborate on that (or give a
>reference 
>to an earlier discussion)?
>
>Thanks,
>- Tom
>
>> Richard.
>>
>>> Any advice on this issue, and on the current implementation is
>>> welcome.
>>>
>>> Thanks,
>>> - Tom

