On September 16, 2014 5:34:15 PM CEST, Tom de Vries <tom_devr...@mentor.com> wrote:
>On 09-09-14 12:56, Richard Biener wrote:
>> On Tue, 9 Sep 2014, Tom de Vries wrote:
>>
>>> On 18-08-14 14:16, Tom de Vries wrote:
>>>> On 06-08-14 17:10, Tom de Vries wrote:
>>>>> We could insert a pass-group here that only deals with functions
>>>>> that have the kernels directive, and do the auto-par thing in a
>>>>> pass_oacc_kernels (which should share the majority of the
>>>>> infrastructure with the parloops pass):
>>>>> ...
>>>>>           NEXT_PASS (pass_build_ealias);
>>>>>           INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
>>>>>              NEXT_PASS (pass_ch);
>>>>>              NEXT_PASS (pass_ccp);
>>>>>              NEXT_PASS (pass_lim_aux);
>>>>>              NEXT_PASS (pass_oacc_par);
>>>>>           POP_INSERT_PASSES ()
>>>>> ...
>>>>>
>>>>> Any comments, ideas or suggestions?
>>>>
>>>> I've experimented with implementing this on top of gomp-4_0-branch,
>>>> and I ran into PR46032.
>>>>
>>>> PR46032 is about vectorization failure on a function split off by
>>>> omp parallelization. The vectorization fails due to aliasing
>>>> constraints in the split off function, which are not present in the
>>>> original code.
>>
>> Heh.  At least the omp-low.c parts from comment #1 should be pushed
>> to trunk...
>>
>
>Hi Richard,
>
>Right, but the intra_create_variable_infos part does not apply cleanly,
>and I don't know yet how to resolve that.

That part is no longer necessary. I'll follow up with the rest of the mail
after I return from vacation.

Richard.

>>>> In the gomp-4_0-branch, the code marked by the openacc kernels
>>>> directive is split off during omp_expand. The generated code has
>>>> the same additional aliasing constraints, and in pass_oacc_par the
>>>> parallelization fails.
>>>>
>>>> PR46032 contains a tentative patch by Richard Biener, which
>>>> applies cleanly on top of 4.6 (I haven't yet reached a level of
>>>> understanding of tree-ssa-structalias.c to be able to resolve the
>>>> conflict in intra_create_variable_infos when applying on 4.7). The
>>>> tentative patch involves running ipa-pta, which is also a pass run
>>>> after the point where we write out the lto stream. I'm not sure
>>>> whether it makes sense to run the ipa-pta pass as part of the
>>>> pass_oacc_kernels pass list.
>>
>> No, that's not even possible I think.
>>
>
>OK, thanks for confirming that.
>
>>>> I see three ways of continuing from here:
>>>> - take the tentative patch and make it work, including running
>>>>   ipa-pta during passes_oacc_kernels
>>>> - same, but try somehow to manage without running ipa-pta.
>>>> - try to postpone splitting of the function until the end of
>>>>   pass_oacc_par.
>>
>> I don't understand the last option?  What is the actual issue you run
>> into?  You split oacc kernels off and _then_ run "autopar" on the
>> split-off function (and get additional kernels)?
>>
>
>Let me try to reiterate the problem in more detail.
>
>We're trying to implement the auto-parallelization part of the oacc
>kernels directive using the existing parloops pass. The source starting
>point is the gomp-4_0-branch. The gomp-4_0-branch has a dummy
>implementation of the oacc kernels directive, analogous to the oacc
>parallel directive.
>
>So the current gomp-4_0-branch does the following steps for oacc
>parallel/kernels directives:
>1. pass_lower_omp/scan_omp:
>   - create record type with rewrite vars (.omp_data_t).
>   - declare function with arg with type pointer to .omp_data_t.
>2. pass_lower_omp/lower_omp:
>   - rewrite region in terms of rewrite vars
>   - add omp_return at end
>3. pass_expand_omp:
>   - split off the region into a separate function
>   - replace region with call to GOACC_parallel/GOACC_kernels, with
>     function pointer as argument
>
>I wrote an example with a single oacc kernels region containing a
>simple vector addition loop, and tried to make auto-parallelization
>work.
>
>The first problem I ran into was that the parloops pass failed to
>analyze the dependencies in a vector addition example, due to the fact
>that the region was already split off into a separate function, similar
>to PR46032.
>
>I looked briefly into the patch set in PR46032, but I realized that
>even if I fix it, the next problem I run into will be that the parloops
>pass is run after the lto stream read/write point. So any changes the
>parloops pass makes at that point are in the accelerator compile flow,
>in other words we're talking about launching an accelerator kernel from
>the accelerator. While that is possible with recent cuda accelerators,
>I guess in general we should not expect that to be possible.
>[ I also thought of a fancy scheme where we don't split off a new
>function, but manipulate the body of the already split off function,
>and emit a C file from the accelerator compiler containing the
>parameters that the host compiler should use to launch the accelerator
>kernel... but I guess that would be a last resort. ]
>
>So in order to solve the lto stream read/write point problem, I moved
>the parloops pass (well, a copy called pass_oacc_par or similar) up in
>the pass list, to before the lto stream read/write point. That precludes
>solving the alias problem with the PR46032 patch set, since we need ipa
>for that.
>
>I solved (well, rather prevented) the alias problem by disabling
>pass_omp_expand for GIMPLE_OACC_KERNELS, in other words disabling the
>function-split-off in pass_omp_expand and letting pass_oacc_par take
>care of that (This is what I meant with: 'postpone splitting of the
>function until the end of pass_oacc_par').
>Doing so required me to write a patch to handle omp-lowered code
>conservatively in ccp and forwprop, otherwise the 'rewrite region in
>terms of rewrite vars' would be undone by the time we arrive at
>pass_oacc_par.
>
>>>> Some advice on how to continue from here would be *highly*
>>>> appreciated. My hunch atm is to investigate the last option.
>>>>
>>>
>>> Jakub,
>>> Richard,
>>>
>>> I've investigated the last option, and published the current state
>>> in git-only branch vries/oacc-kernels (
>>> https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels
>>> ).
>>>
>>> The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35
>>> is that:
>>> - a simple loop marked with the oacc kernels directive is analyzed
>>>   for parallelization,
>>> - the loop is then rewritten using oacc parallel and oacc loop
>>>   directives
>>> - these oacc directives are expanded using omp_expand_local
>>> - this results in the loop being split off into a separate function,
>>>   while the loop is replaced with a GOACC_parallel call
>>> - all this is done before writing out the lto stream
>>> - no support yet for reductions, nested loops, more than one loop
>>>   nest in kernels region
>>>
>>> At toplevel, the added pass list looks like this:
>>> ...
>>>           NEXT_PASS (pass_build_ealias);
>>>           /* Pass group that runs when there are oacc kernels in the
>>>              function.  */
>>
>> Not sure why pass_oacc_kernels runs before all the other local
>> cleanups?  I would have put it after pass_cd_dce at least.
>>
>
>My focus was on running pass_oacc_kernels ASAP, in order not to have to
>adapt more passes to leave the omp-lowered code alone. I'll give your
>suggestion a try.
>
>>>           NEXT_PASS (pass_oacc_kernels);
>>>           PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
>>>               NEXT_PASS (pass_ch_oacc_kernels);
>>>               NEXT_PASS (pass_tree_loop_init);
>>>               NEXT_PASS (pass_lim);
>>>               NEXT_PASS (pass_ccp);
>>>               NEXT_PASS (pass_parallelize_loops_oacc_kernels);
>>>               NEXT_PASS (pass_tree_loop_done);
>>>           POP_INSERT_PASSES ()
>>> ...
>>>
>>> The main question I'm currently facing is the following: when to do
>>> lowering (in other words, rewriting of variable access in terms of
>>> .omp_data) of the kernels region. There are basically 2 passes that
>>> contain code to do this:
>>> - pass_lower_omp (on pre-ssa code)
>>> - pass_parallelize_loops (on ssa code)
>>
>> Both use the same utilities.
>>
>
>I think you mean that both passes use the same utilities to do
>omp-expand (in other words, pass_parallelize_loops uses
>omp_expand_local).
>But AFAIU, the omp-lowering in pass_parallelize_loops (in particular,
>the rewrite of the region in terms of rewrite vars) shares no code with
>the omp pass.
>
>>> Atm I'm using pass_lower_omp, and I've added a patch that handles
>>> omp-lowered code conservatively in ccp and forwprop in order for the
>>> lowering to remain until arriving at
>>> pass_parallelize_loops_oacc_kernels.
>>
>> You mean omp-_un_-lowered code?
>>
>
>No, I mean pass_omp_lower lowers the code into omp-lowered code, and
>the patch in question prevents ccp and forwprop from undoing the
>lowering before arriving at the point where we split off the function.
>
>>> But it might turn out to be easier/necessary to handle this in
>>> pass_parallelize_loops_oacc_kernels instead.
>>
>> I'd do it similar to how autopar does it
>
>OK, I'll try then to do the lowering for the kernels region in
>pass_parallelize_loops_oacc_kernels, not in pass_omp_lower.
>
>FWIW, I'm looking now into reductions, and started thinking in the same
>direction.
>
>> (not that autopar is a great
>> example for a GCC pass these days...).
>>
>
>For my understanding, could you briefly elaborate on that (or give a
>reference to an earlier discussion)?
>
>Thanks,
>- Tom
>
>> Richard.
>>
>>> Any advice on this issue, and on the current implementation is
>>> welcome.
>>>
>>> Thanks,
>>> - Tom
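
To make the lowering steps and the aliasing problem discussed in this thread concrete, here is a minimal hand-written C sketch. It is not actual GCC output and not part of the original exchange. It mimics what the thread describes for a vector addition region: the variables are packed into a record (standing in for .omp_data_t), the loop body lives in a separate function taking a pointer to that record, and the original site is replaced by a call that passes the function pointer. The names omp_data, region_fn, launch_region, vec_a/vec_b/vec_c and the field layout are assumptions made for illustration; launch_region is a host-only stand-in for the GOACC_kernels call and simply invokes the function.

```c
#include <assert.h>
#include <stddef.h>

#define N 8

/* Three distinct arrays.  Before the split, a compiler can see that
   these objects never overlap.  */
float vec_a[N], vec_b[N], vec_c[N];

/* Hypothetical stand-in for the ".omp_data_t" record of step 1.  */
struct omp_data {
    float *a;
    float *b;
    float *c;
    size_t n;
};

/* Stand-in for the function split off in step 3.  The loop now only
   sees pointers loaded from the record, not the arrays themselves.  */
void region_fn(void *arg)
{
    struct omp_data *d = arg;
    assert(d != NULL);
    for (size_t i = 0; i < d->n; i++)
        d->c[i] = d->a[i] + d->b[i];
}

/* Hypothetical host-only stand-in for the runtime entry point
   (GOACC_kernels in the thread): it just calls the function.  */
void launch_region(void (*fn)(void *), void *data)
{
    fn(data);
}

/* What is left at the original site after the split: fill in the
   record and hand the function pointer to the launcher.  */
void vec_add(void)
{
    struct omp_data d = { vec_a, vec_b, vec_c, N };
    launch_region(region_fn, &d);
}
```

In vec_add the three arrays are visibly distinct objects, whereas region_fn sees only pointers loaded from the record, so without further (for example interprocedural) points-to information a compiler may have to assume they overlap. That is one way to picture the additional aliasing constraints mentioned for PR46032.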