Re: [committed] Add oacc_kernels_p argument to pass_parallelize_loops

2016-01-20 Thread Thomas Schwinge
Hi!

On Mon, 18 Jan 2016 14:07:11 +0100, Tom de Vries  wrote:
> Add oacc_kernels_p argument to pass_parallelize_loops

> --- a/gcc/tree-parloops.c
> +++ b/gcc/tree-parloops.c

> @@ -2315,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,

|   /* Ensure that the exit condition is the first statement in the loop.
|  The common case is that latch of the loop is empty (apart from the
|  increment) and immediately follows the loop exit test.  Attempt to move 
the
|  entry of the loop directly before the exit check and increase the number 
of
|  iterations of the loop by one.  */
|   if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
| {
|   if (dump_file
| && (dump_flags & TDF_DETAILS))
|   fprintf (dump_file,
|"alternative exit-first loop transform succeeded"
|" for loop %d\n", loop->num);
| }
|   else
| {
> +  if (oacc_kernels_p)
> + n_threads = 1;
> +
|   /* Fall back on the method that handles more cases, but duplicates the
|loop body: move the exit condition of LOOP to the beginning of its
|header, and duplicate the part of the last iteration that gets disabled
|to the exit of the loop.  */
|   transform_to_exit_first_loop (loop, reduction_list, nit);
| }

Just for my own education: this pessimization "n_threads = 1" for OpenACC
kernels is because the duplicated loop bodies generated by
transform_to_exit_first_loop are not appropriate for parallel OpenACC
offloading execution?  (Might add a source code comment here?)  Testing
on gomp-4_0-branch, there are no changes in the testsuite if I remove
this hunk.


Grüße
 Thomas


signature.asc
Description: PGP signature


Re: [committed] Add oacc_kernels_p argument to pass_parallelize_loops

2016-01-20 Thread Tom de Vries

On 20/01/16 09:54, Thomas Schwinge wrote:

Hi!

On Mon, 18 Jan 2016 14:07:11 +0100, Tom de Vries  wrote:

Add oacc_kernels_p argument to pass_parallelize_loops



--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c



@@ -2315,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,


|   /* Ensure that the exit condition is the first statement in the loop.
|  The common case is that latch of the loop is empty (apart from the
|  increment) and immediately follows the loop exit test.  Attempt to move 
the
|  entry of the loop directly before the exit check and increase the number 
of
|  iterations of the loop by one.  */
|   if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
| {
|   if (dump_file
| && (dump_flags & TDF_DETAILS))
|   fprintf (dump_file,
|"alternative exit-first loop transform succeeded"
|" for loop %d\n", loop->num);
| }
|   else
| {

+  if (oacc_kernels_p)
+   n_threads = 1;
+

|   /* Fall back on the method that handles more cases, but duplicates the
|loop body: move the exit condition of LOOP to the beginning of its
|header, and duplicate the part of the last iteration that gets disabled
|to the exit of the loop.  */
|   transform_to_exit_first_loop (loop, reduction_list, nit);
| }

Just for my own education: this pessimization "n_threads = 1" for OpenACC
kernels is because the duplicated loop bodies generated by
transform_to_exit_first_loop are not appropriate for parallel OpenACC
offloading execution?


In the case of standard parloops, only the loop is executed in parallel, 
so the duplicated loop body is outside the parallel region.


In the case of oacc parloops, the duplicated body is included in the 
kernels region, and executed in parallel.


The duplicated body for the last iteration can be executed in parallel 
with the loop body in the loop for all the other iterations. We've done 
the dependency analysis for that.


But the duplicated loop body for the last iteration is now executed in 
parallel with itself as well. We've got code that deals with that by 
guarding the side-effects such that they're only executed for a single 
gang. But that code is atm only effective in oacc_entry_exit_ok, before 
transform_to_exit_first_loop_alt introduces the duplicated loop body.



(Might add a source code comment here?)  Testing
on gomp-4_0-branch, there are no changes in the testsuite if I remove
this hunk.


If you want to see the effect of removing the 'n_threads = 1' hunk, make 
try_transform_to_exit_first_loop_alt always return false.


I expect a loop
  for (i = 0; i < N; ++i)
a[i] = a[i] + 1;
would give incorrect results in a[N - 1].

Thanks,
- Tom


[committed] Add oacc_kernels_p argument to pass_parallelize_loops

2016-01-18 Thread Tom de Vries

[was: Re: [PIING][PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels ]

On 14/12/15 16:22, Richard Biener wrote:

On Sun, Dec 13, 2015 at 5:58 PM, Tom de Vries  wrote:

On 24/11/15 13:24, Tom de Vries wrote:


On 16/11/15 12:59, Tom de Vries wrote:


On 09/11/15 20:52, Tom de Vries wrote:


On 09/11/15 16:35, Tom de Vries wrote:


Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.

The patch series contains these patches:

   1Insert new exit block only when needed in
  transform_to_exit_first_loop_alt
   2Make create_parallel_loop return void
   3Ignore reduction clause on kernels directive
   4Implement -foffload-alias
   5Add in_oacc_kernels_region in struct loop
   6Add pass_oacc_kernels
   7Add pass_dominator_oacc_kernels
   8Add pass_ch_oacc_kernels
   9Add pass_parallelize_loops_oacc_kernels
  10Add pass_oacc_kernels pass group in passes.def
  11Update testcases after adding kernels pass group
  12Handle acc loop directive
  13Add c-c++-common/goacc/kernels-*.c
  14Add gfortran.dg/goacc/kernels-*.f95
  15Add libgomp.oacc-c-c++-common/kernels-*.c
  16Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Build and reg-tested with nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.



This patch adds pass_parallelize_loops_oacc_kernels.

There's a number of things we do differently in parloops for oacc
kernels:
- in normal parloops, we generate code to choose between a parallel
version of the loop, and a sequential (low iteration count) version.
Since the code in oacc kernels region is supposed to run on the
accelerator anyway, we skip this check, and don't add a low iteration
count loop.
- in normal parloops, we generate an #pragma omp parallel /
GIMPLE_OMP_RETURN pair to delimit the region which will we split off
into a thread function. Since the oacc kernels region is already
split off, we don't add this pair.
- we indicate the parallelization factor by setting the oacc function
attributes
- we generate an #pragma oacc loop instead of an #pragma omp for, and
we add the gang clause
- in normal parloops, we rewrite the variable accesses in the loop in
terms into accesses relative to a thread function parameter. For the
oacc kernels region, that rewrite has already been done at omp-lower,
so we skip this.
- we need to ensure that the entire kernels region can be run in
parallel. The loop independence check is already present, so for oacc
kernels we add a check between blocks outside the loop and the entire
region.
- we guard stores in the blocks outside the loop with gang_pos == 0.
There's no need for each gang to write to a single location, we can
do this in just one gang. (Typically this is the write of the final
value of the iteration variable if that one is copied back to the
host).



Reposting with loop optimizer init added in
pass_parallelize_loops_oacc_kernels::execute.



Reposting with loop_optimizer_finalize,scev_initialize and scev_finalize
   added in pass_parallelize_loops_oacc_kernels::execute.



Ping.

Anything I can do to facilitate the review?


Document new functions.


Done.

avoid if (1).

Done.


Ideally some refactoring would avoid some of the if (!oacc_kernels_p) spaghetti


Ack. For now, i've tried to minimize the number of oacc_kernels_p tests 
in the code.


Further suggestions on how to improve here are much appreciated.


but I'm considering tree-parloops.c (and its bugs) yours.


Ack.


Can the pass not just use a pass parameter to switch between oacc/non-oacc?



This patch introduces the pass parameter oacc_kernels_p (but does not 
instantiate an oacc_kernels_p == true pass version yet).


Bootstrapped and reg-tested on x86_64.

Committed to trunk.

Thanks,
- Tom

Add oacc_kernels_p argument to pass_parallelize_loops

2015-11-09  Tom de Vries  

	* omp-low.c (set_oacc_fn_attrib): Make extern.
	* omp-low.h (set_oacc_fn_attrib): Declare.
	* tree-parloops.c (struct reduction_info): Add reduc_addr field.
	(create_call_for_reduction_1): Handle case that reduc_addr is non-NULL.
	(create_parallel_loop, gen_parallel_loop, try_create_reduction_list):
	Add and handle function parameter oacc_kernels_p.
	(find_reduc_addr, get_omp_data_i_param): New function.
	(ref_conflicts_with_region, oacc_entry_exit_ok_1)
	(oacc_entry_exit_single_gang, oacc_entry_exit_ok): New function.
	(parallelize_loops): Add and handle