On 20/01/16 09:54, Thomas Schwinge wrote:
Hi!

On Mon, 18 Jan 2016 14:07:11 +0100, Tom de Vries <tom_devr...@mentor.com> wrote:
Add oacc_kernels_p argument to pass_parallelize_loops

--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c

@@ -2315,6 +2367,9 @@ gen_parallel_loop (struct loop *loop,

|   /* Ensure that the exit condition is the first statement in the loop.
|      The common case is that latch of the loop is empty (apart from the
|      increment) and immediately follows the loop exit test.  Attempt to move the
|      entry of the loop directly before the exit check and increase the number of
|      iterations of the loop by one.  */
|   if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
|     {
|       if (dump_file
|           && (dump_flags & TDF_DETAILS))
|         fprintf (dump_file,
|                  "alternative exit-first loop transform succeeded"
|                  " for loop %d\n", loop->num);
|     }
|   else
|     {
+      if (oacc_kernels_p)
+       n_threads = 1;
+
|       /* Fall back on the method that handles more cases, but duplicates the
|        loop body: move the exit condition of LOOP to the beginning of its
|        header, and duplicate the part of the last iteration that gets disabled
|        to the exit of the loop.  */
|       transform_to_exit_first_loop (loop, reduction_list, nit);
|     }

Just for my own education: this pessimization "n_threads = 1" for OpenACC
kernels is because the duplicated loop bodies generated by
transform_to_exit_first_loop are not appropriate for parallel OpenACC
offloading execution?

In the case of standard parloops, only the loop is executed in parallel, so the duplicated loop body is outside the parallel region.

In the case of oacc parloops, the duplicated body is included in the kernels region, and executed in parallel.

The duplicated body for the last iteration can be executed in parallel with the loop body in the loop for all the other iterations. We've done the dependency analysis for that.

But the duplicated loop body for the last iteration is now executed in parallel with itself as well. We've got code that deals with that by guarding the side-effects such that they're only executed for a single gang. But that code is atm only effective in oacc_entry_exit_ok, which runs before transform_to_exit_first_loop introduces the duplicated loop body, so the duplicated body is not guarded.

(Might add a source code comment here?)  Testing
on gomp-4_0-branch, there are no changes in the testsuite if I remove
this hunk.

If you want to see the effect of removing the 'n_threads = 1' hunk, make try_transform_to_exit_first_loop_alt always return false.

I expect a loop
  for (i = 0; i < N; ++i)
    a[i] = a[i] + 1;
would give incorrect results in a[N - 1].
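
To make that concrete, here is a rough sequential model in plain C (not OpenACC, and not what parloops actually emits). It assumes the fallback transform peels the last iteration into a duplicated body, and that every gang executing the kernels region runs that duplicated body. The names run_peeled and NUM_GANGS are made up for the illustration. On a real device the unguarded updates would race, so the exact wrong value is not predictable; the model just serializes them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of gangs executing the kernels region.  */
#define NUM_GANGS 4

/* Sequential model of the loop after the fallback exit-first transform:
   iterations 0 .. n-2 stay in the loop, the last iteration is peeled
   into a duplicated body.  GANGS_IN_PEELED_BODY is the number of gangs
   that execute the duplicated body (1 models the n_threads = 1 case or
   a single-gang guard).  */
static void
run_peeled (int *a, size_t n, int gangs_in_peeled_body)
{
  assert (n > 0);

  /* Loop part: each iteration is executed exactly once, by whichever
     gang it was assigned to.  */
  for (size_t i = 0; i + 1 < n; ++i)
    a[i] = a[i] + 1;

  /* Duplicated body for the last iteration: executed once per gang
     that reaches it, so the side effect is repeated.  */
  for (int g = 0; g < gangs_in_peeled_body; ++g)
    a[n - 1] = a[n - 1] + 1;
}
```

With a zero-initialized array, run_peeled (a, n, NUM_GANGS) leaves a[n - 1] at NUM_GANGS instead of 1 in this model, while run_peeled (a, n, 1) gives 1 everywhere, which is the result the 'n_threads = 1' hunk is meant to preserve.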

Thanks,
- Tom
