On 06/29/2016 07:11 AM, Thomas Schwinge wrote: > Cesar, I have not yet fully digested this, but do I understand right that > you're really fixing two issues here, that are related (OpenACC routines) > but still can be addressed independently of each other? Do I understand > right that the first one, the "problems with acc routines [...] > incorrectly permitting 'acc seq' loops to call gang, worker and vector > routines" is just a Fortran front end patch? If yes, please split that > one out, so as to reduce the volume of remaining changes that remain to > be discussed.
This patch addresses the following issues: 1. Issues warnings when a non-acc routine function is called inside an OpenACC offloaded region. 2. It corrects a bug what was allowing seq loops to call gang, worker and vector routines. 3. It adds supports for acc routines in fortran modules (which I noticed was missing when I added 'acc routine seq' to acc_on_device in the fortran openacc include files). I'll split these into separate patches. > On Thu, 23 Jun 2016 09:05:38 -0700, Cesar Philippidis > <ce...@codesourcery.com> wrote: >> On 06/17/2016 07:42 AM, Jakub Jelinek wrote: >>> On Wed, Jun 15, 2016 at 08:12:15PM -0700, Cesar Philippidis wrote: >>>> The second set of changes involves teaching the gimplifier to error when >>>> it detects a function call to an non-acc routines inside an OpenACC >>>> offloaded region. > > As I understand, that's the same problem as has been discussed before > (Ilya CCed), and has recently again been filed in > <https://gcc.gnu.org/PR71499> "ICE in LTO1 when attempting NVPTX > offloading (-fopenacc)", and <https://gcc.gnu.org/PR71535> "ICE in LTO1 > with -fopenmp offloading" (Alexander CCed). Some earlier discussion > threads include: > <http://news.gmane.org/find-root.php?message_id=%3C20150109145702.GA45210%40msticlxl57.ims.intel.com%3E>, > <http://news.gmane.org/find-root.php?message_id=%3C20150724152119.GA41292%40msticlxl57.ims.intel.com%3E>, > <http://news.gmane.org/find-root.php?message_id=%3C56269C05.6030502%40acm.org%3E>. > >>>> Actually, I relaxed non-acc routines by excluding >>>> calls to builtin functions, including those prefixed with _gfortran_. >>>> Nvptx does have a newlib c library, and it also has a subset of >>>> libgfortran. Still, this solution is probably not optimal. >>> >>> I don't really like that, hardcoding prefixes or whatever is available >>> (you have quite some subset of libc, libm etc. available too) in the >>> compiler looks very hackish. What is wrong with complaining during >>> linking of the offloaded code? > > ACK. Jakub, do I understand you correctly, that you basically say that > every function declaration that is in scope inside offloaded regions (for > example, GCC builtin functions, or standard library functions declared in > target compiler's header files) is permitted to be called in offloaded > regions, and the offloading compiler will then either be able to resolve > these (nvptx back end knows about trigonometric functions, for example, > and a lot of functions are available in the nvptx libc), or otherwise > error out during the offloading compilation (during linking), gracefully > without terminating the target compilation (that "gracefully" bit is > currently missing -- that's for another day). That is, all such > functions are implicitly callable as OpenACC "seq" functions (which means > that they don't internally use gang/worker/vector parallelism). In > particular, all these functions do *not* need to be marked with an > explicit "#pragma acc routine seq" directive. (Functions internally > using gang/worker/vector parallelism will need to be marked > appropriately, using a "#pragma acc routine gang/worker/vector" > directive.) That's how I understand your comment above, and your earlier > comments on this topic, and also is what I think should be done. OK. I'll drop the warning changes from my patch set then unless you want to keep it. > A few random comments on the patch: > >> --- a/gcc/fortran/gfortran.h >> +++ b/gcc/fortran/gfortran.h >> @@ -303,6 +303,15 @@ enum save_state >> { SAVE_NONE = 0, SAVE_EXPLICIT, SAVE_IMPLICIT >> }; >> >> +/* Flags to keep track of ACC routine states. */ >> +enum oacc_function >> +{ OACC_FUNCTION_NONE = 0, >> + OACC_FUNCTION_SEQ, >> + OACC_FUNCTION_GANG, >> + OACC_FUNCTION_WORKER, >> + OACC_FUNCTION_VECTOR >> +}; > > What's the purpose of OACC_FUNCTION_NONE? It's not used anywhere, as far > as I can tell? It's used by the fortran module code. It controls how parallelism gets encoded in the .mod files. >> --- a/gcc/fortran/openmp.c >> +++ b/gcc/fortran/openmp.c >> @@ -1664,21 +1664,31 @@ gfc_match_oacc_cache (void) >> >> /* Determine the loop level for a routine. */ >> >> -static int >> +static oacc_function >> gfc_oacc_routine_dims (gfc_omp_clauses *clauses) >> { >> int level = -1; >> + oacc_function ret = OACC_FUNCTION_SEQ; >> >> if (clauses) >> { >> unsigned mask = 0; >> >> if (clauses->gang) >> - level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level); >> + { >> + level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level); >> + ret = OACC_FUNCTION_GANG; >> + } >> if (clauses->worker) >> - level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level); >> + { >> + level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level); >> + ret = OACC_FUNCTION_WORKER; >> + } >> if (clauses->vector) >> - level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level); >> + { >> + level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level); >> + ret = OACC_FUNCTION_VECTOR; >> + } >> if (clauses->seq) >> level = GOMP_DIM_MAX, mask |= GOMP_DIM_MASK (level); >> > > I have not looked in detail, so maybe I'm misunderstanding what is being > done here -- but how do "clauses->seq" and "level = GOMP_DIM_MAX" fit > together? Conceptually, if you take a look at the oacc_function attribute in a tree dump, you'll see an array with three elements. Basically, each element in that array represents a gang, worker or vector parallelism. By definition, a gang loop permits a worker and vector loop to be nested inside it. So, for a gang routine, the oacc_function attribute is constructed such that it permits gang, worker and vector level parallelism. Similarly, for a worker routine, the oacc_function attribute has the worker and vector level parallelism 'bits' set. With that in mind, setting seq to GOMP_DIM_MASK allows the loop creating that oacc_function attribute to mask out any gang, worker and vector parallelism. >> @@ -1689,7 +1699,7 @@ gfc_oacc_routine_dims (gfc_omp_clauses *clauses) >> if (level < 0) >> level = GOMP_DIM_MAX; >> >> - return level; >> + return ret; >> } > > Just from that last hunk, it seems that the assignment to "level" is a > dead store? I'll need to check this when I split out the patch. >> +static tree >> +add_attributes_to_decl (symbol_attribute sym_attr, tree list) >> +{ >> + unsigned id; >> + tree attr; >> + >> + for (id = 0; id < EXT_ATTR_NUM; id++) >> + if (sym_attr.ext_attr & (1 << id)) >> + { >> + attr = build_tree_list ( >> + get_identifier (ext_attr_list[id].middle_end_name), >> + NULL_TREE); >> + list = chainon (list, attr); >> + } >> + >> + list = add_omp_offloading_attributes (sym_attr.omp_declare_target, >> + sym_attr.oacc_function, list); >> + >> + return list; >> +} > > Something that I had noticed before, possibly related here: code in > gcc/fortran/ does never call replace_oacc_fn_attrib, but the C and C++ > front ends do. Is that function what you've re-implemented here? Similar, but I broke this code out from another function to handle BUILT_IN_EXPECT. But I can revert this change now, since BUILT_IN_EXPECT will be treated as an implicit SEQ routine. >> --- a/gcc/lto-cgraph.c >> +++ b/gcc/lto-cgraph.c >> @@ -1201,9 +1201,11 @@ input_overwrite_node (struct lto_file_decl_data >> *file_data, >> LDPR_NUM_KNOWN); >> node->instrumentation_clone = bp_unpack_value (bp, 1); >> node->split_part = bp_unpack_value (bp, 1); >> - gcc_assert (flag_ltrans >> - || (!node->in_other_partition >> - && !node->used_from_other_partition)); >> + >> + int success = flag_ltrans || (!node->in_other_partition >> + && !node->used_from_other_partition); >> + if (!success) >> + error ("Missing %<%s%>", node->name ()); >> } >> >> /* Return string alias is alias of. */ >> @@ -1416,9 +1418,11 @@ input_varpool_node (struct lto_file_decl_data >> *file_data, >> node->set_section_for_node (section); >> node->resolution = streamer_read_enum (ib, ld_plugin_symbol_resolution, >> LDPR_NUM_KNOWN); >> - gcc_assert (flag_ltrans >> - || (!node->in_other_partition >> - && !node->used_from_other_partition)); >> + >> + int success = flag_ltrans || (!node->in_other_partition >> + && !node->used_from_other_partition); >> + if (!success) >> + error ("Missing %<%s%>", node->name ()); >> >> return node; >> } > > That looks similar to what I remember from earlier, simiar patches, as > referenced above. It is. I never got around to pushing that patch very strongly because I thought those link failures were legitimate compiler bugs. >> @@ -19420,7 +19442,8 @@ oacc_loop_fixed_partitions (oacc_loop *loop, >> unsigned outer_mask) >> { >> unsigned outermost = this_mask & -this_mask; >> >> - if (outermost && outermost <= outer_mask) >> + if ((outermost && outermost <= outer_mask) >> + || (this_mask && (loop->parent->flags & OLF_SEQ))) >> { >> if (noisy) >> { > >> --- a/gcc/testsuite/c-c++-common/goacc/routine-3.c >> +++ b/gcc/testsuite/c-c++-common/goacc/routine-3.c >> @@ -49,7 +49,7 @@ main () >> int red = 0; >> #pragma acc parallel copy (red) >> { >> - /* Independent/seq loop tests. */ >> + /* Independent loop tests. */ >> #pragma acc loop reduction (+:red) // { dg-warning "insufficient >> partitioning" } >> for (int i = 0; i < 10; i++) >> red += gang (); >> @@ -62,6 +62,19 @@ main () >> for (int i = 0; i < 10; i++) >> red += vector (); >> >> + /* Seq loop tests. */ >> +#pragma acc loop seq reduction (+:red) /* { dg-message "containing loop" } >> */ >> + for (int i = 0; i < 10; i++) >> + red += gang (); /* { dg-error "incorrectly nested" } */ >> + >> +#pragma acc loop seq reduction (+:red) /* { dg-message "containing loop" } >> */ >> + for (int i = 0; i < 10; i++) >> + red += worker (); /* { dg-error "incorrectly nested" } */ >> + >> +#pragma acc loop seq reduction (+:red) /* { dg-message "containing loop" } >> */ >> + for (int i = 0; i < 10; i++) >> + red += vector (); /* { dg-error "incorrectly nested" } */ >> + >> /* Gang routine tests. */ >> #pragma acc loop gang reduction (+:red) /* { dg-message "containing loop" >> } */ >> for (int i = 0; i < 10; i++) > > Do these test case changes actually relate to any of the compiler changes > discussed above? Maybe to the oacc_loop_fixed_partitions cited just > above? Is that a separate issue to fix? Eh, or is that actually the fix > for your first issue, the "problems with acc routines [...] incorrectly > permitting 'acc seq' loops to call gang, worker and vector routines"? This is issue 2, and I'll break it out into a separate patch. >> --- >> a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c >> +++ >> b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels-ipa-pta.c >> @@ -1,4 +1,4 @@ >> /* { dg-do run { target lto } } */ >> -/* { dg-additional-options "-fipa-pta -flto -flto-partition=max" } */ >> +/* { dg-additional-options "-fipa-pta -flto -flto-partition=max >> -fno-exceptions" } */ >> >> #include "data-clauses-kernels.c" > >> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels.c >> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-kernels.c >> @@ -1,2 +1,4 @@ >> +/* { dg-additional-options "-fno-exceptions" } */ >> + >> #define CONSTRUCT kernels >> #include "data-clauses.h" > >> --- >> a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c >> +++ >> b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel-ipa-pta.c >> @@ -1,4 +1,4 @@ >> /* { dg-do run { target lto } } */ >> -/* { dg-additional-options "-fipa-pta -flto -flto-partition=max" } */ >> +/* { dg-additional-options "-fipa-pta -flto -flto-partition=max >> -fno-exceptions" } */ >> >> #include "data-clauses-parallel.c" > >> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel.c >> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-clauses-parallel.c >> @@ -1,2 +1,4 @@ >> +/* { dg-additional-options "-fno-exceptions" } */ >> + >> #define CONSTRUCT parallel >> #include "data-clauses.h" > > Hmm? I'm not sure what happened here either. Maybe adding the 'acc routine' directive to acc_on_device is preventing that function from expanding to its builtin function counterpart, which caused gcc to generate exception code? Cesar