[PATCH, OpenACC] (1/2) Fix implicit mapping for array slices on lexically-enclosing data constructs (PR70828)
This patch implements support for array slices (with a non-zero base element) declared on OpenACC data constructs. Any lexically-enclosed parallel or kernels regions should "inherit" such mappings, e.g. if we have:

  #pragma acc data copy(arr[10:20])
  {
    #pragma acc parallel loop
    for (...)
      {
        ...arr[X]...
      }
  }

the mapping for "arr" on the data construct takes precedence over the default mapping behaviour for the parallel construct, which is to map the whole array. (OpenACC 2.5, "2.5.1. Parallel Construct" and elsewhere).

Tested with offloading to nvptx. (This patch differs in implementation somewhat from the version on the gomp4, etc. branches.)

OK to apply?

Thanks,

Julian

2018-08-28  Julian Brown
            Cesar Philippidis

    PR middle-end/70828

    gcc/
    * gimplify.c (gimplify_omp_ctx): Add decl_data_clause hash map.
    (new_omp_context): Initialise above.
    (delete_omp_context): Delete above.
    (gimplify_scan_omp_clauses): Scan for array mappings on data
    constructs, and record in above map.
    (gomp_needs_data_present): New function.
    (gimplify_adjust_omp_clauses_1): Handle data mappings (e.g. array
    slices) declared in lexically-enclosing data constructs.
    * omp-low.c (lower_omp_target): Allow decl for bias not to be
    present in omp context.

    gcc/testsuite/
    * c-c++-common/goacc/acc-data-chain.c: New test.
    * gfortran.dg/goacc/pr70828.f90: New test.
    * gfortran.dg/goacc/pr70828-2.f90: New test.

    libgomp/
    * testsuite/libgomp.oacc-c-c++-common/pr70828.c: New test.
    * testsuite/libgomp.oacc-fortran/implicit_copy.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828-2.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828-3.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828-5.f90: New test.

From 9123c4ddd701c40c3e85a0c6cd327066542b9e7a Mon Sep 17 00:00:00 2001
From: Julian Brown
Date: Thu, 16 Aug 2018 20:02:10 -0700
Subject: [PATCH 1/2] Inheritance of array sections on data constructs.

2018-08-28  Julian Brown
            Cesar Philippidis

    gcc/
    * gimplify.c (gimplify_omp_ctx): Add decl_data_clause hash map.
    (new_omp_context): Initialise above.
    (delete_omp_context): Delete above.
    (gimplify_scan_omp_clauses): Scan for array mappings on data
    constructs, and record in above map.
    (gomp_needs_data_present): New function.
    (gimplify_adjust_omp_clauses_1): Handle data mappings (e.g. array
    slices) declared in lexically-enclosing data constructs.
    * omp-low.c (lower_omp_target): Allow decl for bias not to be
    present in omp context.

    gcc/testsuite/
    * c-c++-common/goacc/acc-data-chain.c: New test.
    * gfortran.dg/goacc/pr70828.f90: New test.
    * gfortran.dg/goacc/pr70828-2.f90: New test.

    libgomp/
    * testsuite/libgomp.oacc-c-c++-common/pr70828.c: New test.
    * testsuite/libgomp.oacc-fortran/implicit_copy.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828-2.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828-3.f90: New test.
    * testsuite/libgomp.oacc-fortran/pr70828-5.f90: New test.
---
 gcc/gimplify.c | 97 +-
 gcc/omp-low.c | 7 +-
 gcc/testsuite/c-c++-common/goacc/acc-data-chain.c | 24 ++
 gcc/testsuite/gfortran.dg/goacc/pr70828.f90 | 22 +
 .../libgomp.oacc-c-c++-common/pr70828-2.c | 34
 .../testsuite/libgomp.oacc-c-c++-common/pr70828.c | 27 ++
 .../libgomp.oacc-fortran/implicit_copy.f90 | 30 +++
 .../testsuite/libgomp.oacc-fortran/pr70828-2.f90 | 31 +++
 .../testsuite/libgomp.oacc-fortran/pr70828-3.f90 | 34
 .../testsuite/libgomp.oacc-fortran/pr70828-5.f90 | 29 +++
 libgomp/testsuite/libgomp.oacc-fortran/pr70828.f90 | 24 ++
 11 files changed, 354 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/goacc/acc-data-chain.c
 create mode 100644 gcc/testsuite/gfortran.dg/goacc/pr70828.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr70828-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr70828.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/implicit_copy.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/pr70828-2.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/pr70828-3.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/pr70828-5.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/pr70828.f90

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index dbd0f0e..d704aef 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -191,6 +191,7 @@ struct gimplify_omp_ctx
   bool target_map_scalars_firstprivate;
   bool target_map_pointers_as_0len_arrays;
   bool target_firstprivatize_array_bases
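
For readers following the thread, here is a minimal sketch of the kind of program the PR70828 message above is about. It is an illustration only, not one of the tests from the patch: the function name sum_after_scaling and the array size are made up for this example. The data construct maps just arr[10:20] (start 10, length 20), and the enclosed parallel loop carries no data clause of its own, so it relies on inheriting that mapping rather than implicitly mapping the whole array. Without -fopenacc the pragmas are ignored and the code runs on the host with the same result.

```c
#include <assert.h>

#define N 64

/* Hypothetical example, not taken from the patch's testsuite.  */
int sum_after_scaling(void)
{
    int arr[N];

    for (int i = 0; i < N; i++)
        arr[i] = i;

    /* Map only elements 10..29: base element 10, length 20.  */
#pragma acc data copy(arr[10:20])
    {
        /* No data clause for arr here.  The intent described in the
           message is that this region reuses the enclosing slice
           mapping instead of mapping all of arr.  */
#pragma acc parallel loop
        for (int i = 10; i < 30; i++)
            arr[i] *= 2;
    }

    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += arr[i];

    /* Elements outside the slice must be untouched.  */
    assert(arr[9] == 9 && arr[30] == 30);
    return sum;
}
```

The loop deliberately stays inside the mapped slice; indexing arr outside 10..29 within the parallel region would refer to data that was never mapped.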
Re: [PATCH, OpenACC] Add support for gang local storage allocation in shared memory
On Wed, 15 Aug 2018 21:56:54 +0200 Bernhard Reutner-Fischer wrote: > On 15 August 2018 18:46:37 CEST, Julian Brown > wrote: > >On Mon, 13 Aug 2018 12:06:21 -0700 > >Cesar Philippidis wrote: > > atttribute has more t than strictly necessary. > Don't like signed integer levels where they should be some unsigned. > Also don't like single switch cases instead of if. > And omitting function comments even if the hook way above is > documented may be ok ish but is a bit lazy ;) Here's a new version with those comments addressed. I also changed the logic around a little to avoid adding decls to the vec in omp_context which would never be given the gang-private attribute. Re-tested with offloading to NVPTX. OK? Julian 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" attribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_addressable_var_decls fields. (new_omp_context): Initialize oacc_addressable_var_decls in new omp_context. (delete_omp_context): Delete oacc_addressable_var_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. 
Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. commit e276442550a85b62866ba13890eacf4e946d1079 Author: Julian Brown Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" attribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_addressable_var_decls fields. (new_omp_context): Initialize oacc_addressable_var_decls in new omp_context. (delete_omp_context): Delete oacc_addressable_var_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. 
(lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++
Re: [PATCH, OpenACC] Add support for gang local storage allocation in shared memory
On Mon, 13 Aug 2018 12:06:21 -0700 Cesar Philippidis wrote: > So in other words, this is safe for fortran. It probably could use a > fortran test, because that functionality wasn't explicitly exercised > in og7/og8. Here's a new version of the patch with a Fortran test case. It's not too easy to write a test that depends on whether gang-local variables actually end up in the right kind of memory, so I wrote one that scans the omplower dump instead. Many other (including execution) tests will already trigger the new behaviour. Tested with offloading to NVPTX. OK? Thanks, Julian 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. 
Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. commit b73428237720be8d5b6e793f8615204356336d30 Author: Julian Brown Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. 
Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx
Re: [PATCH, OpenACC] Add support for gang local storage allocation in shared memory
On Mon, 13 Aug 2018 11:42:26 -0700 Cesar Philippidis wrote: > On 08/13/2018 09:21 AM, Julian Brown wrote: > > > diff --git > > a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c > > b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file > > mode 100644 index 000..2fa708a --- /dev/null > > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c > > @@ -0,0 +1,106 @@ > > +/* { dg-xfail-run-if "gangprivate > > failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */ > > As a quick comment, I like the approach that you've taken with this > patch, but the og8 patch only applies the gangprivate attribute in the > c/c++ FE. I'd have to review the notes, but I seem to recall that > excluding that clause in fortran was deliberate. Chung-Lin, do you > recall the rationale behind that? > > With that aside, is the above xfail still necessary? It seems to xpass > for me on nvptx. However, I see this regression on the host: > > FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/loop-gwv-2.c > -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -O2 execution test > > There could be other regressions, but I only tested the new tests > introduced by the patch so far. Oops, this was the version of the patch I meant to post (and the one I tested). The XFAIL on loop-gwv-2.c isn't necessary, plus that test needed some other fixes to make it pass for NVPTX (it was written for GCN to start with). Everything else is the same. I'll see what I can come up with for a Fortran test. Thanks, Julian commit 7834b2f0dffec3e56e510c04e1663424b778fdfb Author: Julian Brown Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. 
(nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index c0b0a2e..14eb842 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -73,6 +73,7 @@ #include "cfgloop.h" #include "fold-const.h" #include "intl.h" +#include "tree-hash-traits.h" /* This file should be included last. */ #include "target-def.h" @@ -137,6 +138,12 @@ static unsigned worker_red_size; static unsigned worker_red_align; static GTY(()) rtx worker_red_sym; +/* Shared memory block for gang-private variables. 
*/ +static unsigned gangprivate_shared_size; +static unsigned gangprivate_shared_align; +static GTY(()) rtx gangprivate_shared_sym; +static hash_map gangprivate_shared_hmap; + /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; @@ -210,6 +217,10 @@ nvptx_option_override (void) SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared"); + SET_SYMBOL_DATA_AREA (gangprivate_s
[PATCH, OpenACC] Add support for gang local storage allocation in shared memory
This patch adds support for placing gang-private variables in NVPTX per-CU shared memory. This is done by marking up addressable variables declared at the appropriate parallelism level with an attribute ("oacc gangprivate") in omp-low.c. Target-dependent code in the NVPTX backend then modifies the symbol associated with the variable at expand time via a new target hook (TARGET_GOACC_EXPAND_ACCEL_VAR) in order to place it in shared memory, which is faster to access than the ".local" memory that would otherwise be used for such variables. This has (theoretical, at least) consequences on program semantics, in that the shared memory is also statically-allocated rather than obeying stack discipline -- but you can't have recursive routine calls in OpenACC anyway, so that's no big deal. Other targets can use the same attribute in different ways, as appropriate. OK for trunk? Thanks, Julian 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. 
(oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. commit 9637e7ea887e100f35d99b8d12101f9f8a9b94e3 Author: Julian Brown Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown Chung-Lin Tang gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. 
(lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
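
As a rough illustration of what the gang-private message above is describing, here is a minimal sketch of an addressable variable declared at gang level. It is not one of the tests listed in the ChangeLog; the function name triangular and the variable names are invented for this example, and whether a particular compiler build actually places the variable in NVPTX shared memory is an assumption that depends on the patch being applied. Without -fopenacc the pragma is ignored and the code runs sequentially on the host.

```c
#include <assert.h>

/* Hypothetical example: out[i] receives 0 + 1 + ... + i.  */
void triangular(int *out, int n)
{
    assert(n >= 0);

#pragma acc parallel loop gang copyout(out[0:n])
    for (int i = 0; i < n; i++)
    {
        int total = 0;    /* declared inside the gang loop: one copy per gang */
        int *p = &total;  /* taking the address makes 'total' addressable */

        /* Plain sequential loop within one gang iteration.  */
        for (int j = 0; j <= i; j++)
            *p += j;

        out[i] = total;
    }
}
```

The address-taking is the point of the sketch: per the message, it is addressable variables at the appropriate parallelism level that receive the "oacc gangprivate" attribute.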
Re: ivopts vs. garbage collection
On Mon, 11 Jan 2016 13:51:25 -0700 Tom Tromey wrote:

> > "Michael" == Michael Matz writes:
>
> Michael> Well, that's a hack. A solution is to design something that
> Michael> works generally for garbage collected languages with such
> Michael> requirements instead of arbitrarily limiting transformations
> Michael> here and there. It could be something like the notion of
> Michael> derived pointers, where the base pointer needs to stay alive
> Michael> as long as the derived pointers are.
>
> This was done once in GCC, for the Modula 3 compiler.
> There was a paper about it, but I can't find it any more.
>
> The basic idea was to emit a description of the stack frame that their
> GC could read. They had a moving GC that could use this information
> to rewrite the frame when moving objects.

This one perhaps?

https://www.cs.purdue.edu/homes/hosking/papers/ismm06.pdf

Julian
Re: [PATCH, libgomp] Rewire OpenACC async
On Tue, 24 Nov 2015 18:27:24 +0800 Chung-Lin Tang wrote:

> Hi, this patch reworks some of the way that asynchronous copyouts are
> implemented for OpenACC in libgomp.
>
> Before this patch, we had a somewhat confusing way of implementing
> this by having two refcounts for each mapping: refcount and
> async_refcount, which I never got working again after the last wave
> of async regressions showed up.
>
> So this patch implements what I believe to be a simplification:
> async_refcount is removed, and instead of trying to queue the async
> copyouts during unmapping we actually do that during the plugin event
> handling. This requires an addition of the async stream integer as an
> argument to the register_async_cleanup plugin hook, but overall I
> think this should be more elegant than before.

This looks OK to me I think (I've only looked fairly briefly). I vaguely remember trying something along these lines in an earlier iteration of the async support -- maybe hitting problems with locking (I see you have code to mitigate problems with that, and locking generally has probably evolved a bit since I last looked at the code in detail anyway).

Can event_gc ever be called when the *device* lock is held?

I'm slightly concerned that pushing async unmapping into event_gc means that program-level semantics are deferred to the backend, which is arguably the wrong place. But then I don't understand what went wrong with the dual-refcount implementation, so maybe it's unavoidable for some reason.

HTH,

Julian
Re: [OpenACC 0/7] host_data construct
On Thu, 19 Nov 2015 16:57:23 +0100 Jakub Jelinek <ja...@redhat.com> wrote:

> If it is unclear, I think disallowing acc {parallel,kernels} inside of
> acc host_data might be too big hammer, but perhaps just erroring out
> or warning during gimplification that if you (explicitly or
> implicitly) try to map a var that is in use_device clause in some
> outer context, it is either wrong, unsupported or will not do what
> users think?

I think we can only assume that trying to map a variable declared in a surrounding use_device clause is undefined behaviour. I haven't had any response to my questions about host_data & deviceptr on the OpenACC list.

> > #pragma acc host_data use_device(x)
> > {
> >   target_primitive(x);
> >   #pragma acc parallel deviceptr(x)
> >   {
> >     ...
> >   }
> > }
>
> Is deviceptr as above meant to work? That is the OpenACC counterpart
> of is_device_ptr, right? If yes, then I'd suggest just warning if you
> try to implicitly or explicitly map something use_device in outer
> contexts, and just make sure you don't ICE on the cases where you
> warn. If the standard does not say what it means, then it is
> unspecified behavior...

A problem with deviceptr, unlike is_device_ptr, is that it turns out to be defined only to work with pointers, not arrays (OpenACC 2.0a 2.6.5.2), and there are no rules describing the latter decaying to the former. So at least if 'x' is an array, it appears the answer is "no".

So, the attached patch disallows (via raising an error):

* Variables being declared in explicit mapping clauses that are declared in enclosing host_data regions.

* Variables being implicitly used (mapped) in offloaded regions that are declared in enclosing host_data regions.

It's otherwise equivalent to the previously-posted version, but without the hacks to {maybe_,}lookup_decl_in_outer_ctx. I added checks for the above conditions during gimplification, which seemed to be about the same phase that other similar kinds of errors are diagnosed.
Tests look OK (libgomp/gcc/g++/libstdc++), and the new ones pass. OK for mainline? Thanks, Julian ChangeLog Julian Brown <jul...@codesourcery.com> Cesar Philippidis <ce...@codesourcery.com> James Norris <james_nor...@mentor.com> gcc/ * c-family/c-pragma.c (oacc_pragmas): Add PRAGMA_OACC_HOST_DATA. * c-family/c-pragma.h (pragma_kind): Add PRAGMA_OACC_HOST_DATA. (pragma_omp_clause): Add PRAGMA_OACC_CLAUSE_USE_DEVICE. * c/c-parser.c (c_parser_omp_clause_name): Add use_device support. (c_parser_oacc_clause_use_device): New function. (c_parser_oacc_all_clauses): Add use_device support. (OACC_HOST_DATA_CLAUSE_MASK): New macro. (c_parser_oacc_host_data): New function. (c_parser_omp_construct): Add host_data support. * c/c-tree.h (c_finish_oacc_host_data): Add prototype. * c/c-typeck.c (c_finish_oacc_host_data): New function. (c_finish_omp_clauses): Add use_device support. * cp/cp-tree.h (finish_oacc_host_data): Add prototype. * cp/parser.c (cp_parser_omp_clause_name): Add use_device support. (cp_parser_oacc_all_clauses): Add use_device support. (OACC_HOST_DATA_CLAUSE_MASK): New macro. (cp_parser_oacc_host_data): New function. (cp_parser_omp_construct): Add host_data support. (cp_parser_pragma): Add host_data support. * cp/semantics.c (finish_omp_clauses): Add use_device support. (finish_oacc_host_data): New function. * gimple-pretty-print.c (dump_gimple_omp_target): Add host_data support. * gimple.h (gf_mask): Add GF_OMP_TARGET_KIND_OACC_HOST_DATA. (is_gimple_omp_oacc): Add support for above. * gimplify.c (omp_region_type): Add ORT_ACC_HOST_DATA. (omp_notice_variable): Diagnose undefined implicit uses of use_device variables in offloaded regions. (gimplify_scan_omp_clauses): Add host_data, use_device support. Diagnose undefined mapping of use_device variables in OpenACC clauses. (gimplify_omp_workshare): Add host_data support. (gimplify_expr): Likewise. * omp-builtins.def (BUILT_IN_GOACC_HOST_DATA): New. 
* omp-low.c (lookup_decl_in_outer_ctx) (maybe_lookup_decl_in_outer_ctx): Add optional argument to skip host_data regions. (scan_sharing_clauses): Support use_device. (check_omp_nesting_restrictions): Support host_data. (expand_omp_target): Support host_data. (lower_omp_target): Skip over outer host_data regions when looking up decls. Support use_device. (make_gimple_omp_edges): Support host_data. * tree-nested.c (convert_nonlocal_omp_clauses): Add use_device clause. libgomp/ * oacc-parallel.c (GOACC_host_data): New function. * libgomp.map (GOACC_host_data): Add to GOACC_2.0.1. * testsuite/libgomp.oacc-c-c++-common/host_data-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/host_data-2.c: New t
Re: [OpenACC 0/7] host_data construct
On Thu, 19 Nov 2015 14:13:45 +0100 Jakub Jelinek <ja...@redhat.com> wrote:

> On Wed, Nov 18, 2015 at 12:47:47PM +0000, Julian Brown wrote:
>
> The FE/gimplifier part is okay, but I really don't like the
> omp-low.c changes, mostly the *lookup_decl_in_outer_ctx* changes.
> If I count well, we have right now 27 maybe_lookup_decl_in_outer_ctx
> callers and 7 lookup_decl_in_outer_ctx callers, you want to change
> behavior of 1 maybe_lookup_decl_in_outer_ctx and 1
> lookup_decl_in_outer_ctx. Why exactly those 2 and not the others?

The not-very-good reason is that those are merely the places that allowed the supplied examples to work, and I'm wary of changing other code that I don't understand very well.

> What are the exact rules (what does the standard say about it)?
> I'd expect that all phases (scan_sharing_clauses, lower_omp* and
> expand_omp*) should agree on the same behavior, otherwise I can't see
> how it can work properly.

OK, thanks -- as to what the standard says, it's so ill-specified in this area that nothing can be learned about the behaviour of offloaded regions within host_data constructs, and my question about that on the technical mailing list is still unanswered (actually Nathan suggested in private mail that the conservative thing to do would be to disallow offloaded regions entirely within host_data constructs, so maybe that's the way to go).

OpenMP 4.5 seems to *not* specify the skipping-over behaviour for use_device_ptr variables (p105, lines 20-23):

"The is_device_ptr clause is used to indicate that a list item is a device pointer already in the device data environment and that it should be used directly. Support for device pointers created outside of OpenMP, specifically outside of the omp_target_alloc routine and the use_device_ptr clause, is implementation defined."

That suggests that use_device_ptr is a valid way to create device pointers for use in enclosed target regions: the behaviour I assumed was wrong for OpenACC.
So I think my guess at the "most-obvious" behaviour was probably misguided anyway.

It's maybe even more complicated. Consider the example:

  char x[1024];

  #pragma acc enter data copyin(x)

  #pragma acc host_data use_device(x)
  {
    target_primitive(x);
    #pragma acc parallel present(x)      [1]
    {
      x[5] = 0;                          [2]
    }
  }

Here, the "present" clause marked [1] will fail (because 'x' is a target pointer now). If it's omitted, the array access [2] will cause an implicit present_or_copy to be used for the 'x' pointer (which again will fail, because now 'x' points to target data).

Maybe what we actually need is,

  #pragma acc host_data use_device(x)
  {
    target_primitive(x);
    #pragma acc parallel deviceptr(x)
    {
      ...
    }
  }

with the deviceptr(x) clause magically substituted in the parallel construct, but I'm struggling to see how we could justify doing that when that behaviour's not mentioned in the spec at all.

Aha, so: maybe manually using deviceptr(x) is implicitly mandatory in this situation, and missing it out should be an error? That suddenly seems to make most sense. I'll see about fixing the patch to do that.

Julian
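
For reference, the uncontroversial part of the host_data discussion above (passing the device address of a mapped array to a routine, with no offloaded region nested inside) can be sketched as follows. This is an illustration under assumptions: target_primitive is a hypothetical stand-in that only records the pointer it is given, and the function name device_address_was_passed is invented here. A real target primitive would be something like a device library call. Without -fopenacc the pragmas are ignored and 'x' is simply the host address.

```c
#include <assert.h>
#include <stddef.h>

/* Records the pointer seen by the stand-in routine below.  */
static const void *last_seen;

/* Hypothetical stand-in for a routine that expects a device address.
   It must not dereference p on the host when offloading is active.  */
static void target_primitive(const char *p)
{
    assert(p != NULL);
    last_seen = p;
}

int device_address_was_passed(void)
{
    static char x[1024];

    last_seen = NULL;

#pragma acc enter data copyin(x)

    /* Inside this region 'x' denotes the device copy's address.  */
#pragma acc host_data use_device(x)
    {
        target_primitive(x);
        /* Nesting '#pragma acc parallel' here is exactly the case whose
           semantics are being debated in this thread, so it is left out.  */
    }

#pragma acc exit data delete(x)

    return last_seen != NULL;
}
```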
Re: [OpenACC 0/7] host_data construct
On Thu, 12 Nov 2015 11:16:21 + Julian Brown <jul...@codesourcery.com> wrote: > Here's a version of the patch which (hopefully) brings OpenACC on par > with OpenMP with respect to use_device/use_device_ptr variables. The > implementation is essentially the same now for OpenACC as for OpenMP > (i.e. using mapping structures): so for now, only array or pointer > variables can be used as use_device variables. The included tests have > been adjusted accordingly. Here's a rebased version of the patch, since the previous version no longer applies cleanly. Re-tested OK (libgomp tests). ChangeLog as before. (Ping.) Juliancommit 0201a5927c380da65d6400afad4a0e277fb85786 Author: Julian Brown <jul...@codesourcery.com> Date: Mon Nov 2 06:31:47 2015 -0800 OpenACC host_data support using mapping regions. diff --git a/gcc/c-family/c-pragma.c b/gcc/c-family/c-pragma.c index 12c3e75..56cf697 100644 --- a/gcc/c-family/c-pragma.c +++ b/gcc/c-family/c-pragma.c @@ -1251,6 +1251,7 @@ static const struct omp_pragma_def oacc_pragmas[] = { { "declare", PRAGMA_OACC_DECLARE }, { "enter", PRAGMA_OACC_ENTER_DATA }, { "exit", PRAGMA_OACC_EXIT_DATA }, + { "host_data", PRAGMA_OACC_HOST_DATA }, { "kernels", PRAGMA_OACC_KERNELS }, { "loop", PRAGMA_OACC_LOOP }, { "parallel", PRAGMA_OACC_PARALLEL }, diff --git a/gcc/c-family/c-pragma.h b/gcc/c-family/c-pragma.h index 999ac67..dd246b9 100644 --- a/gcc/c-family/c-pragma.h +++ b/gcc/c-family/c-pragma.h @@ -33,6 +33,7 @@ enum pragma_kind { PRAGMA_OACC_DECLARE, PRAGMA_OACC_ENTER_DATA, PRAGMA_OACC_EXIT_DATA, + PRAGMA_OACC_HOST_DATA, PRAGMA_OACC_KERNELS, PRAGMA_OACC_LOOP, PRAGMA_OACC_PARALLEL, @@ -167,6 +168,7 @@ enum pragma_omp_clause { PRAGMA_OACC_CLAUSE_SELF, PRAGMA_OACC_CLAUSE_SEQ, PRAGMA_OACC_CLAUSE_TILE, + PRAGMA_OACC_CLAUSE_USE_DEVICE, PRAGMA_OACC_CLAUSE_VECTOR, PRAGMA_OACC_CLAUSE_VECTOR_LENGTH, PRAGMA_OACC_CLAUSE_WAIT, diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c index 7b10764..0a5c8bb 100644 --- a/gcc/c/c-parser.c +++ b/gcc/c/c-parser.c @@ 
-10267,6 +10267,8 @@ c_parser_omp_clause_name (c_parser *parser) result = PRAGMA_OMP_CLAUSE_UNTIED; else if (!strcmp ("use_device_ptr", p)) result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR; + else if (!strcmp ("use_device", p)) + result = PRAGMA_OACC_CLAUSE_USE_DEVICE; break; case 'v': if (!strcmp ("vector", p)) @@ -11619,6 +11621,15 @@ c_parser_oacc_clause_tile (c_parser *parser, tree list) return c; } +/* OpenACC 2.0: + use_device ( variable-list ) */ + +static tree +c_parser_oacc_clause_use_device (c_parser *parser, tree list) +{ + return c_parser_omp_var_list_parens (parser, OMP_CLAUSE_USE_DEVICE, list); +} + /* OpenACC: wait ( int-expr-list ) */ @@ -12928,6 +12939,10 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask, clauses = c_parser_oacc_data_clause (parser, c_kind, clauses); c_name = "self"; break; + case PRAGMA_OACC_CLAUSE_USE_DEVICE: + clauses = c_parser_oacc_clause_use_device (parser, clauses); + c_name = "use_device"; + break; case PRAGMA_OACC_CLAUSE_SEQ: clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ, clauses); @@ -13577,6 +13592,29 @@ c_parser_oacc_enter_exit_data (c_parser *parser, bool enter) /* OpenACC 2.0: + # pragma acc host_data oacc-data-clause[optseq] new-line + structured-block +*/ + +#define OACC_HOST_DATA_CLAUSE_MASK \ + ( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_USE_DEVICE) ) + +static tree +c_parser_oacc_host_data (location_t loc, c_parser *parser) +{ + tree stmt, clauses, block; + + clauses = c_parser_oacc_all_clauses (parser, OACC_HOST_DATA_CLAUSE_MASK, + "#pragma acc host_data"); + + block = c_begin_omp_parallel (); + add_stmt (c_parser_omp_structured_block (parser)); + stmt = c_finish_oacc_host_data (loc, clauses, block); + return stmt; +} + + +/* OpenACC 2.0: # pragma acc loop oacc-loop-clause[optseq] new-line structured-block @@ -16884,6 +16922,9 @@ c_parser_omp_construct (c_parser *parser) case PRAGMA_OACC_DATA: stmt = c_parser_oacc_data (loc, parser); break; +case PRAGMA_OACC_HOST_DATA: + stmt = 
c_parser_oacc_host_data (loc, parser); + break; case PRAGMA_OACC_KERNELS: case PRAGMA_OACC_PARALLEL: strcpy (p_name, "#pragma acc"); diff --git a/gcc/c/c-tree.h b/gcc/c/c-tree.h index 6bc216a..848131e 100644 --- a/gcc/c/c-tree.h +++ b/gcc/c/c-tree.h @@ -653,6 +653,7 @@ extern tree c_finish_goto_ptr (location_t, tree); extern tree c_expr_to_decl (tree, bool *, bool *); extern tree c_finish_omp_construct (location_t, enum tree_code, tree, tree); extern tree c_finish_oacc_data (location_t, tree, tree); +extern tree c_finish_oacc_host_data (location_t, tree, tree); extern tree c_begi
Re: [OpenACC 0/7] host_data construct
On Mon, 2 Nov 2015 18:33:39 + Julian Brown <jul...@codesourcery.com> wrote: > On Mon, 26 Oct 2015 19:34:22 +0100 > Jakub Jelinek <ja...@redhat.com> wrote: > > > Your use_device sounds very similar to use_device_ptr clause in > > OpenMP, which is allowed on #pragma omp target data construct and is > > implemented quite a bit differently from this; it is unclear if the > > OpenACC standard requires this kind of implementation, or you just > > chose to implement it this way. In particular, the GOMP_target_data > > call puts the variables mentioned in the use_device_ptr clauses into > > the mapping structures (similarly how map clause appears) and the > > corresponding vars are privatized within the target data region > > (which is a host region, basically a fancy { } braces), where the > > private variables contain the offloading device's pointers. > > As the author of the original patch, I have to say using the mapping > structures seems like a far better approach, but I've hit some trouble > with the details of adapting OpenACC to use that method. Here's a version of the patch which (hopefully) brings OpenACC on par with OpenMP with respect to use_device/use_device_ptr variables. The implementation is essentially the same now for OpenACC as for OpenMP (i.e. using mapping structures): so for now, only array or pointer variables can be used as use_device variables. The included tests have been adjusted accordingly. One awkward part of the implementation concerns nesting offloaded regions within host_data regions: #define N 1024 int main (int argc, char* argv[]) { int x[N]; #pragma acc data copyin (x[0:N]) { int *xp; #pragma acc host_data use_device (x) { [...] 
#pragma acc parallel present (x) copyout (xp) { xp = x; } } assert (xp == acc_deviceptr (x)); } return 0; } I think the meaning of 'x' as seen within the clauses of the parallel directive should be the *host* version of x, not the mapped target address (I've asked on the OpenACC technical mailing list to clarify this point, but no reply as yet). The changes to {maybe_,}lookup_decl_in_outer_ctx "skip over" host_data contexts when called from lower_omp_target. There's probably an analogous case for OpenMP, but I've not tried to handle that. No regressions for libgomp tests, and the new tests pass. OK for trunk? Thanks, Julian ChangeLog Julian Brown <jul...@codesourcery.com> Cesar Philippidis <ce...@codesourcery.com> James Norris <james_nor...@mentor.com> gcc/ * c-family/c-pragma.c (oacc_pragmas): Add PRAGMA_OACC_HOST_DATA. * c-family/c-pragma.h (pragma_kind): Add PRAGMA_OACC_HOST_DATA. (pragma_omp_clause): Add PRAGMA_OACC_CLAUSE_USE_DEVICE. * c/c-parser.c (c_parser_omp_clause_name): Add use_device support. (c_parser_oacc_clause_use_device): New function. (c_parser_oacc_all_clauses): Add use_device support. (OACC_HOST_DATA_CLAUSE_MASK): New macro. (c_parser_oacc_host_data): New function. (c_parser_omp_construct): Add host_data support. * c/c-tree.h (c_finish_oacc_host_data): Add prototype. * c/c-typeck.c (c_finish_oacc_host_data): New function. (c_finish_omp_clauses): Add use_device support. * cp/cp-tree.h (finish_oacc_host_data): Add prototype. * cp/parser.c (cp_parser_omp_clause_name): Add use_device support. (cp_parser_oacc_all_clauses): Add use_device support. (OACC_HOST_DATA_CLAUSE_MASK): New macro. (cp_parser_oacc_host_data): New function. (cp_parser_omp_construct): Add host_data support. (cp_parser_pragma): Add host_data support. * cp/semantics.c (finish_omp_clauses): Add use_device support. (finish_oacc_host_data): New function. * gimple-pretty-print.c (dump_gimple_omp_target): Add host_data support. * gimple.h (gf_mask): Add GF_OMP_TARGET_KIND_OACC_HOST_DATA. 
(is_gimple_omp_oacc): Add support for above. * gimplify.c (gimplify_scan_omp_clauses): Add host_data, use_device support. (gimplify_omp_workshare): Add host_data support. (gimplify_expr): Likewise. * omp-builtins.def (BUILT_IN_GOACC_HOST_DATA): New. * omp-low.c (lookup_decl_in_outer_ctx) (maybe_lookup_decl_in_outer_ctx): Add optional argument to skip host_data regions. (scan_sharing_clauses): Support use_device. (check_omp_nesting_restrictions): Support host_data. (expand_omp_target): Support host_data. (lower_omp_target): Skip over outer host_data regions when looking up decls. Support use_device. (make_gimple_omp_edges): Support host_data. * tree-nested.c (convert_nonlocal_omp_clauses): Add use_device clause. libgomp/ * oacc-parallel.c (GOACC_host_data): New function. * libgomp.map (GOACC_host_data): Add to GOACC_2.0.1. * testsuite/libgomp.oacc-c-c++-common/host_data-1.c: New test. * tests
Re: [PATCH/RFC/RFA] Machine modes for address printing (all targets)
On Thu, 5 Nov 2015 11:22:04 +0100 Bernd Schmidt wrote:

> >   static void
> > -mcore_print_operand_address (FILE * stream, rtx x)
> > +mcore_print_operand_address (FILE * stream, machine_mode mode
> > ATTRIBUTE_UNUSED,
> > +			      rtx x)
>
> So apparently we're settling on writing the unused arg as just
> "machine_mode" without a name. Please change everywhere.
>
> > @@ -1754,7 +1754,7 @@ mmix_print_operand_punct_valid_p (unsign
> >   /* TARGET_PRINT_OPERAND_ADDRESS.  */
> >
> >   static void
> > -mmix_print_operand_address (FILE *stream, rtx x)
> > +mmix_print_operand_address (FILE *stream, machine_mode mode, rtx x)
> >   {
> >     if (REG_P (x))
> >       {
>
> The arg appears to be unused - I'd expect to see a warning here.

I've fixed those two, and a handful of other bits I missed.

> Other than that it looks OK. I'm not going to require that you test
> every target, but it would be good to have the full set built to cc1
> before and after, and please be on the lookout for fallout.

Thanks! I used the attached "build-all.sh" to test all the targets affected by the patch with "make all-gcc": those now all succeed (I'm sure I reinvented a wheel here, but perhaps the target list is useful to someone else).

Julian

ChangeLog

    gcc/
    * final.c (output_asm_insn): Pass VOIDmode to output_address.
    (output_address): Add MODE argument. Pass to print_operand_address
    hook.
    * targhooks.c (default_print_operand_address): Add MODE argument.
    * targhooks.h (default_print_operand_address): Update prototype.
    * output.h (output_address): Update prototype.
    * target.def (print_operand_address): Add MODE argument.
    * config/vax/vax.c (print_operand_address): Pass VOIDmode to
    output_address.
    (print_operand): Pass access mode to output_address.
    * config/mcore/mcore.c (mcore_print_operand_address): Add MODE
    argument.
    (mcore_print_operand): Update calls to mcore_print_operand_address.
    * config/fr30/fr30.c (fr30_print_operand): Pass VOIDmode to
    output_address.
* config/lm32/lm32.c (lm32_print_operand): Pass mode in calls to output_address. * config/tilegx/tilegx.c (output_memory_reference_mode): Remove global. (tilegx_print_operand): Don't set above global. Update calls to output_address. (tilegx_print_operand_address): Add MODE argument. Use instead of output_memory_reference_mode global. * config/frv/frv.c (frv_print_operand_address): Add MODE argument. (frv_print_operand): Pass mode to frv_print_operand_address calls. * config/mn10300/mn10300.c (mn10300_print_operand): Pass mode to output_address. * config/cris/cris.c (cris_print_operand_address): Add MODE argument. (cris_print_operand): Pass mode to output_address calls. * config/spu/spu.c (print_operand): Pass mode to output_address calls. * config/aarch64/aarch64.h (aarch64_print_operand) (aarch64_print_operand_address): Remove prototypes. * config/aarch64/aarch64.c (aarch64_memory_reference_mode): Delete global. (aarch64_print_operand): Make static. Update calls to output_address. (aarch64_print_operand_address): Add MODE argument. Use instead of aarch64_memory_reference_mode global. (TARGET_PRINT_OPERAND, TARGET_PRINT_OPERAND_ADDRESS): Define target hooks. * config/aarch64/aarch64.h (PRINT_OPERAND, PRINT_OPERAND_ADDRESS): Delete macro definitions. * config/pa/pa.c (pa_print_operand): Pass mode in output_address calls. * config/xtensa/xtensa.c (print_operand): Pass mode in output_address calls. * config/h8300/h8300.c (h8300_print_operand_address): Add MODE argument. (h83000_print_operand): Update calls to h8300_print_operand_address and output_address. * config/ia64/ia64.c (ia64_print_operand_address): Add MODE argument. * config/tilepro/tilepro.c (output_memory_reference_mode): Delete global. (tilepro_print_operand): Pass mode to output_address. (tilepro_print_operand_address): Add MODE argument. Use instead of output_memory_reference_mode. 
* config/nvptx/nvptx.c (output_decl_chunk, nvptx_assemble_integer) (nvptx_output_call_insn, nvptx_print_address_operand): Pass VOIDmode to output_address calls. (nvptx_print_operand_address): Add MODE argument. * config/alpha/alpha.c (print_operand): Pass mode argument in output_address calls. * config/m68k/m68k.c (print_operand): Pass mode argument in output_address call. * config/avr/avr.c (avr_print_operand_address): Add MODE argument. (avr_print_operand): Update calls to avr_print_operand_address. * config/sparc/sparc.c (sparc_print_operand_address): Add MODE argument. Update calls to output_address. (sparc_print_operand): Pass mode to output_address. * config/iq2000/iq2000.c (iq2000_print_operand_address): Add MODE argument. (iq2000_print_operand): Pass mode in output_address calls. *
[PATCH/RFC/RFA] Machine modes for address printing (all targets)
Hi,

Depending on assembler syntax and supported addressing modes, several targets need to know the machine mode for a memory access when printing an address (i.e. for automodify addresses that need to know the size of their access), but it is not available with the current TARGET_PRINT_OPERAND_ADDRESS hook. This leads to an ugly corner in the operand output mechanism, where address printing gets split between different parts of a backend, or some other hack (e.g. a global variable) is used to communicate the machine mode to the address printing hook.

Using a global variable also leads to a latent (?) bug on at least AArch64: attempts to use the 'a' operand printing code cause final.c to call output_address (in turn invoking the PRINT_OPERAND_ADDRESS macro) *without* first setting the magic global aarch64_memory_reference_mode, which means a stale value will be used instead.

The full list of targets that use some form of workaround for the lack of machine mode in the address printing hook is (E):

  aarch64: uses magic global.
  arc: pre/post inc/dec handled in print_operand.
  arm: uses magic global.
  c6x
  epiphany: offsets handled in print_operand.
  m32r: hard-wires 4 for access size.
  nds32
  tilegx: uses magic global.
  tilepro: uses magic global.

That's not all targets by any means, but may be enough to warrant a change in the interface. I propose that:

* The output_address function should have a machine_mode argument added. Bare addresses (e.g. the 'a' case in final.c) should pass "VOIDmode" for this argument.

* Other callers of output_address -- actually all in backends -- can pass the machine mode for the memory access in question.

* The TARGET_PRINT_OPERAND_ADDRESS hook shall also have a machine_mode argument added. The legacy PRINT_OPERAND_ADDRESS hook can be left alone. (The documentation for the operand-printing hooks needs fixing too, incidentally.)

The attached patch makes this change, fairly mechanically.
This removes (most of) the magic globals for address printing, but I haven't tried to refactor the targets that use other hacks to print correct auto-modify addresses (that can be done by their respective maintainers, hopefully, and should result in a nice cleanup). Unfortunately I can't hope to test all the targets affected, though the subset of targets that it's relatively easy for me to build, build fine. I also ran regression tests for AArch64. OK to apply, or any comments, or any further testing required? Thanks, Julian ChangeLog gcc/ * final.c (output_asm_insn): Pass VOIDmode to output_address. (output_address): Add MODE argument. Pass to print_operand_address hook. * targhooks.c (default_print_operand_address): Add MODE argument. * targhooks.h (default_print_operand_address): Update prototype. * output.h (output_address): Update prototype. * target.def (print_operand_address): Add MODE argument. * config/vax/vax.c (print_operand_address): Pass VOIDmode to output_address. (print_operand): Pass access mode to output_address. * config/mcore/mcore.c (mcore_print_operand_address): Add MODE argument. (mcore_print_operand): Update calls to mcore_print_operand_address. * config/fr30/fr30.c (fr30_print_operand): Pass VOIDmode to output_address. * config/lm32/lm32.c (lm32_print_operand): Pass mode in calls to output_address. * config/tilegx/tilegx.c (output_memory_reference_mode): Remove global. (tilegx_print_operand): Don't set above global. Update calls to output_address. (tilegx_print_operand_address): Add MODE argument. Use instead of output_memory_reference_mode global. * config/frv/frv.c (frv_print_operand_address): Add MODE argument. * config/mn10300/mn10300.c (mn10300_print_operand): Pass mode to output_address. * config/cris/cris.c (cris_print_operand_address): Add MODE argument. (cris_print_operand): Pass mode to output_address calls. * config/spu/spu.c (print_operand): Pass mode to output_address calls. 
* config/aarch64/aarch64.h (aarch64_print_operand) (aarch64_print_operand_address): Remove prototypes. * config/aarch64/aarch64.c (aarch64_memory_reference_mode): Delete global. (aarch64_print_operand): Make static. Update calls to output_address. (aarch64_print_operand_address): Add MODE argument. Use instead of aarch64_memory_reference_mode global. (TARGET_PRINT_OPERAND, TARGET_PRINT_OPERAND_ADDRESS): Define target hooks. * config/aarch64/aarch64.h (PRINT_OPERAND, PRINT_OPERAND_ADDRESS): Delete macro definitions. * config/pa/pa.c (pa_print_operand): Pass mode in output_address calls. * config/xtensa/xtensa.c (print_operand): Pass mode in output_address calls. * config/h8300/h8300.c (h8300_print_operand_address): Add MODE argument. (h83000_print_operand): Update calls to h8300_print_operand_address and output_address. * config/ia64/ia64.c
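As an illustration of the interface change proposed in this message, here is a minimal sketch in plain C. It is not GCC code: the toy_* names, the toy_mode enum and the "[rN], #size" output syntax are all invented for the example. It only shows the shape of the idea, i.e. the address printer receives the access mode as an argument rather than reading a global that somebody has to remember to set first, and a bare address is passed a "void" mode.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-ins for machine modes (hypothetical, not GCC's enum).  */
enum toy_mode { TOY_VOIDmode, TOY_QImode, TOY_SImode, TOY_DImode };

/* Access size in bytes for each toy mode.  */
static int
toy_mode_size (enum toy_mode mode)
{
  switch (mode)
    {
    case TOY_QImode: return 1;
    case TOY_SImode: return 4;
    case TOY_DImode: return 8;
    default: return 0;	/* TOY_VOIDmode: bare address, no access size.  */
    }
}

/* Toy address: a base register, optionally post-incremented.  */
struct toy_addr
{
  int base_reg;
  int post_inc;
};

/* Address printer with the extra MODE argument.  The access size comes
   from the caller, so there is no global that could hold a stale value.  */
static void
toy_print_operand_address (char *buf, size_t len, enum toy_mode mode,
			   struct toy_addr addr)
{
  if (addr.post_inc)
    {
      /* An auto-modify address needs to know the size of its access.  */
      assert (mode != TOY_VOIDmode);
      snprintf (buf, len, "[r%d], #%d", addr.base_reg, toy_mode_size (mode));
    }
  else
    snprintf (buf, len, "[r%d]", addr.base_reg);
}
```

A caller printing a memory operand would pass the mode of the access, while the equivalent of the 'a' operand code would pass TOY_VOIDmode.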
Re: [Bulk] [OpenACC 0/7] host_data construct
On Mon, 26 Oct 2015 19:34:22 +0100 Jakub Jelinek wrote:

> Your use_device sounds very similar to use_device_ptr clause in
> OpenMP, which is allowed on #pragma omp target data construct and is
> implemented quite a bit differently from this; it is unclear if the
> OpenACC standard requires this kind of implementation, or you just
> chose to implement it this way. In particular, the GOMP_target_data
> call puts the variables mentioned in the use_device_ptr clauses into
> the mapping structures (similarly how map clause appears) and the
> corresponding vars are privatized within the target data region
> (which is a host region, basically a fancy { } braces), where the
> private variables contain the offloading device's pointers.

As the author of the original patch, I have to say using the mapping structures seems like a far better approach, but I've hit some trouble with the details of adapting OpenACC to use that method.

Firstly, on trunk at least, use_device_ptr variables are restricted to pointer or array types: that restriction doesn't exist in OpenACC, nor actually could I find it in the OpenMP 4.1 document (my guess is the standards are supposed to match in this regard). I think that a program such as this should work:

  void target_fn (int *targ_data);

  int
  main (int argc, char *argv[])
  {
    char out;
    int myvar;
  #pragma omp target enter data map(to: myvar)
  #pragma omp target data use_device_ptr(myvar) map(from:out)
    {
      target_fn (&myvar);
      out = 5;
    }
    return 0;
  }

"myvar" would have its address taken in the use_device_ptr region, and places where the corresponding mapped variable has its address taken would be replaced by a direct use of the mapped pointer. (Or is that not a well-formed thing to do, in general?). This fails with "error: 'use_device_ptr' variable is neither a pointer nor an array".

Secondly, attempts to use use_device_ptr on (e.g. dynamically-allocated) arrays accessed through a pointer cause an ICE with the existing trunk OpenMP code:

  #include <stdlib.h>

  void target_fn (char *targ_data);

  int
  main (int argc, char *argv[])
  {
    char *myarr, out;
    myarr = malloc (1024);
  #pragma omp target data map(to: myarr[0:1024])
    {
  #pragma omp target data use_device_ptr(myarr) map(from:out)
      {
        target_fn (myarr);
        out = 5;
      }
    }
    return 0;
  }

udp3.c: In function 'main':
udp3.c:6:1: internal compiler error: in make_decl_rtl, at varasm.c:1298
 main (int argc, char *argv[])
 ^
0x111256b make_decl_rtl(tree_node*)
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/varasm.c:1294
0x9ea005 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/expr.c:9559
0x9e31c2 expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/expr.c:7892
0x9cb4ae expand_expr
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/expr.h:255
0x9d907d expand_assignment(tree_node*, tree_node*, bool)
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/expr.c:5089
0x89e219 expand_gimple_stmt_1
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/cfgexpand.c:3576
0x89e60d expand_gimple_stmt
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/cfgexpand.c:3672
0x8a5773 expand_gimple_basic_block
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/cfgexpand.c:5676
0x8a72d4 execute
	/scratch/jbrown/openacc-trunk/src/gcc-mainline/gcc/cfgexpand.c:6288

Furthermore, this looks strange to me (006t.omplower):

  .omp_data_arr.5.out = myarr.8 = myarr;
  .omp_data_arr.5.myarr = myarr.8;
  #pragma omp target data map(from:out [len: 1]) use_device_ptr(myarr)
    {
      D.2436 = .omp_data_arr.5.myarr;
      myarr = D.2436;

That's clobbering the original myarr variable, right?

Any clues on these two? The omp-low.c code is rather opaque to me...

Thanks,

Julian
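For comparison with the dump above, the privatization Jakub describes (a private variable inside the region holding the device pointer, with the host variable left alone) could look roughly like the following host-only sketch. Everything here is an assumption made for illustration: the toy_* names, the one-entry mapping table and the fake "device" buffer have nothing to do with the real libgomp data structures or with what omp-low.c actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-entry table standing in for the runtime's
   host-to-device mapping structures.  */
struct toy_mapping
{
  void *host;
  void *dev;
};

/* Look up the device address for HOST, or NULL if it is not mapped.  */
static void *
toy_lookup_device_ptr (const struct toy_mapping *map, void *host)
{
  return map->host == host ? map->dev : NULL;
}

/* Records the pointer that the "target" routine was handed.  */
static void *toy_seen;

static void
toy_target_fn (char *targ_data)
{
  toy_seen = targ_data;
}

/* Returns 1 if the routine inside the region saw the device pointer
   and the host pointer is unchanged after the region.  */
static int
toy_use_device_ptr_region (void)
{
  static char host_buf[16], dev_buf[16];
  struct toy_mapping map = { host_buf, dev_buf };
  char *myarr = host_buf;

  {
    /* Private copy scoped to the region; the outer variable is never
       assigned to, so nothing is clobbered.  */
    char *myarr_priv = (char *) toy_lookup_device_ptr (&map, myarr);
    assert (myarr_priv != NULL);
    toy_target_fn (myarr_priv);
  }

  return myarr == host_buf && toy_seen == (void *) dev_buf;
}
```

The contrast with the dump is only that the region-local copy receives the mapped address, whereas the dump assigns it back to myarr itself.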
Re: [PATCH] [ARM] neon-testgen.ml typo
Hi,

On Thu, 29 Oct 2015 10:23:58 -0700 Jim Wilson wrote:

> I noticed a comment typo in this file while using grep to look for
> other stuff. The typo is easy to fix.
>
> I tried running neon-testgen.ml to verify, but it is apparently no
> longer valid ocaml, as it doesn't work with the ocamlc 4.01.0 I have
> on Ubuntu 14.04. I get a syntax error. Someone who knows ocaml will
> have to fix this. Meanwhile, the patch to fix the typo should still
> be OK, as this is a separate problem.

This seems to work for me (semicolons in OCaml are separators not terminators - I'm not sure why this worked before).

OK to apply?

Julian

ChangeLog

    gcc/
    * config/arm/neon-testgen.ml (emit_epilogue): Remove extraneous
    brackets and semicolon.

Index: gcc/config/arm/neon-testgen.ml
===
--- gcc/config/arm/neon-testgen.ml (revision 229410)
+++ gcc/config/arm/neon-testgen.ml (working copy)
@@ -130,14 +130,14 @@ let emit_call chan const_valuator c_type
 let emit_epilogue chan features regexps =
   let no_op = List.exists (fun feature -> feature = No_op) features in
   Printf.fprintf chan "}\n\n";
-  (if not no_op then
-     List.iter (fun regexp ->
-                Printf.fprintf chan
-                  "/* { dg-final { scan-assembler \"%s\" } } */\n" regexp)
+  if not no_op then
+    List.iter (fun regexp ->
+                Printf.fprintf chan
+                  "/* { dg-final { scan-assembler \"%s\" } } */\n" regexp)
       regexps
-   else
-     ()
-  );
+  else
+    ()
+
 
 (* Check a list of C types to determine which ones are pointers and
    which ones are const.  *)
Re: [gomp4 00/14] NVPTX: further porting
On Thu, 22 Oct 2015 19:41:51 +0300 Alexander Monakov wrote:

> On Thu, 22 Oct 2015, Jakub Jelinek wrote:
> > Does that apply also to threads within a warp? I.e. is .local
> > local to each thread in the warp, or to the whole warp, and if the
> > former, how can say at the start of a SIMD region or at its end the
> > local vars be broadcast to other threads and collected back? One
> > thing is scalar vars, another pointers, or references to various
> > types, or even bigger indirection.
>
> .local is indeed local to each warp member, not the warp as a whole.
> What OpenACC/PTX implementation does is to copy the whole stack
> frame, plus live registers: the implementation is in
> nvptx.c:nvptx_propagate.
>
> I see two possible alternative approaches for OpenMP/PTX.

> The second approach is to run all threads in the warp all the time,
> making sure they execute the same code with the same data, and thus
> build up the same local state. In this case we'd need to ensure this
> invariant: if threads in the warp have the same state prior to
> executing an instruction, they also have the same state after
> executing that instruction (plus global state changes as if only one
> thread executed that instruction).
>
> Most instructions are safe w.r.t this invariant.

> Was something like this considered (and rejected?) for OpenACC?

I'm not sure we understood the "global state changes as if only one thread executed that instruction" bit (do you have a citation?). But anyway, even if that works for threads within a warp, it doesn't work for warps within a CTA, so we'd still need some broadcast mechanism for those.

Julian
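To make the "broadcast" idea in this exchange concrete, here is a small host-side simulation in C. It is only a sketch under stated assumptions: the toy_* names and the layout of the per-lane state are invented, and it makes no claim about how nvptx.c:nvptx_propagate is actually written. It models each lane of a 32-wide warp as having its own private copy of some local state, and shows the state of one lane being copied to all the others so that they enter a vector region with identical state.

```c
#include <assert.h>
#include <string.h>

#define TOY_WARP_SIZE 32

/* Hypothetical per-lane local state: a slice of "stack frame" plus a
   couple of "live registers".  */
struct toy_lane_state
{
  int frame[4];
  int reg_i;
  int reg_limit;
};

/* Copy lane SRC's state to every other lane.  */
static void
toy_propagate (struct toy_lane_state lanes[TOY_WARP_SIZE], int src)
{
  assert (src >= 0 && src < TOY_WARP_SIZE);

  for (int lane = 0; lane < TOY_WARP_SIZE; lane++)
    if (lane != src)
      memcpy (&lanes[lane], &lanes[src], sizeof lanes[src]);
}

/* Returns 1 if every lane's state is byte-identical to lane 0's.  */
static int
toy_lanes_uniform (const struct toy_lane_state lanes[TOY_WARP_SIZE])
{
  for (int lane = 1; lane < TOY_WARP_SIZE; lane++)
    if (memcmp (&lanes[lane], &lanes[0], sizeof lanes[0]) != 0)
      return 0;
  return 1;
}
```

The alternative discussed in the quoted text would keep the lanes uniform at all times instead, so that no such copy is needed within a warp; as noted above, that still leaves the question of separate warps within a CTA.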
Re: [OpenACC 1/11] UNIQUE internal function
On Thu, 22 Oct 2015 10:05:30 +0200 Richard Biener wrote:

> On Thu, Oct 22, 2015 at 9:59 AM, Jakub Jelinek wrote:
> > On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:
> >> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such
> >> >> distinct internal fns. This replaces that scheme.
> >> >>
> >> >> ok?
> >> >
> >> > Hmm, I'd just have used gimple_has_volatile_ops on the call?
> >> > That should have the desired effects.
> >>
> >> That is, whatever new IFNs you need are ok, but special-casing
> >> them is not necessary if you properly mark the calls as volatile.
> >
> > I don't see gimple_has_volatile_ops used in tracer.c or
> > tree-ssa-threadedge.c. Setting gimple_has_volatile_ops on those
> > IFNs is fine, but I think they are even stronger than that.
>
> Hmm, indeed. Now I fail to see how the implemented property
> "preserves the CFG looping structure". And I would have expected
> can_copy_bbs_p to be adjusted instead (catching more cases and the
> threading and tracer case as well).
>
> As far as I can see nothing would prevent dissolving the loop by
> completely unrolling it for example. Or deleting it because it has no
> side-effects.
>
> So you'd need to be more precise as to what properties you are trying
> to preserve by placing a single stmt somewhere.

FWIW an earlier, abandoned attempt at solving the same problem was discussed in the following thread, continuing through June:

https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html

Though the details of lowering of OpenACC constructs have changed with Nathan's current patches, the underlying problem remains the same. PTX requires certain operations (bar.sync) to be executed uniformly by all threads in a CTA. IIUC this affects "JOIN" points across all workers/vectors in a gang, in particular (though this is generic code, other -- particularly GPU -- targets may have similar restrictions).

HTH,

Julian
Re: Repository for the conversion machinery
On Fri, 28 Aug 2015 17:50:53 + Joseph Myers wrote:

> shinwell = Mark Shinwell
> (Jane Street)

Mark's current address is mshinw...@janestreet.com.

Julian
[gomp4] Some additional OpenACC reduction tests
Hi,

This is a set of 19 new tests for OpenACC reductions, covering several ways of performing reductions over the parallel and loop directives using gang or worker/vector level parallelism. (The semantics are quite subtle in some places, but I believe the tests follow the specification to the letter at least, EOE.)

Several of these do not pass yet, so have been marked with XFAILs.

I will apply to gomp4 branch shortly.

Cheers,

Julian

ChangeLog

    libgomp/
    * testsuite/libgomp.oacc-c-c++-common/loop-reduction-*.c: New tests.
    * testsuite/par-reduction-*.c: New tests.
    * testsuite/libgomp.oacc-c-c++-common/par-loop-comb-reduction-*.c:
    New tests.

commit d6cb22b11bbe6f536bd0f6d5ce8349266040
Author: Julian Brown <jul...@codesourcery.com>
Date:   Wed Jul 29 10:04:36 2015 -0700

    Some new OpenACC reduction tests.

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gang-np-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gang-np-1.c
new file mode 100644
index 000..52f9a8f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gang-np-1.c
@@ -0,0 +1,43 @@
+#include <assert.h>
+
+/* Test of reduction on loop directive (gangs, non-private reduction
+   variable).  */
+
+int
+main (int argc, char *argv[])
+{
+  int i, arr[1024], res = 0, hres = 0;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = i;
+
+  #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+		       copy(res)
+  {
+    #pragma acc loop gang reduction(+:res)
+    for (i = 0; i < 1024; i++)
+      res += arr[i];
+  }
+
+  for (i = 0; i < 1024; i++)
+    hres += arr[i];
+
+  assert (res == hres);
+
+  res = hres = 1;
+
+  #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+		       copy(res)
+  {
+    #pragma acc loop gang reduction(*:res)
+    for (i = 0; i < 12; i++)
+      res *= arr[i];
+  }
+
+  for (i = 0; i < 12; i++)
+    hres *= arr[i];
+
+  assert (res == hres);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gv-np-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gv-np-1.c
new file mode 100644
index 000..b5e3b2f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gv-np-1.c
@@ -0,0 +1,28 @@
+#include <assert.h>
+
+/* Test of reduction on loop directive (gangs and vectors, non-private
+   reduction variable).  */
+
+int
+main (int argc, char *argv[])
+{
+  int i, arr[1024], res = 0, hres = 0;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = i;
+
+  #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+		       copy(res)
+  {
+    #pragma acc loop gang vector reduction(+:res)
+    for (i = 0; i < 1024; i++)
+      res += arr[i];
+  }
+
+  for (i = 0; i < 1024; i++)
+    hres += arr[i];
+
+  assert (res == hres);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gw-np-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gw-np-1.c
new file mode 100644
index 000..d724680
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gw-np-1.c
@@ -0,0 +1,28 @@
+#include <assert.h>
+
+/* Test of reduction on loop directive (gangs and workers, non-private
+   reduction variable).  */
+
+int
+main (int argc, char *argv[])
+{
+  int i, arr[1024], res = 0, hres = 0;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = i;
+
+  #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+		       copy(res)
+  {
+    #pragma acc loop gang worker reduction(+:res)
+    for (i = 0; i < 1024; i++)
+      res += arr[i];
+  }
+
+  for (i = 0; i < 1024; i++)
+    hres += arr[i];
+
+  assert (res == hres);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gwv-np-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gwv-np-1.c
new file mode 100644
index 000..d610373
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gwv-np-1.c
@@ -0,0 +1,28 @@
+#include <assert.h>
+
+/* Test of reduction on loop directive (gangs, workers and vectors, non-private
+   reduction variable).  */
+
+int
+main (int argc, char *argv[])
+{
+  int i, arr[1024], res = 0, hres = 0;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = i;
+
+  #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+		       copy(res)
+  {
+    #pragma acc loop gang worker vector reduction(+:res)
+    for (i = 0; i < 1024; i++)
+      res += arr[i];
+  }
+
+  for (i = 0; i < 1024; i++)
+    hres += arr[i];
+
+  assert (res == hres);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gwv-np-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gwv-np-2.c
new file mode 100644
index 000..3e5c707
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-reduction-gwv-np-2.c
@@ -0,0 +1,36 @@
+/* { dg-xfail-run-if TODO { openacc_nvidia_accel_selected } { * } { } } */
+
+#include <assert.h>
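The tests above all compare an offloaded "+" reduction against a sequential host loop. As a rough mental model of what a gang-partitioned reduction has to compute, here is a host-only sketch in C. It is an illustration under assumptions, not a description of GCC's lowering: the toy_* names are invented, and the cyclic distribution of iterations over gangs is chosen arbitrarily for the example. Each simulated gang accumulates a private partial sum, and the partials are combined at the end.

```c
#include <assert.h>

#define TOY_N 1024
#define TOY_GANGS 32

/* Sum ARR as a set of per-gang partial sums combined at the end.  */
static int
toy_gang_sum (const int *arr, int n, int gangs)
{
  int partial[TOY_GANGS] = { 0 };
  int res = 0;

  assert (gangs > 0 && gangs <= TOY_GANGS);

  /* Each gang handles iterations g, g + gangs, g + 2 * gangs, ...
     (an assumed distribution; a real implementation may differ).  */
  for (int g = 0; g < gangs; g++)
    for (int i = g; i < n; i += gangs)
      partial[g] += arr[i];

  /* Combine the private partial results into the final value.  */
  for (int g = 0; g < gangs; g++)
    res += partial[g];

  return res;
}

/* Sequential reference, mirroring the host loop in the tests.  */
static int
toy_host_sum (const int *arr, int n)
{
  int hres = 0;

  for (int i = 0; i < n; i++)
    hres += arr[i];

  return hres;
}
```

For integer addition the order in which partial results are combined does not matter, which is why the tests can require exact equality with the host result.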
Re: [gomp4] Remove device-specific filtering during parsing for OpenACC
On Fri, 17 Jul 2015 14:57:14 +0200 Thomas Schwinge <tho...@codesourcery.com> wrote:

> In combination with the equivalent change to gcc/cp/parser.c:cp_parser_oacc_all_clauses, gcc/c-family/c-omp.c:c_oacc_filter_device_types, and transitively also the struct identifier_hasher and c_oacc_extract_device_id function preceding it, are now unused. (Not an exhaustive list; have not checked which other auxiliary functions etc. Cesar has added in his device_type changes.) Does it make any sense to keep these for later, or dump them now?

The attached patch removes this dead code...

> > --- a/gcc/c/c-typeck.c
> > +++ b/gcc/c/c-typeck.c
> > @@ -12568,6 +12568,10 @@ c_finish_omp_clauses (tree clauses, bool oacc)
> >           pc = &OMP_CLAUSE_CHAIN (c);
> >           continue;
> >
> > +       case OMP_CLAUSE_DEVICE_TYPE:
> > +         pc = &OMP_CLAUSE_DEVICE_TYPE_CLAUSES (c);
> > +         continue;
> > +
> >         case OMP_CLAUSE_INBRANCH:
> >         case OMP_CLAUSE_NOTINBRANCH:
> >           if (branch_seen)
>
> From a quick glance only, this seems to be different from the C++ front end (have not checked Fortran). I have not looked at what the front end parsing is now actually doing; is it just attaching any clauses following a device_type clause to the latter? (The same should be done for all front ends, obviously. Even if it's not important right now, because of the sorry diagnostic that will be emitted later on as soon as there is one device_type clause, this should best be addressed now, while you still remember what's going on here ;-) so that there will be no bad surprises once we actually implement the handling in OMP lowering/streaming/device compilers.) Do we manually need to take care to finalize (c_finish_omp_clauses et al.) such masked clause chains, or will the right thing happen automatically?

...and fixes the C and C++ frontend to finalize parsed device_type clauses properly (although so far finalization doesn't do anything for the clauses that can be associated with a device_type clause anyway, so there's no actual change in behaviour).
I haven't moved the sorry reporting for the unsupported device_type clause to scan_sharing_clauses because it doesn't seem to be particularly a more logical place, and doing so breaks the tests that scan the omp-low dumps.

I will apply to gomp4 branch as obvious, shortly.

Thanks,

Julian

ChangeLog

    gcc/
    * c-family/c-omp.c (c_oacc_extract_device_id, identifier_hasher)
    (c_oacc_filter_device_types): Remove dead code.
    * c/c-typeck.c (c_finish_omp_clauses): Add scanning for sub-clauses of
    device_type clause.
    * cp/semantics.c (finish_omp_clauses): Likewise.

commit e24a9cd14d4b8b5dab8b37218b29844787809648
Author: Julian Brown <jul...@codesourcery.com>
Date:   Mon Jul 27 07:31:10 2015 -0700

    Clause finalization cleanups and dead code removal.

diff --git a/gcc/c-family/c-omp.c b/gcc/c-family/c-omp.c
index b76de69..10190d7 100644
--- a/gcc/c-family/c-omp.c
+++ b/gcc/c-family/c-omp.c
@@ -1081,132 +1081,6 @@ c_omp_predetermined_sharing (tree decl)
   return OMP_CLAUSE_DEFAULT_UNSPECIFIED;
 }
 
-/* Return a numerical code representing the device_type.  Currently,
-   only device_type(nvidia) is supported.  All device_type parameters
-   are treated as case-insensitive keywords.  */
-
-static int
-c_oacc_extract_device_id (const char *device)
-{
-  if (!strcasecmp (device, "nvidia"))
-    return GOMP_DEVICE_NVIDIA_PTX;
-  else if (!strcmp (device, "*"))
-    return GOMP_DEVICE_DEFAULT;
-  return GOMP_DEVICE_NONE;
-}
-
-struct identifier_hasher : ggc_cache_ptr_hash<tree_node>
-{
-  static hashval_t hash (tree t) { return htab_hash_pointer (t); }
-  static bool equal (tree a, tree b)
-  {
-    return !strcmp(IDENTIFIER_POINTER (a), IDENTIFIER_POINTER (b));
-  }
-};
-
-/* Filter out the list of unsupported OpenACC device_types.  */
-
-tree
-c_oacc_filter_device_types (tree clauses)
-{
-  tree c, prev;
-  tree dtype = NULL_TREE;
-  tree seen_nvidia = NULL_TREE;
-  tree seen_default = NULL_TREE;
-  hash_table<identifier_hasher> *dt_htab
-    = hash_table<identifier_hasher>::create_ggc (10);
-
-  /* First scan for all device_type clauses.  */
-  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-    {
-      if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_DEVICE_TYPE)
-        {
-          tree t;
-
-          for (t = OMP_CLAUSE_DEVICE_TYPE_DEVICES (c); t; t = TREE_CHAIN (t))
-            {
-              if (dt_htab->find (t))
-                {
-                  error_at (OMP_CLAUSE_LOCATION (c),
-                            "duplicate device_type (%s)",
-                            IDENTIFIER_POINTER (t));
-                  goto filter_dtype;
-                }
-
-              int code = c_oacc_extract_device_id (IDENTIFIER_POINTER (t));
-
-              if (code == GOMP_DEVICE_DEFAULT)
-                seen_default = OMP_CLAUSE_DEVICE_TYPE_CLAUSES (c);
-              else if (code == GOMP_DEVICE_NVIDIA_PTX)
-                seen_nvidia = OMP_CLAUSE_DEVICE_TYPE_CLAUSES (c);
-              else
-                {
-                  /* The OpenACC technical committee advises compilers
-                     to silently ignore unknown devices.  */
-                }
-
-              tree *slot = dt_htab->find_slot (t, INSERT);
-              *slot = t
Re: [gomp4] Remove device-specific filtering during parsing for OpenACC
On Fri, 17 Jul 2015 14:57:14 +0200 Thomas Schwinge <tho...@codesourcery.com> wrote:

> Hi Julian!
>
> On Thu, 16 Jul 2015 16:32:12 +0100, Julian Brown <jul...@codesourcery.com> wrote:
> > This patch removes the device-specific filtering (for NVidia PTX) from the parsing stages of the host compiler (for the device_type clause -- separately for C, C++ and Fortran) in favour of fully parsing the device_type clauses, but not actually implementing anything for them (device_type support is a feature that we're not planning to implement just yet: the existing support is something of a red herring).
> >
> > With this patch, the parsed device_type clauses will be ready at OMP lowering time whenever we choose to do something with them (e.g. transforming them into a representation that can be streamed out and re-read by the appropriate offload compiler). The representation is more-or-less the same for all supported languages
>
> Thanks!
>
> > modulo clause ordering.
>
> Is that something that a) doesn't need to be/already has been addressed (with your patch), or b) still needs to be addressed?

It's something that doesn't matter, I think: clauses are chained together like this:

  num_gangs
  num_workers
  ...
  |
  device_type(foo) \__num_gangs (OMP_CLAUSE_DEVICE_TYPE_CLAUSES)
  |                   num_workers
  |                   ...
  device_type(bar) \__num_gangs
  |                   num_workers
  |                   ...
  V (OMP_CLAUSE_CHAIN)

foo and bar are OMP_CLAUSE_DEVICE_TYPE_DEVICES -- tree lists. The Fortran front-end will emit num_gangs, num_workers etc. clauses in a fixed order (irrespective of their order in the source program), but the C and C++ frontends will emit them in the (reverse of the) order encountered. There isn't really a consumer for this information yet, but when there is, it will just have to not care about that (which should be straightforward, I think).

> > I've altered the dtype-*.* tests to account for the new behaviour (and to not use e.g. mixed-case nVidia or acc_device_nvidia names, which are contrary to the recommendations in the spec).
>
> OpenACC 2.0a indeed seems to suggest that device_type arguments are case-sensitive -- contrary to the ACC_DEVICE_TYPE environment variable, which probably is where the idea came from to parse them case-insensitive. As to the latter invalid names, I thought the idea has been to verify that the clauses following such device_types clauses are indeed ignored in the later processing. (Obviously, there should've been comments indicating that, as otherwise that's very confusing -- as we've just seen -- due to the similarity to the runtime library's acc_device_* device type values.)

Yes, and there are still some tests for that functionality. I figured there wasn't much point in over-testing it, especially since none of this code does that much yet.

> > OK to apply, or any comments?
>
> Your commit r225927 appears to have caused:
>
> [-PASS:-]{+FAIL: libgomp.fortran/declare-simd-2.f90 -O0 (internal compiler error)+}
> {+FAIL:+} libgomp.fortran/declare-simd-2.f90 -O0 (test for excess errors)
> [-PASS:-]{+UNRESOLVED:+} libgomp.fortran/declare-simd-2.f90 -O0 [-execution test-] [-PASS:-]{+compilation failed to produce executable+}
> [same for other optimization levels]
>
> [...]/source-gcc/libgomp/testsuite/libgomp.fortran/declare-simd-3.f90:17:0: internal compiler error: Segmentation fault
> 0xc39b6f crash_signal
>         [...]/source-gcc/gcc/toplev.c:352
> 0x7043a8 gfc_trans_omp_clauses
>         [...]/source-gcc/gcc/fortran/trans-openmp.c:2671
> 0x7049a8 gfc_trans_omp_declare_simd(gfc_namespace*)
>         [...]/source-gcc/gcc/fortran/trans-openmp.c:4589
> 0x6b8542 gfc_get_extern_function_decl(gfc_symbol*)
>         [...]/source-gcc/gcc/fortran/trans-decl.c:2025
> 0x6b878d gfc_get_extern_function_decl(gfc_symbol*)
>         [...]/source-gcc/gcc/fortran/trans-decl.c:1820
> 0x6ce952 conv_function_val
>         [...]/source-gcc/gcc/fortran/trans-expr.c:3601
> 0x6ce952 gfc_conv_procedure_call(gfc_se*, gfc_symbol*, gfc_actual_arglist*, gfc_expr*, vec<tree_node*, va_gc, vl_embed>*)
>         [...]/source-gcc/gcc/fortran/trans-expr.c:5873
> 0x6cf4c2 gfc_conv_expr(gfc_se*, gfc_expr*)
>         [...]/source-gcc/gcc/fortran/trans-expr.c:7391
> 0x6d71d0 gfc_trans_assignment_1
>         [...]/source-gcc/gcc/fortran/trans-expr.c:9127
> 0x692465 trans_code
>         [...]/source-gcc/gcc/fortran/trans.c:1674
> 0x6fa457 gfc_trans_omp_code
>         [...]/source-gcc/gcc/fortran/trans-openmp.c:2711
> 0x705410 gfc_trans_omp_do
>         [...]/source-gcc/gcc/fortran/trans-openmp.c:3459
> 0x707f9f gfc_trans_omp_directive(gfc_code*)
>         [...]/source-gcc/gcc/fortran/trans-openmp.c:4521
> 0x6922b7 trans_code
>         [...]/source-gcc/gcc/fortran/trans.c:1924
> 0x6c0660 gfc_generate_function_code(gfc_namespace*)
>         [...]/source-gcc/gcc/fortran/trans-decl.c:6231
> 0x64d630 translate_all_program_units
>         [...]/source-gcc/gcc/fortran
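
As a rough illustration of the clause chaining described in the reply above (a main chain linked via OMP_CLAUSE_CHAIN, with each device_type clause carrying its own sub-chain via OMP_CLAUSE_DEVICE_TYPE_CLAUSES), here is a minimal stand-alone C sketch. It follows my reading of the diagram only: `struct clause`, `clauses_for_device` and `find_clause` are hypothetical names and are not GCC's `tree` API. The point is that a consumer which looks clauses up by name does not need to care whether a front end emitted them in source order, reverse order or a fixed order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a clause node; GCC itself uses 'tree' nodes.  */
struct clause
{
  const char *name;     /* "num_gangs", "num_workers", "device_type", ...  */
  const char *device;   /* device name for a device_type clause, else NULL  */
  struct clause *sub;   /* plays the role of OMP_CLAUSE_DEVICE_TYPE_CLAUSES  */
  struct clause *next;  /* plays the role of OMP_CLAUSE_CHAIN  */
};

/* Return the sub-chain attached to device_type(DEVICE), or NULL if the
   main chain has no such clause.  */
const struct clause *
clauses_for_device (const struct clause *chain, const char *device)
{
  for (; chain != NULL; chain = chain->next)
    if (chain->device != NULL && strcmp (chain->device, device) == 0)
      return chain->sub;
  return NULL;
}

/* Find clause NAME anywhere in CHAIN, so the result does not depend on
   the order in which a front end emitted the clauses.  */
const struct clause *
find_clause (const struct clause *chain, const char *name)
{
  for (; chain != NULL; chain = chain->next)
    if (strcmp (chain->name, name) == 0)
      return chain;
  return NULL;
}
```

With such helpers, a lookup of num_workers under device_type(foo) gives the same answer whichever way the sub-chain happens to be ordered.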
Re: [gomp4] Remove device-specific filtering during parsing for OpenACC
On Fri, 17 Jul 2015 14:57:14 +0200 Thomas Schwinge <tho...@codesourcery.com> wrote:

> Your commit r225927 appears to have caused:
>
> [-PASS:-]{+FAIL: libgomp.fortran/declare-simd-2.f90 -O0 (internal compiler error)+}
> {+FAIL:+} libgomp.fortran/declare-simd-2.f90 -O0 (test for excess errors)
> [-PASS:-]{+UNRESOLVED:+} libgomp.fortran/declare-simd-2.f90 -O0 [-execution test-] [-PASS:-]{+compilation failed to produce executable+}
> [same for other optimization levels]

This is fixed by the attached. I will apply shortly.

Thanks,

Julian

ChangeLog

    gcc/fortran/
    * trans-openmp.c (gfc_trans_omp_clauses): Add NULL check for clauses.

commit 7171ab9066e6b4bb84c317d1892a3a0a77cf63ae
Author: Julian Brown <jul...@codesourcery.com>
Date:   Fri Jul 17 11:46:56 2015 -0700

    Add NULL check for clauses in gfc_trans_omp_clauses.

diff --git a/gcc/fortran/trans-openmp.c b/gcc/fortran/trans-openmp.c
index 20a1e65..378dd3b 100644
--- a/gcc/fortran/trans-openmp.c
+++ b/gcc/fortran/trans-openmp.c
@@ -2668,6 +2668,9 @@ gfc_trans_omp_clauses (stmtblock_t *block, gfc_omp_clauses *clauses,
   tree omp_clauses = gfc_trans_omp_clauses_1 (block, clauses, where,
                                               declare_simd);
 
+  if (clauses == NULL)
+    return NULL_TREE;
+
   for (; clauses->device_types; clauses = clauses->dtype_clauses)
     {
       tree c, following_clauses = NULL_TREE, dev_list = NULL_TREE;
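
To see why the guard in this hunk is needed, note that the loop condition dereferences the clause pointer before doing anything else, so a NULL clause set (as in the declare-simd case from the backtrace) faults immediately. The following is a much-simplified stand-alone C sketch of the same shape; `struct clauses` and `count_device_type_groups` are hypothetical stand-ins and not the real gfc_omp_clauses type or any gfortran function.

```c
#include <stddef.h>

/* Hypothetical, much-simplified stand-in for gfc_omp_clauses.  */
struct clauses
{
  int device_types;               /* nonzero if a device_type list is present  */
  struct clauses *dtype_clauses;  /* clauses following that device_type  */
};

/* Count device_type groups hanging off CLAUSES; 0 for a NULL clause set.
   Assumes (as this sketch's invariant) that dtype_clauses is non-NULL
   whenever device_types is nonzero.  */
int
count_device_type_groups (const struct clauses *clauses)
{
  int n = 0;

  /* Without this check, the loop condition below dereferences NULL.  */
  if (clauses == NULL)
    return 0;

  for (; clauses->device_types; clauses = clauses->dtype_clauses)
    n++;

  return n;
}
```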
[gomp4] Remove device-specific filtering during parsing for OpenACC
Hi,

This patch removes the device-specific filtering (for NVidia PTX) from the parsing stages of the host compiler (for the device_type clause -- separately for C, C++ and Fortran) in favour of fully parsing the device_type clauses, but not actually implementing anything for them (device_type support is a feature that we're not planning to implement just yet: the existing support is something of a red herring).

With this patch, the parsed device_type clauses will be ready at OMP lowering time whenever we choose to do something with them (e.g. transforming them into a representation that can be streamed out and re-read by the appropriate offload compiler). The representation is more-or-less the same for all supported languages, modulo clause ordering.

I've altered the dtype-*.* tests to account for the new behaviour (and to not use e.g. mixed-case nVidia or acc_device_nvidia names, which are contrary to the recommendations in the spec).

OK to apply, or any comments?

Thanks,

Julian

ChangeLog

    gcc/
    * gimplify.c (gimplify_scan_omp_clauses): Handle OMP_CLAUSE_DEVICE_TYPE.
    (gimplify_adjust_omp_clauses): Likewise.
    * omp-low.c (scan_sharing_clauses): Likewise.
    (expand_omp_target): Add sorry for device_type support.
    * tree-pretty-print.c (dump_omp_clause): Add device_type support.
    * tree.c (walk_tree_1): Likewise.

    gcc/c/
    * c-parser.c (c_parser_oacc_all_clauses): Don't call
    c_oacc_filter_device_types.
    * c-typeck.c (c_finish_omp_clauses): Handle OMP_CLAUSE_DEVICE_TYPE.

    gcc/cp/
    * parser.c (cp_parser_oacc_all_clauses): Don't call
    c_oacc_filter_device_types.
    * pt.c (tsubst_omp_clauses): Handle OMP_CLAUSE_DEVICE_TYPE.
    * semantics.c (finish_omp_clauses): Likewise.

    gcc/fortran/
    * gfortran.h (gfc_omp_clauses): Change dtype int field to device_types
    gfc_expr_list.
    * openmp.c (gfc_match_omp_clauses): Remove scan_dtype variable (add
    OMP_CLAUSE_DEVICE_TYPE directly to appropriate bitmasks). Parse all
    device_type clauses without filtering.
    (OACC_LOOP_CLAUSE_DEVICE_TYPE_MASK)
    (OACC_KERNELS_CLAUSE_DEVICE_TYPE_MASK)
    (OACC_PARALLEL_CLAUSE_DEVICE_TYPE_MASK)
    (OACC_ROUTINE_CLAUSE_DEVICE_TYPE_MASK)
    (OACC_UPDATE_CLAUSE_DEVICE_TYPE_MASK): Add OMP_CLAUSE_DEVICE_TYPE.
    * trans-openmp.c (gfc_trans_omp_clauses): Translate device_type
    clauses, and split old body into...
    (gfc_trans_omp_clauses_1): New function.

    gcc/testsuite/
    * c-c++-common/goacc/dtype-1.c: Update test for new behaviour.
    * c-c++-common/goacc/dtype-2.c: Likewise.
    * c-c++-common/goacc/dtype-3.c: Likewise.
    * c-c++-common/goacc/dtype-4.c: Likewise.
    * gfortran.dg/goacc/dtype-1.f95: Likewise.
    * gfortran.dg/goacc/dtype-2.f95: Likewise.
    * gfortran.dg/goacc/dtype-3.f: Likewise.

commit 123298186bb8ce87f84b6a3a72743939d4fdae11
Author: Julian Brown <jul...@codesourcery.com>
Date:   Thu Jul 16 08:06:01 2015 -0700

    Fix device_type parsing, add sorry() for missing implementation of remainder.

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 1c65abf..d90c18e 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -12439,10 +12439,7 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
   c_parser_skip_to_pragma_eol (parser);
 
   if (finish_p)
-    {
-      clauses = c_oacc_filter_device_types (clauses);
-      return c_finish_omp_clauses (clauses, true);
-    }
+    return c_finish_omp_clauses (clauses, true);
 
   return clauses;
 }
diff --git a/gcc/c/c-typeck.c b/gcc/c/c-typeck.c
index 98b8e3d..dcc246c 100644
--- a/gcc/c/c-typeck.c
+++ b/gcc/c/c-typeck.c
@@ -12568,6 +12568,10 @@ c_finish_omp_clauses (tree clauses, bool oacc)
          pc = &OMP_CLAUSE_CHAIN (c);
          continue;
 
+       case OMP_CLAUSE_DEVICE_TYPE:
+         pc = &OMP_CLAUSE_DEVICE_TYPE_CLAUSES (c);
+         continue;
+
        case OMP_CLAUSE_INBRANCH:
        case OMP_CLAUSE_NOTINBRANCH:
          if (branch_seen)
diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index 28f0048..80aabed 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -29879,10 +29879,7 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
   cp_parser_skip_to_pragma_eol (parser, pragma_tok);
 
   if (finish_p)
-    {
-      clauses = c_oacc_filter_device_types (clauses);
-      return finish_omp_clauses (clauses, true);
-    }
+    return finish_omp_clauses (clauses, true);
 
   return clauses;
 }
diff --git a/gcc/cp/pt.c b/gcc/cp/pt.c
index 205dc30..056b2c1 100644
--- a/gcc/cp/pt.c
+++ b/gcc/cp/pt.c
@@ -13666,6 +13666,7 @@ tsubst_omp_clauses (tree clauses, bool declare_simd,
        case OMP_CLAUSE_AUTO:
        case OMP_CLAUSE_SEQ:
        case OMP_CLAUSE_TILE:
+       case OMP_CLAUSE_DEVICE_TYPE:
          break;
        default:
          gcc_unreachable ();
diff --git a/gcc/cp/semantics.c b/gcc/cp/semantics.c
index 8935eb6..1ce1dfa 100644
--- a/gcc/cp/semantics.c
+++ b/gcc/cp/semantics.c
@@ -5951,6 +5951,7 @@ finish_omp_clauses (tree clauses, bool oacc)
        case OMP_CLAUSE_BIND:
        case OMP_CLAUSE_NOHOST:
        case
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <ja...@redhat.com> wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered so far) we're somewhat constrained in how much control we have over how the underlying hardware executes code: it's possible to draw up a scheme where OpenACC source-level control-flow semantics are reflected directly in the PTX assembly output (e.g. to say all threads in a CTA/warp will be coherent after such-and-such a loop), and lowering OpenACC directives quite early seems to make that relatively tractable. (Even if the resulting code is relatively un-optimisable due to the abnormal edges inserted to make sure that the CFG doesn't become ill-formed.)
> >
> > If arbitrary optimisations are done between OMP-lowering time and somewhere around vectorisation (say), it's less clear if that correspondence can be maintained. Say if the code executed by half the threads in a warp becomes physically separated from the code executed by the other half of the threads in a warp due to some loop optimisation, we can no longer easily determine where that warp will reconverge, and certain other operations (relying on coherent warps -- e.g. CTA synchronisation) become impossible. A similar issue exists for warps within a CTA.
> >
> > So, essentially -- I don't know how late loop lowering would interact with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes (actually all currently-proposed schemes have problems with proper representation of data-dependencies for variables and compiler-generated temporaries between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at all. In the proposed scheme, you essentially have whole function in e.g. worker-single or vector-single mode, which you need to be able to handle properly in any case, because users can write such routines themselves. And then you can have a loop in such a function that has some special attribute, a hint that it is desirable to vectorize it (for PTX the PTX way) or use vector-single mode for it in a worker-single function. So, the special pass then of course needs to handle all the needed broadcasting and reduction required to change the mode from e.g. worker-single to vector-single, but the convergence points still would be either on the boundary of such loops to be vectorized or parallelized, or wherever else they appear in normal vector-single or worker-single functions (around the calls to certainly calls?).

I think most of my concerns are centred around loops (with the markings you suggest) that might be split into parts: if that cannot happen for loops that are annotated as you describe, maybe things will work out OK.

(Apologies for my ignorance here, this isn't a part of the compiler that I know anything about.)

Julian
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <ja...@redhat.com> wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered so far) we're somewhat constrained in how much control we have over how the underlying hardware executes code: it's possible to draw up a scheme where OpenACC source-level control-flow semantics are reflected directly in the PTX assembly output (e.g. to say all threads in a CTA/warp will be coherent after such-and-such a loop), and lowering OpenACC directives quite early seems to make that relatively tractable. (Even if the resulting code is relatively un-optimisable due to the abnormal edges inserted to make sure that the CFG doesn't become ill-formed.)
> >
> > If arbitrary optimisations are done between OMP-lowering time and somewhere around vectorisation (say), it's less clear if that correspondence can be maintained. Say if the code executed by half the threads in a warp becomes physically separated from the code executed by the other half of the threads in a warp due to some loop optimisation, we can no longer easily determine where that warp will reconverge, and certain other operations (relying on coherent warps -- e.g. CTA synchronisation) become impossible. A similar issue exists for warps within a CTA.
> >
> > So, essentially -- I don't know how late loop lowering would interact with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes (actually all currently-proposed schemes have problems with proper representation of data-dependencies for variables and compiler-generated temporaries between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at all. In the proposed scheme, you essentially have whole function in e.g. worker-single or vector-single mode, which you need to be able to handle properly in any case, because users can write such routines themselves.

In vector-single or worker-single mode, divergence of threads within a warp or a CTA is controlled by broadcasting the controlling expression of conditional branches to the set of inactive threads, so each of those follows along with the active thread. So you only get potentially-problematic thread divergence when workers or vectors are operating in partitioned mode.

So, for instance, a made-up example:

  #pragma acc parallel
  {
    #pragma acc loop gang
    for (i = 0; i < N; i++)
      {
        #pragma acc loop worker
        for (j = 0; j < M; j++)
          {
            if (j < M / 2)
              /* stmt 1 */
            else
              /* stmt 2 */
          }

        /* reconvergence point: thread barrier */

        [...]
      }
  }

Here stmt 1 and stmt 2 execute in worker-partitioned, vector-single mode. With early lowering, the reconvergence point can be inserted at the end of the loop, and abnormal edges (etc.) can be used to ensure that the CFG does not get changed in such a way that there is no longer a unique point at which the loop threads reconverge.

With late lowering, it's no longer obvious to me if that can still be done.

Julian
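
The difference between the two modes can be mimicked on the host with a small C sketch. This is purely an illustrative simulation under my own assumptions: `NTHREADS`, `single_mode_branch` and `partitioned_mode_branch` are hypothetical names, and nothing here reflects how GCC or PTX actually implements predication or broadcasting. In "single" mode one active thread evaluates the condition and every other thread copies the result, so all threads stay on the same path; in "partitioned" mode each thread evaluates the condition for its own iteration and the paths can split, which is why a reconvergence point is needed afterwards.

```c
#include <stdbool.h>

#define NTHREADS 8  /* hypothetical group size, not a real warp/CTA size  */

/* Return true if every thread recorded the same branch direction.  */
static bool
converged_p (const bool taken[NTHREADS])
{
  for (int t = 1; t < NTHREADS; t++)
    if (taken[t] != taken[0])
      return false;
  return true;
}

/* "Single" mode: only thread 0 is active and evaluates the condition;
   the result is broadcast so the inactive threads follow the same branch.  */
bool
single_mode_branch (int j, int m, bool taken[NTHREADS])
{
  bool cond = j < m / 2;        /* evaluated by the active thread only  */

  for (int t = 0; t < NTHREADS; t++)
    taken[t] = cond;            /* broadcast to every thread  */
  return converged_p (taken);
}

/* "Partitioned" mode: thread T handles iteration j == T and evaluates the
   condition itself, so threads may take different branches.  */
bool
partitioned_mode_branch (int m, bool taken[NTHREADS])
{
  for (int t = 0; t < NTHREADS; t++)
    taken[t] = t < m / 2;
  return converged_p (taken);
}
```

In this toy model `single_mode_branch` always reports convergence, while `partitioned_mode_branch` reports divergence whenever the threshold falls inside the range of thread indices.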
Re: [gomp4] Preserve NVPTX reconvergence points
On Fri, 19 Jun 2015 14:25:57 +0200 Jakub Jelinek <ja...@redhat.com> wrote:

> On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> > On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> > > I understand it is more work, I'd just like to ask that when designing stuff for the OpenACC offloading you (plural) try to take the other offloading devices and host fallback into account.
> >
> > The problem is that many of the transformations we need to do are really GPU specific, and with the current structure of omplow/ompexp they are being done in the host compiler. The offloading scheme we decided on does not give us the means to write out multiple versions of an offloaded function where each target gets a different one. For that reason I think we should postpone these lowering decisions until we're in the accel compiler, where they could be controlled by target hooks, and over the last two weeks I've been doing some experiments to see how that could be achieved.
>
> I wonder why struct loop flags and other info together with function attributes and/or cgraph flags and other info aren't sufficient for the OpenACC needs. Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?
>
> Why can't the execution model (normal, vector-single and worker-single) be simply attributes on functions or cgraph node flags and the kind of #acc loop simply be flags on struct loop, like already OpenMP simd / Cilk+ simd is?

One problem is that (at least on the GPU hardware we've considered so far) we're somewhat constrained in how much control we have over how the underlying hardware executes code: it's possible to draw up a scheme where OpenACC source-level control-flow semantics are reflected directly in the PTX assembly output (e.g. to say all threads in a CTA/warp will be coherent after such-and-such a loop), and lowering OpenACC directives quite early seems to make that relatively tractable. (Even if the resulting code is relatively un-optimisable due to the abnormal edges inserted to make sure that the CFG doesn't become ill-formed.)

If arbitrary optimisations are done between OMP-lowering time and somewhere around vectorisation (say), it's less clear if that correspondence can be maintained. Say if the code executed by half the threads in a warp becomes physically separated from the code executed by the other half of the threads in a warp due to some loop optimisation, we can no longer easily determine where that warp will reconverge, and certain other operations (relying on coherent warps -- e.g. CTA synchronisation) become impossible. A similar issue exists for warps within a CTA.

So, essentially -- I don't know how late loop lowering would interact with:

(a) Maintaining a CFG that will work with PTX.

(b) Predication for worker-single and/or vector-single modes (actually all currently-proposed schemes have problems with proper representation of data-dependencies for variables and compiler-generated temporaries between predicated regions.)

Julian
[gomp4] Tests for private variables/state propagation
Hi,

This is a set of tests for OpenACC private variable/state propagation support in GCC. The associated functionality is a work-in-progress: as such, many of these tests do not pass yet (causing incorrect results, ICEs or even bogus assembly output). I believe the tests to be valid OpenACC, though it's possible I misinterpreted the spec at some points!

I will apply to the gomp4 branch shortly. (We will of course be working on addressing the failures.)

Cheers,

Julian

ChangeLog

    libgomp/
    * testsuite/libgomp.oacc-c-c++-common/private-vars-par-gang-{1,2,3}.c:
    New tests.
    * testsuite/libgomp.oacc-c-c++-common/private-vars-local-gang-1.c:
    New test.
    * testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-{1,2,3,4,5,6}.c:
    New tests.
    * testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-{1,2,3,4,5,6,7}.c:
    New tests.
    * testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-{1,2,3,4,5}.c:
    New tests.
    * testsuite/libgomp.oacc-c-c++-common/private-vars-loop-vector-{1,2}.c:
    New tests.

commit 40193f49480f0a0b750d15049d29fd427282c5f0
Author: Julian Brown <jul...@codesourcery.com>
Date:   Tue Jun 16 03:50:55 2015 -0700

    New set of private variable/state propagation tests.

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-gang-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-gang-1.c
new file mode 100644
index 000..ada46d0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-gang-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+/* Test of gang-private variables declared in local scope with parallel
+   directive.  */
+
+#if defined(ACC_DEVICE_TYPE_host) || defined(ACC_DEVICE_TYPE_host_nonshm)
+#define ACTUAL_GANGS 1
+#else
+#define ACTUAL_GANGS 32
+#endif
+
+int
+main (int argc, char* argv[])
+{
+  int x = 5, i, arr[ACTUAL_GANGS];
+
+  for (i = 0; i < ACTUAL_GANGS; i++)
+    arr[i] = 3;
+
+  #pragma acc parallel copy(arr) num_gangs(ACTUAL_GANGS) num_workers(8) \
+                       vector_length(32)
+  {
+    int x;
+
+    #pragma acc loop gang(static:1)
+    for (i = 0; i < ACTUAL_GANGS; i++)
+      x = i * 2;
+
+    #pragma acc loop gang(static:1)
+    for (i = 0; i < ACTUAL_GANGS; i++)
+      arr[i] += x;
+  }
+
+  for (i = 0; i < ACTUAL_GANGS; i++)
+    assert (arr[i] == 3 + i * 2);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-1.c
new file mode 100644
index 000..f8658e5
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-1.c
@@ -0,0 +1,56 @@
+/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
+
+#include <assert.h>
+
+/* Test of worker-private variables declared in a local scope, broadcasting
+   to vector-partitioned mode.  Back-to-back worker loops.  */
+
+int
+main (int argc, char* argv[])
+{
+  int i, arr[32 * 32 * 32];
+
+  for (i = 0; i < 32 * 32 * 32; i++)
+    arr[i] = i;
+
+  #pragma acc parallel copy(arr) num_gangs(32) num_workers(32) vector_length(32)
+  {
+    int j;
+
+    #pragma acc loop gang
+    for (i = 0; i < 32; i++)
+      {
+        #pragma acc loop worker
+        for (j = 0; j < 32; j++)
+          {
+            int k;
+            int x = i ^ j * 3;
+
+            #pragma acc loop vector
+            for (k = 0; k < 32; k++)
+              arr[i * 1024 + j * 32 + k] += x * k;
+          }
+
+        #pragma acc loop worker
+        for (j = 0; j < 32; j++)
+          {
+            int k;
+            int x = i | j * 5;
+
+            #pragma acc loop vector
+            for (k = 0; k < 32; k++)
+              arr[i * 1024 + j * 32 + k] += x * k;
+          }
+      }
+  }
+
+  for (i = 0; i < 32; i++)
+    for (int j = 0; j < 32; j++)
+      for (int k = 0; k < 32; k++)
+        {
+          int idx = i * 1024 + j * 32 + k;
+          assert (arr[idx] == idx + (i ^ j * 3) * k + (i | j * 5) * k);
+        }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-2.c
new file mode 100644
index 000..925f9a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-2.c
@@ -0,0 +1,51 @@
+/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
+
+#include <assert.h>
+
+/* Test of worker-private variables declared in a local scope, broadcasting
+   to vector-partitioned mode.  Successive vector loops.  */
+
+int
+main (int argc, char* argv[])
+{
+  int x = 5, i, arr[32 * 32 * 32];
+
+  for (i = 0; i < 32 * 32 * 32; i++)
+    arr[i] = i;
+
+  #pragma acc parallel copy(arr) num_gangs(32) num_workers(32) vector_length(32)
+  {
+    int j;
+
+    #pragma acc loop gang
+    for (i = 0; i < 32; i++)
+      {
+        #pragma acc loop worker
+        for (j = 0; j < 32; j++)
+          {
+            int k;
+            int x = i ^ j * 3;
+
+            #pragma acc loop vector
+            for (k = 0; k < 32; k++)
+              arr[i * 1024 + j * 32 + k] += x * k;
+
+            x = i
[gomp4] (NVPTX) thread barriers after OpenACC worker loops
Hi, This patch adds a thread barrier after worker loops for OpenACC, in accordance with OpenACC 2.0a section 2.7.3 (worker loops): All workers will complete execution of their assigned iterations before any worker proceeds beyond the end of the loop.. (This is quite target-specific: work to alleviate that is still ongoing.) Barriers are special in that they should not be cloned or subject to excessive code motion: to that end, barriers placed after loops have their (outgoing) edge set to EDGE_ABNORMAL. That seems to suffice to keep the barriers in the right places. This passes libgomp testing when applied on gomp4 branch, and fixes the previously-broken worker-partn-5.c and worker-partn-6.c tests, on top of my previous patches: https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00307.html (ping!), but unfortunately (again, with the above patches) appears to interact badly with Cesar's patch for vector state propagation: https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00371.html I haven't yet investigated why (I reverted that patch in my local series in order to test the attached patch). FYI, Julian ChangeLog gcc/ * omp-low.c (build_oacc_threadbarrier): New function. (oacc_loop_needs_threadbarrier_p): New function. (expand_omp_for_static_nochunk, expand_omp_for_static_chunk): Insert threadbarrier after worker loops. (find_omp_for_region_data): Rename to... (find_omp_for_region_gwv): This. Return mask, rather than modifying REGION structure. (build_omp_regions_1): Move modification of REGION structure to here, after calling above function with new name. (generate_oacc_broadcast): Use new build_oacc_threadbarrier function. (make_gimple_omp_edges): Make edges out of OpenACC worker loop exit block abnormal. * tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add BUILT_IN_GOACC_THREADBARRIER. libgomp/ * testsuite/libgomp.oacc-c-c++-common/worker-partn-5.c: Remove XFAIL. 
    * testsuite/libgomp.oacc-c-c++-common/worker-partn-6.c: Likewise.

commit e46fbc68b7bc7e705417475fcfb8e203056b5a51
Author: Julian Brown jul...@codesourcery.com
Date:   Fri Jun 5 10:01:01 2015 -0700

    Threadbarrier after worker and vector loops.

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 55a2a12..45ff05a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -3691,6 +3691,15 @@ build_omp_barrier (tree lhs)
   return g;
 }
 
+/* Build a call to GOACC_threadbarrier.  */
+
+static gcall *
+build_oacc_threadbarrier (void)
+{
+  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
+  return gimple_build_call (fndecl, 0);
+}
+
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -7181,6 +7190,20 @@ expand_omp_for_generic (struct omp_region *region,
 }
 
 
+/* True if a barrier is needed after a loop partitioned over
+   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
+   that a (conceptual) barrier is needed after worker and vector-partitioned
+   loops, but not after gang-partitioned loops.  Currently we are relying on
+   warp reconvergence to synchronise threads within a warp after vector loops,
+   so an explicit barrier is not helpful after those.  */
+
+static bool
+oacc_loop_needs_threadbarrier_p (int gwv_bits)
+{
+  return (gwv_bits & (MASK_GANG | MASK_WORKER)) == MASK_WORKER;
+}
+
+
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -7523,7 +7546,11 @@ expand_omp_for_static_nochunk (struct omp_region *region,
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
       if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        gcc_checking_assert (t == NULL_TREE);
+        {
+          gcc_checking_assert (t == NULL_TREE);
+          if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
+            gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
+        }
       else
         gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
@@ -7956,7 +7983,11 @@ expand_omp_for_static_chunk (struct omp_region *region,
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
       if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        gcc_checking_assert (t == NULL_TREE);
+        {
+          gcc_checking_assert (t == NULL_TREE);
+          if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
+            gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
+        }
       else
         gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
@@ -10270,22 +10301,26 @@ expand_omp (struct omp_region *region)
 /* Map each basic block to an omp_region.  */
 static hash_map<basic_block, omp_region *> *bb_region_map;
 
-/* Fill in additional data for a region REGION associated with an
+/* Return a mask of GWV bits for region REGION associated with an
    OMP_FOR STMT.  */
 
-static void
-find_omp_for_region_data
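As a stand-alone illustration of the barrier predicate in the patch above, here is a minimal sketch in C. The numeric values of the MASK_* bits are assumed for the example (the real definitions are in omp-low.c on the gomp4 branch and are not shown in this message), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit assignments, for illustration only.  */
enum { MASK_GANG = 1, MASK_WORKER = 2, MASK_VECTOR = 4 };

/* Return true if a loop partitioned as described by GWV_BITS should be
   followed by an explicit thread barrier: that is, if it is
   worker-partitioned but not gang-partitioned.  The vector bit is
   ignored, since vector loops rely on warp reconvergence instead.  */
static bool
needs_threadbarrier_p (int gwv_bits)
{
  /* Reject bits outside the three partitioning levels.  */
  assert ((gwv_bits & ~(MASK_GANG | MASK_WORKER | MASK_VECTOR)) == 0);
  return (gwv_bits & (MASK_GANG | MASK_WORKER)) == MASK_WORKER;
}
```

Under these assumptions a worker loop or a worker+vector loop gets a barrier, while a gang+worker loop, a vector-only loop and an unpartitioned loop do not.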
[gomp4] Add tests for OpenACC worker-single/worker-partitioned modes
Hi,

This patch adds a set of tests for worker-single predication (added by
Bernd in https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00094.html) and
worker-partitioned mode for OpenACC. Results generally look good, though
support for synchronisation after worker loops is currently missing, so
the corresponding tests are XFAILed for NVIDIA (I will look into fixing
that).

I will apply shortly.

Thanks,

Julian

ChangeLog

    libgomp/
    * testsuite/libgomp.oacc-c-c++-common/worker-single-{1,1a,2,3,4,5,6}.c:
    New tests.
    * testsuite/libgomp.oacc-c-c++-common/worker-partn-{1,2,3,4,5,6,7}.c:
    New tests.

commit c4edb6e748c86c2bc5251707f61d4d37679194cf
Author: Julian Brown jul...@codesourcery.com
Date:   Thu Jun 4 07:16:56 2015 -0700

    Add a set of OpenACC worker-single/worker-partitioned mode tests.

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-1.c
new file mode 100644
index 000..1bdb8ea
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-1.c
@@ -0,0 +1,30 @@
+#include <assert.h>
+
+/* Test worker-partitioned/vector-single mode.  */
+
+int
+main (int argc, char *argv[])
+{
+  int arr[32 * 8], i;
+
+  for (i = 0; i < 32 * 8; i++)
+    arr[i] = 0;
+
+  #pragma acc parallel copy(arr) num_gangs(8) num_workers(8) vector_length(32)
+  {
+    int j;
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      {
+        int k;
+        #pragma acc loop worker
+        for (k = 0; k < 8; k++)
+          arr[j * 8 + k] += j * 8 + k;
+      }
+  }
+
+  for (i = 0; i < 32 * 8; i++)
+    assert (arr[i] == i);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-2.c
new file mode 100644
index 000..1023e22
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-2.c
@@ -0,0 +1,44 @@
+#include <assert.h>
+
+/* Test condition in worker-partitioned mode.  */
+
+int
+main (int argc, char *argv[])
+{
+  int arr[32 * 32 * 8], i;
+
+  for (i = 0; i < 32 * 32 * 8; i++)
+    arr[i] = i;
+
+  #pragma acc parallel copy(arr) num_gangs(8) num_workers(8) vector_length(32)
+  {
+    int j;
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      {
+        int k;
+        #pragma acc loop worker
+        for (k = 0; k < 8; k++)
+          {
+            int m;
+            if ((k % 2) == 0)
+              {
+                #pragma acc loop vector
+                for (m = 0; m < 32; m++)
+                  arr[j * 32 * 8 + k * 32 + m]++;
+              }
+            else
+              {
+                #pragma acc loop vector
+                for (m = 0; m < 32; m++)
+                  arr[j * 32 * 8 + k * 32 + m] += 2;
+              }
+          }
+      }
+  }
+
+  for (i = 0; i < 32 * 32 * 8; i++)
+    assert (arr[i] == i + ((i / 32) % 2) + 1);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-3.c
new file mode 100644
index 000..a13a571
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-3.c
@@ -0,0 +1,54 @@
+#include <assert.h>
+
+/* Test switch in worker-partitioned mode.  */
+
+int
+main (int argc, char *argv[])
+{
+  int arr[32 * 32 * 8], i;
+
+  for (i = 0; i < 32 * 32 * 8; i++)
+    arr[i] = i;
+
+  #pragma acc parallel copy(arr) num_gangs(8) num_workers(8) vector_length(32)
+  {
+    int j;
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      {
+        int k;
+        #pragma acc loop worker
+        for (k = 0; k < 8; k++)
+          {
+            int m;
+            switch ((j * 32 + k) % 3)
+              {
+              case 0:
+                #pragma acc loop vector
+                for (m = 0; m < 32; m++)
+                  arr[j * 32 * 8 + k * 32 + m]++;
+                break;
+
+              case 1:
+                #pragma acc loop vector
+                for (m = 0; m < 32; m++)
+                  arr[j * 32 * 8 + k * 32 + m] += 2;
+                break;
+
+              case 2:
+                #pragma acc loop vector
+                for (m = 0; m < 32; m++)
+                  arr[j * 32 * 8 + k * 32 + m] += 3;
+                break;
+
+              default: ;
+              }
+          }
+      }
+  }
+
+  for (i = 0; i < 32 * 32 * 8; i++)
+    assert (arr[i] == i + ((i / 32) % 3) + 1);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-4.c
new file mode 100644
index 000..0902c80
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-4.c
@@ -0,0 +1,54 @@
+#include <assert.h>
+
+/* Test worker-single/worker-partitioned transitions.  */
+
+int
+main (int argc, char *argv[])
+{
+  int n[32], arr[32 * 32], i;
+
+  for (i = 0; i < 32 * 32; i++)
+    arr[i] = 0;
+
+  for (i = 0; i < 32; i++)
+    n[i] = 0;
+
+  #pragma acc parallel copy(n, arr) num_gangs(8) num_workers(16) \
+          vector_length(32)
+  {
+    int j;
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      {
+        int k;
+
+        n[j]++;
+
+        #pragma acc loop worker
+        for (k = 0; k < 32; k++)
+          arr[j * 32 + k]++;
+
+        n[j]++;
+
+        #pragma acc loop worker
+        for (k = 0; k < 32; k++)
+          arr[j * 32 + k]++;
+
+        n[j]++;
+
+        #pragma acc loop worker
+        for (k = 0
Re: [gomp4] Preserve NVPTX reconvergence points
On Thu, 28 May 2015 16:37:04 +0200
Richard Biener richard.guent...@gmail.com wrote:

> On Thu, May 28, 2015 at 4:06 PM, Julian Brown jul...@codesourcery.com wrote:
> > For NVPTX, it is vitally important that the divergence of threads
> > within a warp can be controlled: in particular we must be able to
> > generate code that we know reconverges at a particular point.
> > Unfortunately GCC's middle-end optimisers can cause this property to
> > be violated, which causes problems for the OpenACC execution model
> > we're planning to use for NVPTX.
>
> Hmm, I don't think adding a new edge flag is good nor necessary. It
> seems to me that instead the broadcast operation should have abnormal
> control flow and thus basic-blocks should be split either before or
> after it (so either incoming or outgoing edge(s) should be abnormal).
> I suppose splitting before the broadcast would be best (thus handle it
> similar to setjmp ()).

Here's a version of the patch that uses abnormal edges with semantics
unchanged, splitting the false/non-execution edge using a dummy block to
avoid the prohibited case of both EDGE_TRUE/EDGE_FALSE and EDGE_ABNORMAL
on the outgoing edges of a GIMPLE_COND.

So for a fragment like this:

  if (threadIdx.x == 0) /* cond_bb */
    {
      /* work */
      p0 = ...; /* assign */
    }
  pN = broadcast(p0);
  if (pN) goto T; else goto F;

Incoming edges to a broadcast operation have EDGE_ABNORMAL set:

     +--------+
     |cond_bb |--------,
     +--------+        |
          |            |
   (true edge)   (false edge)
          v            v
     +--------+    +-------+
     | (work) |    | dummy |
     +--------+    +-------+
     | assign |        |
     +--------+        |
   ABNORM |            | ABNORM
          v            |
     +--------+<-------'
     | bcast  |
     +--------+
     | cond   |
     +--------+
       /    \
      T      F

The abnormal edges actually serve two purposes, I think: as well as
ensuring the broadcast operation takes place when a warp is
non-diverged/coherent, they ensure that p0 is not seen as uninitialised
along the false path from cond_bb, possibly leading to the broadcast
operation being optimised away as partially redundant. This feels
somewhat fragile though! We'll have to continue to think about warp
divergence in subsequent patches.

The patch passes libgomp testing (with Bernd's recent worker-single patch
also). OK for gomp4 branch (together with the previously-mentioned inline
thread builtin patch)?

Thanks,

Julian

ChangeLog

    gcc/
    * omp-low.c (make_predication_test): Split false block out of
    cond_bb, making latter edge abnormal.
    (predicate_bb): Set EDGE_ABNORMAL on edges before broadcast
    operations.

commit 38056ae4a29f93ce54715dfad843a233f3b0fd2a
Author: Julian Brown jul...@codesourcery.com
Date:   Mon Jun 1 11:12:41 2015 -0700

    Use abnormal edges before broadcast ops

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 7048f9f..310eb72 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10555,7 +10555,16 @@ make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
   gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
 
   true_edge->flags = EDGE_TRUE_VALUE;
-  make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+
+  /* Force an abnormal edge before a broadcast operation that might be present
+     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
+     respect to the predication done by this function) -- the opposite
+     (execution) edge that reaches the broadcast operation must be made
+     abnormal also, e.g. in this function's caller.  */
+  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+  basic_block false_abnorm_bb = split_edge (e);
+  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
+  abnorm_edge->flags |= EDGE_ABNORMAL;
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10605,6 +10614,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
                                         mask);
 
       edge e = split_block (bb, splitpoint);
+      e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
 
       gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
@@ -10624,6 +10634,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
                                         gsi_asgn, mask);
 
       edge e = split_block (bb, splitpoint);
+      e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
 
       gimple_switch_set_index (sstmt, new_var);
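The edge-splitting step in the patch above can be modelled in isolation. The following is only a toy sketch in C, not GCC's CFG API: the flag values, the `struct edge` layout and both function names are hypothetical, chosen to show why the false edge is routed through a dummy block so that no single edge carries both a TRUE/FALSE flag and the abnormal flag.

```c
#include <assert.h>

/* Assumed flag values, for illustration only.  */
enum { EDGE_TRUE_VALUE = 1, EDGE_FALSE_VALUE = 2, EDGE_ABNORMAL = 4 };

/* Toy edge between two basic blocks identified by integer index.  */
struct edge { int src, dest; unsigned flags; };

/* Route E through the new block DUMMY.  E keeps its flags but now ends
   at DUMMY; the returned edge runs from DUMMY to E's old destination
   and is the one marked abnormal.  */
static struct edge
split_and_mark_abnormal (struct edge *e, int dummy)
{
  struct edge out = { dummy, e->dest, EDGE_ABNORMAL };
  assert ((e->flags & EDGE_ABNORMAL) == 0);
  e->dest = dummy;
  return out;
}

/* Check the invariant mentioned in the message: an edge must not be
   both a TRUE/FALSE edge of a condition and abnormal.  */
static int
edge_flags_valid_p (const struct edge *e)
{
  unsigned cond = e->flags & (EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
  return !(cond && (e->flags & EDGE_ABNORMAL));
}
```

In this model, splitting a false edge from block 0 to block 2 through dummy block 9 leaves a plain false edge 0->9 and an abnormal edge 9->2, both of which satisfy the invariant.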
[gomp4] Expand OpenACC thread builtins inline
For partitioned loops, we're currently calling library functions (in
libgcc) to determine the cardinality of the set of threads a particular
loop is distributed over (given a set of gang/worker/vector toggles), and
the index of the current thread within that set.

This patch reimplements those two functions in terms of the
(PTX-specific!) builtins that Bernd has recently added in order to
implement vector-single/worker-single predication, which expand directly
to machine instructions on the target (or to constant zero/one on the
host). It also makes use of the same gwv bitfields that are set up by
that new code. The previous BUILT_IN_GOACC_GET_THREAD_NUM and
BUILT_IN_GOACC_GET_NUM_THREADS builtins are removed entirely.

This works reasonably well, but there are some regressions caused by
middle-end optimisers having extra freedom to manipulate the CFG in ways
that PTX cannot support without the optimisation barrier of the calls to
the thread builtins being present. This will be addressed by a follow-on
patch.

Pre-approved for gomp4, but I'll wait for comments on the follow-on patch
before applying so as not to leave the branch in a broken state.

Thanks,

Julian

ChangeLog

    gcc/
    * builtins.c (expand_oacc_builtin): Return const1_rtx for
    ntid/nctaid builtins when the associated patterns are not present.
    * omp-builtins.def (BUILT_IN_GOACC_GET_THREAD_NUM)
    (BUILT_IN_GOACC_GET_NUM_THREADS): Remove.
    * omp-low.c (struct omp_for_data): Remove gang, worker, vector
    fields.
    (extract_omp_for_data): Don't initialise deleted gang, worker,
    vector fields.
    (expand_oacc_get_num_threads, expand_oacc_get_thread_num): New
    functions.
    (lower_reduction_clauses): Use above functions.
    (expand_omp_for_static_nochunk): Likewise.
    (expand_omp_for_static_chunk): Likewise.

commit 1be8ada44a9f91d2eba16ef1f81243707647f237
Author: Julian Brown jul...@codesourcery.com
Date:   Fri May 15 03:20:42 2015 -0700

    Inlined OpenACC thread builtins.

diff --git a/gcc/builtins.c b/gcc/builtins.c
index ebd4b4a..cd51821 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -5964,8 +5964,8 @@ expand_oacc_builtin (enum built_in_function fcode, tree exp, rtx target)
     case BUILT_IN_GOACC_NTID:
 #ifdef HAVE_oacc_ntid
       icode = CODE_FOR_oacc_ntid;
-      result = const1_rtx;
 #endif
+      result = const1_rtx;
       break;
     case BUILT_IN_GOACC_TID:
 #ifdef HAVE_oacc_tid
@@ -5975,8 +5975,8 @@ expand_oacc_builtin (enum built_in_function fcode, tree exp, rtx target)
     case BUILT_IN_GOACC_NCTAID:
 #ifdef HAVE_oacc_nctaid
       icode = CODE_FOR_oacc_nctaid;
-      result = const1_rtx;
 #endif
+      result = const1_rtx;
       break;
     case BUILT_IN_GOACC_CTAID:
 #ifdef HAVE_oacc_ctaid
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index ac1f802..47d9e45 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -69,10 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_NCTAID, GOACC_nctaid,
                    BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_CTAID, GOACC_ctaid,
                    BT_FN_UINT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_THREAD_NUM, GOACC_get_thread_num,
-                   BT_FN_INT_INT_INT_INT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_NUM_THREADS, GOACC_get_num_threads,
-                   BT_FN_INT_INT_INT_INT, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GANGLOCAL_PTR, GOACC_get_ganglocal_ptr,
                    BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, GOACC_deviceptr,
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index b114887..f82247b 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -263,7 +263,6 @@ struct omp_for_data
   tree chunk_size;
   gomp_for *for_stmt;
   tree pre, iter_type;
-  tree gang, worker, vector;
   int collapse;
   bool have_nowait, have_ordered;
   enum omp_clause_schedule_kind sched_kind;
@@ -749,16 +748,6 @@ extract_omp_for_data (gomp_for *for_stmt, struct omp_for_data *fd,
       gcc_assert (fd->chunk_size == NULL_TREE);
       fd->chunk_size = build_int_cst (TREE_TYPE (fd->loop.v), 1);
     }
-
-  /* Extract the OpenACC gang, worker and vector clauses.  */
-  t = find_omp_clause (gimple_omp_for_clauses (for_stmt), OMP_CLAUSE_GANG);
-  fd->gang = (t == NULL_TREE) ? integer_zero_node : integer_one_node;
-
-  t = find_omp_clause (gimple_omp_for_clauses (for_stmt), OMP_CLAUSE_WORKER);
-  fd->worker = (t == NULL_TREE) ? integer_zero_node : integer_one_node;
-
-  t = find_omp_clause (gimple_omp_for_clauses (for_stmt), OMP_CLAUSE_VECTOR);
-  fd->vector = (t == NULL_TREE) ? integer_zero_node : integer_one_node;
 }
 
 
@@ -4919,6 +4908,159 @@ is_atomic_compatible_reduction (tree var, omp_context *ctx)
   return true;
 }
 
+
+/* Find the total number of threads used by a region partitioned by
+   GWV_BITS.  Setup code required for the calculation is added to SEQ.  Note
+   that this is currently used from both OMP-lowering and OMP-expansion phases,
+   and uses
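To illustrate what "cardinality of the set of threads" and "index of the current thread within that set" mean for a given set of gang/worker/vector toggles, here is a small host-side sketch in C. It is not the code from the (truncated) patch: the MASK_* values, the two structs, the function names and the gang-major ordering of the index are all assumptions made for the example.

```c
#include <assert.h>

/* Assumed bit assignments, for illustration only.  */
enum { MASK_GANG = 1, MASK_WORKER = 2, MASK_VECTOR = 4 };

/* Launch dimensions and the position of one thread within them.  */
struct dims { unsigned gangs, workers, vectors; };
struct pos { unsigned gang, worker, vector; };

/* Number of threads a loop partitioned by GWV_BITS is spread over:
   the product of the sizes of the enabled levels (1 if none).  */
static unsigned
num_threads (int gwv_bits, struct dims d)
{
  unsigned n = 1;
  assert (d.gangs > 0 && d.workers > 0 && d.vectors > 0);
  if (gwv_bits & MASK_GANG)
    n *= d.gangs;
  if (gwv_bits & MASK_WORKER)
    n *= d.workers;
  if (gwv_bits & MASK_VECTOR)
    n *= d.vectors;
  return n;
}

/* Index of the thread at position P within that set, counting gangs as
   the most significant level and vector lanes as the least.  */
static unsigned
thread_num (int gwv_bits, struct dims d, struct pos p)
{
  unsigned idx = 0;
  if (gwv_bits & MASK_GANG)
    idx = p.gang;
  if (gwv_bits & MASK_WORKER)
    idx = idx * d.workers + p.worker;
  if (gwv_bits & MASK_VECTOR)
    idx = idx * d.vectors + p.vector;
  return idx;
}
```

With 8 gangs, 8 workers and 32 vector lanes, a worker+vector loop is spread over 256 threads, and the thread at worker 2, lane 5 has index 2 * 32 + 5 = 69 regardless of its gang.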
Re: acc_on_device for device_type_host_nonshm
On Thu, 28 May 2015 04:48:58 -0700
H.J. Lu hjl.to...@gmail.com wrote:

> On Thu, May 21, 2015 at 4:10 AM, Jakub Jelinek ja...@redhat.com wrote:
> > On Thu, May 21, 2015 at 01:02:12PM +0200, Thomas Schwinge wrote:
> > > Hi!
> > >
> > > On Thu, 7 May 2015 19:32:26 +0100, Julian Brown jul...@codesourcery.com wrote:
> > > > Here's a new version of the patch [...]
> > > >
> > > > OK for trunk?
> > >
> > > Makes sense to me (with just a request to drop the testsuite
> > > changes, see below), to get the existing regressions under
> > > control.  Jakub?
> >
> > Ok for trunk.
> >
> > > > PR libgomp/65742
> > > >
> > > > gcc/
> > > > * builtins.c (expand_builtin_acc_on_device): Don't use
> > > > open-coded sequence for !ACCEL_COMPILER.
>
> It breaks bootstrap on x86:
>
> https://gcc.gnu.org/ml/gcc-regression/2015-05/msg00389.html
>
> I checked in this to fix it.

Apologies, and thanks!

Julian
[gomp4] Preserve NVPTX reconvergence points
    (canonicalize_loop_closed_ssa): Likewise.
    * predict.c (tree_bb_level_predictions): Likewise.
    * profile.c (instrument_edges, branch_prop, find_spanning_tree):
    Likewise.
    * tree-cfg.c (replace_uses_by, gimple_split_edge)
    (gimple_redirect_edge_and_branch, split_critical_edges): Likewise.
    * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block)
    (pass_merge_phi::execute): Likewise.
    * tree-chkp.c (chkp_fix_cfg): Likewise.
    * tree-if-conv.c (if_convertible_bb_p): Likewise.
    * tree-inline.c (update_ssa_across_abnormal_edges): Likewise.
    * tree-into-ssa.c (rewrite_update_phi_arguments)
    (rewrite_update_dom_walker::before_dom_children)
    (create_new_def_for): Likewise.
    * tree-outof-ssa.c (eliminate_phi): Likewise.
    * tree-phinodes.c (add_phi_arg): Likewise.
    * tree-ssa-coalesce.c (coalesce_cost_edge, create_outofssa_var_map)
    (coalesce_partitions): Likewise.
    * tree-ssa-dom.c (cprop_into_successor_phis)
    (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs):
    Likewise.
    * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise.
    * tree-ssa-loop-prefetch.c (emit_mfence_after_loop)
    (may_use_storent_in_loop_p): Likewise.
    * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise.
    * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block):
    Likewise.
    * tree-ssa-propagate.c (simulate_block, replace_phi_args_in):
    Likewise.
    * tree-ssa-sink.c (sink_code_in_bb): Likewise.
    * tree-ssa-threadedge.c (thread_across_edge): Likewise.
    * tree-ssa-threadupdate.c (thread_single_edge): Likewise.
    * tree-ssa-uninit.c (compute_control_dep_chain): Likewise.
    * tree-ssa.c (verify_phi_args): Likewise.
    * tree-vect-loop.c (vect_analyze_loop_form): Likewise.
    * value-prof.c (gimple_ic): Likewise.
    * tree-vrp.c (infer_value_range, process_assert_insertions_for):
    Likewise.
    (find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges.

commit 472bd543b30356f7a4c59efc961f9f61b11ca197
Author: Julian Brown jul...@codesourcery.com
Date:   Wed May 20 11:35:45 2015 -0700

    Introduce EDGE_TO_RECONVERGENCE, and tweak some uses of EDGE_ABNORMAL.

diff --git a/gcc/basic-block.h b/gcc/basic-block.h
index f28fa57..7fe25f0 100644
--- a/gcc/basic-block.h
+++ b/gcc/basic-block.h
@@ -70,7 +70,8 @@ enum cfg_edge_flags {
    Test the edge flags on EDGE_COMPLEX to detect all forms of strange
    control flow transfers.  */
 #define EDGE_COMPLEX \
-  (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_EH | EDGE_PRESERVE)
+  (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_EH | EDGE_PRESERVE \
+   | EDGE_TO_RECONVERGENCE)
 
 struct GTY(()) rtl_bb_info {
   /* The first insn of the block is embedded into bb->il.x.  */
@@ -559,6 +560,20 @@ bb_has_abnormal_pred (basic_block bb)
   return false;
 }
 
+static inline bool
+bb_has_abnorm_or_reconv_pred (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))
+        return true;
+    }
+  return false;
+}
+
 /* Return the fallthru edge in EDGES if it exists, NULL otherwise.  */
 static inline edge
 find_fallthru_edge (vec<edge, va_gc> *edges)
@@ -629,9 +644,10 @@ has_abnormal_or_eh_outgoing_edge_p (basic_block bb)
   edge_iterator ei;
 
   FOR_EACH_EDGE (e, ei, bb->succs)
-    if (e->flags & (EDGE_ABNORMAL | EDGE_EH))
+    if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE))
       return true;
 
   return false;
 }
+
 #endif /* GCC_BASIC_BLOCK_H */
diff --git a/gcc/cfg-flags.def b/gcc/cfg-flags.def
index eedcd69..fd51e2f 100644
--- a/gcc/cfg-flags.def
+++ b/gcc/cfg-flags.def
@@ -177,6 +177,10 @@ DEF_EDGE_FLAG(TM_UNINSTRUMENTED, 15)
 /* Abort (over) edge out of a GIMPLE_TRANSACTION statement.  */
 DEF_EDGE_FLAG(TM_ABORT, 16)
 
+/* An immutable edge to an OpenACC (currently, NVPTX) reconvergence point.
+   This flag is only used for the GIMPLE CFG.  */
+DEF_EDGE_FLAG(TO_RECONVERGENCE, 17)
+
 #endif
 
 /*
diff --git a/gcc/cfgbuild.c b/gcc/cfgbuild.c
index 7cbed50..7185f07 100644
--- a/gcc/cfgbuild.c
+++ b/gcc/cfgbuild.c
@@ -449,7 +449,7 @@ purge_dead_tablejump_edges (basic_block bb, rtx_jump_table_data *table)
       if (FULL_STATE (e->dest) & BLOCK_USED_BY_TABLEJUMP)
         SET_STATE (e->dest, FULL_STATE (e->dest)
                             & ~(size_t) BLOCK_USED_BY_TABLEJUMP);
-      else if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH)))
+      else if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)))
         {
           remove_edge (e);
           continue;
diff --git a/gcc/cfgcleanup.c b/gcc/cfgcleanup.c
index 797d14a..e73062a 100644
--- a/gcc/cfgcleanup.c
+++ b/gcc/cfgcleanup.c
@@ -2031,7 +2031,7 @@ try_crossjump_to_edge (int mode, edge e1, edge e2,
 
   /* Avoid deleting preserve label when redirecting ABNORMAL edges.  */
   if (block_has_preserve_label (e1->dest)
-      && (e1->flags & EDGE_ABNORMAL))
+      && (e1->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)))
     return false;
 
   /* Here we know that the insns in the end of SRC1
Re: [gomp4] Vector-single predication
On Thu, 21 May 2015 13:57:00 +0200
Jakub Jelinek ja...@redhat.com wrote:

> On Thu, May 21, 2015 at 01:42:11PM +0200, Bernd Schmidt wrote:
> > This uses the patch I committed yesterday which introduces warp
> > broadcasts to implement the vector-single predication needed for
> > OpenACC. Outside a loop with vector parallelism, only one of the
> > threads representing a vector must execute, the others follow along.
> > So we skip the real work in each basic block for the inactive
> > threads, then broadcast the direction to take in the control flow
> > graph from the active one, and jump as a group.
> >
> > This will get extended with similar functionality for worker-single.
> > Julian is working on some patches on top of that to ensure the later
> > optimizers don't destroy the control flow - we really need the
> > threads to reconverge and perform the broadcast/jump in lockstep.
> >
> > Committed on gomp-4_0-branch.
>
> What do you do with function calls? Do you call them just in the
> (tid.x & 31) == 0 threads (then they can't use vectorization), or for
> all threads (then it is an ABI change, they would need to know whether
> they are called this way and depending on that handle it similarly
> (skip all the real work, except for function calls, for
> (tid.x & 31) != 0, unless it is a vectorized region))? Or is OpenACC
> restricting this to statements in the constructs directly (rather than
> anywhere in the region)?

OpenACC handles function calls specially (calling them "routines" -- of
varying sorts, gang, worker, vector or seq, affecting where they can be
invoked from). The plan is that all threads will call such routines --
and then some threads will be neutered as appropriate within the routines
themselves.

That's not actually implemented yet, though.

Julian
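The vector-single scheme described in the quoted message (skip the real work in inactive threads, broadcast the branch direction from the active one, jump as a group) can be simulated on the host. The sketch below is plain C with a loop standing in for the lanes of a warp; the warp width of 32, the helper names and the `x > threshold` condition are assumptions for illustration, and nothing here is the actual PTX expansion.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32  /* Assumed warp width.  */

/* Hypothetical stand-in for a warp broadcast: every lane receives the
   value held by lane 0.  */
static void
warp_broadcast (bool values[WARP_SIZE])
{
  for (int lane = 1; lane < WARP_SIZE; lane++)
    values[lane] = values[0];
}

/* Simulate one vector-single basic block that ends in a conditional
   branch.  Only the active lane evaluates X > THRESHOLD; the result is
   then broadcast so that all lanes branch the same way.  Returns the
   number of lanes that take the "true" edge.  */
static int
vector_single_branch (int x, int threshold)
{
  bool taken[WARP_SIZE] = { false };
  int count = 0;

  for (int lane = 0; lane < WARP_SIZE; lane++)
    if ((lane & (WARP_SIZE - 1)) == 0)  /* Active lane only.  */
      taken[lane] = x > threshold;

  warp_broadcast (taken);

  for (int lane = 0; lane < WARP_SIZE; lane++)
    count += taken[lane] ? 1 : 0;
  return count;
}
```

Either all 32 lanes take the true edge or none do, which is the property the reconvergence work in this thread is trying to preserve through the optimisers.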
Re: [gomp4] Vector-single predication
On Thu, 21 May 2015 14:38:19 +0100
Julian Brown jul...@codesourcery.com wrote:

> On Thu, 21 May 2015 15:21:54 +0200
> Jakub Jelinek ja...@redhat.com wrote:
>
> > On Thu, May 21, 2015 at 02:05:12PM +0100, Julian Brown wrote:
> > > OpenACC handles function calls specially (calling them routines --
> > > of varying sorts, gang, worker, vector or seq, affecting where they
> > > can be invoked from). The plan is that all threads will call such
> > > routines -- and then some threads will be neutered as appropriate
> > > within the routines themselves, as appropriate.
> >
> > All functions will behave that way, or just some using some magic
> > attribute etc.? Say will newlib functions behave this way (math
> > functions, printf, ...)?
>
> It's actually unclear at this point if regular functions are supported
> by OpenACC at all (the spec says nothing about them). They probably
> raise interesting questions about re-entrancy, synchronisation, and so
> on.

...actually, replied too soon: regular math functions, etc. will be
handled the same as routines declared with "seq". They won't contain
partitioned loops, and can be called from anywhere in an offloaded
region.

Julian
Re: [gomp4] Vector-single predication
On Thu, 21 May 2015 15:21:54 +0200
Jakub Jelinek ja...@redhat.com wrote:

> On Thu, May 21, 2015 at 02:05:12PM +0100, Julian Brown wrote:
> > OpenACC handles function calls specially (calling them routines --
> > of varying sorts, gang, worker, vector or seq, affecting where they
> > can be invoked from). The plan is that all threads will call such
> > routines -- and then some threads will be neutered as appropriate
> > within the routines themselves, as appropriate.
>
> All functions will behave that way, or just some using some magic
> attribute etc.? Say will newlib functions behave this way (math
> functions, printf, ...)?

It's actually unclear at this point if regular functions are supported by
OpenACC at all (the spec says nothing about them). They probably raise
interesting questions about re-entrancy, synchronisation, and so on.

> For math functions e.g. it would be nice if they could behave both
> ways (perhaps as separate entrypoints), so have the possibility to say
> how many threads from the warp will perform the operation and then
> work on array arguments and array return value (kind of like OpenMP or
> Cilk+ elemental functions, just perhaps with different argument/return
> value passing conventions).

And that's something that's way outside the spec as currently defined,
AFAIK.

Julian
[gomp4] Lack of OpenACC NVPTX devices is not an error during scanning
Hi,

This patch fixes an oversight: if the CUDA libraries are available for
some reason on a system that doesn't actually contain an NVIDIA card, an
OpenACC program will raise an error when the NVPTX backend is picked as
the default, rather than falling back to some other device.

OK for gomp4 branch? For trunk?

Thanks,

Julian

ChangeLog

    libgomp/
    * plugin/plugin-nvptx.c (nvptx_get_num_devices): Return zero on
    cuInit failure.

commit 696a0d7e22bb8217ff581886cdf0979bfc2e85bb
Author: Julian Brown jul...@codesourcery.com
Date:   Fri May 15 03:22:56 2015 -0700

    Lack of PTX devices is not an error during scanning.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index b36691a..d09a91c 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -781,7 +781,13 @@ nvptx_get_num_devices (void)
      until cuInit has been called.  Just call it now (but don't yet do any
      further initialization).  */
   if (instantiated_devices == 0)
-    cuInit (0);
+    {
+      r = cuInit (0);
+      /* This is not an error: e.g. we may have CUDA libraries installed but
+         no devices available.  */
+      if (r != CUDA_SUCCESS)
+        return 0;
+    }
 
   r = cuDeviceGetCount (&n);
   if (r!= CUDA_SUCCESS)
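The control flow of the fix can be shown without linking against CUDA. The sketch below is a stand-alone model in C: the result codes, the function-pointer types and the stub functions are hypothetical stand-ins (not the CUDA driver API), and returning -1 when the count query fails is an assumption, since the corresponding part of the plugin is not shown in the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver result codes; not the CUDA API.  */
enum drv_result { DRV_SUCCESS = 0, DRV_ERROR = 1 };

typedef enum drv_result (*init_fn) (void);
typedef enum drv_result (*count_fn) (int *);

/* Return the number of usable devices.  A failed initialisation is not
   treated as an error: it just means there are zero devices, so the
   caller can fall back to another device type.  Returns -1 (assumed
   convention) if initialisation worked but the count query failed.  */
static int
get_num_devices (init_fn init, count_fn count)
{
  int n = 0;
  assert (init != NULL && count != NULL);
  if (init () != DRV_SUCCESS)
    return 0;
  if (count (&n) != DRV_SUCCESS)
    return -1;
  return n;
}

/* Example stubs modelling a machine with and without a working driver.  */
static enum drv_result init_ok (void) { return DRV_SUCCESS; }
static enum drv_result init_fail (void) { return DRV_ERROR; }
static enum drv_result count_two (int *n) { *n = 2; return DRV_SUCCESS; }
static enum drv_result count_fail (int *n) { (void) n; return DRV_ERROR; }
```

With `init_fail` the model reports zero devices instead of failing, which is the behaviour the patch introduces for `nvptx_get_num_devices`.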
[gomp4] Add OpenACC vector-single/vector-partitioned tests
Hi,

This patch adds several tests of vector-single/vector-partitioned mode,
as part of work implementing the OpenACC execution model.

Pre-approved for gomp4 branch. I will apply there shortly.

Thanks,

Julian

ChangeLog

    libgomp/
    * testsuite/libgomp.oacc-c-c++-common/vec-single-{1,2,3,4,5,6}.c: New
    tests.
    * testsuite/libgomp.oacc-c-c++-common/vec-partn-{1,2,3,4,5,6}.c: New
    tests.

commit b2bb572cef2b6b0984d65995e070dc424b03a525
Author: jbrown jbrown@e7755896-6108-0410-9592-8049d3e74e28
Date:   Mon May 11 16:04:48 2015 +

    Add vector-single/vector-partitioned tests.

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-1.c
new file mode 100644
index 000..b21e588
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-1.c
@@ -0,0 +1,30 @@
+#include <assert.h>
+
+/* Test basic vector-partitioned mode transitions.  */
+
+int
+main (int argc, char *argv[])
+{
+  int n = 0, arr[32], i;
+
+  for (i = 0; i < 32; i++)
+    arr[i] = 0;
+
+  #pragma acc parallel copy(n, arr) num_gangs(1) num_workers(1) \
+          vector_length(32)
+  {
+    int j;
+    n++;
+    #pragma acc loop vector
+    for (j = 0; j < 32; j++)
+      arr[j]++;
+    n++;
+  }
+
+  assert (n == 2);
+
+  for (i = 0; i < 32; i++)
+    assert (arr[i] == 1);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-2.c
new file mode 100644
index 000..1ff222d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-2.c
@@ -0,0 +1,43 @@
+#include <assert.h>
+
+/* Test vector-partitioned, gang-partitioned mode.  */
+
+int
+main (int argc, char *argv[])
+{
+  int n[32], arr[1024], i;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = 0;
+
+  for (i = 0; i < 32; i++)
+    n[i] = 0;
+
+  #pragma acc parallel copy(n, arr) num_gangs(32) num_workers(1) \
+          vector_length(32)
+  {
+    int j, k;
+
+    #pragma acc loop gang(static:*)
+    for (j = 0; j < 32; j++)
+      n[j]++;
+
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      #pragma acc loop vector
+      for (k = 0; k < 32; k++)
+        arr[j * 32 + k]++;
+
+    #pragma acc loop gang(static:*)
+    for (j = 0; j < 32; j++)
+      n[j]++;
+  }
+
+  for (i = 0; i < 32; i++)
+    assert (n[i] == 2);
+
+  for (i = 0; i < 1024; i++)
+    assert (arr[i] == 1);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-3.c
new file mode 100644
index 000..7908d4c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-3.c
@@ -0,0 +1,54 @@
+#include <assert.h>
+
+/* Test conditional vector-partitioned loops.  */
+
+int
+main (int argc, char *argv[])
+{
+  int n[32], arr[1024], i;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = 0;
+
+  for (i = 0; i < 32; i++)
+    n[i] = 0;
+
+  #pragma acc parallel copy(n, arr) num_gangs(32) num_workers(1) \
+          vector_length(32)
+  {
+    int j, k;
+
+    #pragma acc loop gang(static:*)
+    for (j = 0; j < 32; j++)
+      n[j]++;
+
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      {
+        if ((j % 2) == 0)
+          {
+            #pragma acc loop vector
+            for (k = 0; k < 32; k++)
+              arr[j * 32 + k]++;
+          }
+        else
+          {
+            #pragma acc loop vector
+            for (k = 0; k < 32; k++)
+              arr[j * 32 + k]--;
+          }
+      }
+
+    #pragma acc loop gang(static:*)
+    for (j = 0; j < 32; j++)
+      n[j]++;
+  }
+
+  for (i = 0; i < 32; i++)
+    assert (n[i] == 2);
+
+  for (i = 0; i < 1024; i++)
+    assert (arr[i] == (i % 64) < 32 ? 1 : -1);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-4.c
new file mode 100644
index 000..4ea3bf2
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-4.c
@@ -0,0 +1,46 @@
+#include <assert.h>
+
+/* Test conditions inside vector-partitioned loops.  */
+
+int
+main (int argc, char *argv[])
+{
+  int n[32], arr[1024], i;
+
+  for (i = 0; i < 1024; i++)
+    arr[i] = i;
+
+  for (i = 0; i < 32; i++)
+    n[i] = 0;
+
+  #pragma acc parallel copy(n, arr) num_gangs(32) num_workers(1) \
+          vector_length(32)
+  {
+    int j, k;
+
+    #pragma acc loop gang(static:*)
+    for (j = 0; j < 32; j++)
+      n[j]++;
+
+    #pragma acc loop gang
+    for (j = 0; j < 32; j++)
+      {
+        #pragma acc loop vector
+        for (k = 0; k < 32; k++)
+          if ((arr[j * 32 + k] % 2) != 0)
+            arr[j * 32 + k] *= 2;
+      }
+
+    #pragma acc loop gang(static:*)
+    for (j = 0; j < 32; j++)
+      n[j]++;
+  }
+
+  for (i = 0; i < 32; i++)
+    assert (n[i] == 2);
+
+  for (i = 0; i < 1024; i++)
+    assert (arr[i] == ((i % 2) == 0 ? i : i * 2));
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vec-partn-5.c
new file mode 100644
index
Re: acc_on_device for device_type_host_nonshm (was: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks) (PR65742)
On Fri, 17 Apr 2015 15:16:19 +0200 Jakub Jelinek ja...@redhat.com wrote: On Tue, Apr 14, 2015 at 05:43:26PM +0200, Thomas Schwinge wrote: On Tue, 14 Apr 2015 15:15:02 +0100, Julian Brown jul...@codesourcery.com wrote: On Wed, 8 Apr 2015 17:58:56 +0300 Ilya Verbin iver...@gmail.com wrote: I see several regressions: FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test I think there may be multiple issues here. The attached patch addresses one -- acc_device_type not distinguishing between offloaded and host code with the host_nonshm plugin. (You mean acc_on_device?) --- libgomp/oacc-init.c (revision 221922) +++ libgomp/oacc-init.c (working copy) @@ -548,7 +549,14 @@ ialias (acc_set_device_num) int acc_on_device (acc_device_t dev) { - if (acc_get_device_type () == acc_device_host_nonshm) + struct goacc_thread *thr = goacc_thread (); + + /* We only want to appear to be the host_nonshm plugin from offloaded + code -- i.e. within a parallel region. Test a flag set by the + openacc_parallel hook of the host_nonshm plugin to determine that. */ + if (acc_get_device_type () == acc_device_host_nonshm + thr thr-target_tls + ((struct nonshm_thread *)thr-target_tls)-nonshm_exec) return dev == acc_device_host_nonshm || dev == acc_device_not_host; /* Just rely on the compiler builtin. */ Really, acc_on_device is implemented as a compiler builtin (which is just disabled for a few libgomp test cases, in order to test the acc_on_device library function in libgomp), and I never understood why the fallback implementation in libgomp (cited above) should be doing anything different from the GCC builtin. Is the problem actually, that some The question is if the builtin expansion isn't wrong, at least as long as the host_nonshm device is meant to be supported. 
The #ifdef ACCEL_COMPILER case is easier, at least as long as ACCEL_COMPILER compiled code is not meant to be able to offload to other devices (or host again), but the non-ACCEL_COMPILER case means the code is either on the host, or host_nonshm, or e.g. with Intel MIC you could have some shared library be compiled by the host compiler, but then actually linked into the MIC offloaded path. In all those cases, I think it is just the library that can determine the return value. E.g. the OpenMP omp_is_initial_device function is also only implemented in the library; perhaps at some point I could expand it for #ifdef ACCEL_COMPILER as a builtin, but not for the host code, at least not due to the host_nonshm plugin.

Here's a new version of the patch that doesn't use the open-coded expansion for acc_on_device for the host compiler at all. This means that the host and the host_nonshm plugin should DTRT without any special compiler options (which have thus been removed from the libgomp tests that set them or refer to them).

So now, for the host, acc_on_device returns:

  acc_on_device (acc_device_none): true
  acc_on_device (acc_device_host): true
  otherwise: false

When the host_nonshm plugin is active, acc_on_device returns:

  acc_on_device (acc_device_host_nonshm): true (except when host fallback is in effect, i.e. because of a false "if" clause).
  acc_on_device (acc_device_not_host): likewise.
  otherwise: false

In particular, the host_nonshm plugin doesn't consider itself to be running code on the host.

OK for trunk?

Julian

ChangeLog

    PR libgomp/65742

    gcc/
    * builtins.c (expand_builtin_acc_on_device): Don't use open-coded
    sequence for !ACCEL_COMPILER.

    libgomp/
    * oacc-init.c (plugin/plugin-host.h): Include.
    (acc_on_device): Check whether we're in an offloaded region for
    host_nonshm plugin. Don't use __builtin_acc_on_device.
    * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_parallel): Set
    nonshm_exec flag in thread-local data.
    (GOMP_OFFLOAD_openacc_create_thread_data): Allocate thread-local
    data for host_nonshm plugin.
    (GOMP_OFFLOAD_openacc_destroy_thread_data): Free thread-local data
    for host_nonshm plugin.
    * plugin/plugin-host.h: New.
    * testsuite/libgomp.oacc-c-c++-common/acc_on_device-1.c: Remove
    -fno-builtin-acc_on_device flag.
    * testsuite/libgomp.oacc-c-c++-common/if-1.c: Likewise.
    * testsuite/libgomp.oacc-fortran/acc_on_device-1-1.f90: Remove
    comment re: acc_on_device builtin.
    * testsuite/libgomp.oacc-fortran/acc_on_device-1-2.f: Likewise.
    * testsuite/libgomp.oacc-fortran/acc_on_device-1-3.f: Likewise.

commit adccf2e7d313263d585f63e752a4d36653d47811
Author: Julian Brown jul...@codesourcery.com
Date: Tue Apr 21 12:40:45 2015 -0700

    Non-SHM acc_on_device fixes

diff --git a/gcc/builtins.c b/gcc
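The return values listed in this message can be modelled roughly as follows. This is only an illustrative sketch, not code from the patch: the `dev_type` enum, `model_acc_on_device` and its two boolean parameters are hypothetical stand-ins for the real `acc_device_t` values and libgomp's internal state, and the host-fallback case is assumed to behave exactly like plain host execution.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the acc_device_t values discussed above;
   the real enumerators are declared in openacc.h.  */
enum dev_type { DEV_NONE, DEV_HOST, DEV_HOST_NONSHM, DEV_NOT_HOST, DEV_OTHER };

/* Model of the described behaviour.  NONSHM_ACTIVE says whether the
   host_nonshm plugin is the current device; IN_OFFLOAD says whether we are
   inside an offloaded region (assumed false under host fallback, e.g. for
   a false "if" clause).  */
static bool
model_acc_on_device (enum dev_type dev, bool nonshm_active, bool in_offload)
{
  if (nonshm_active && in_offload)
    return dev == DEV_HOST_NONSHM || dev == DEV_NOT_HOST;

  /* Plain host execution, or host fallback.  */
  return dev == DEV_NONE || dev == DEV_HOST;
}
```

The point of the second parameter is that the plugin type alone is not enough: the same thread answers differently depending on whether it is currently running an offloaded region.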
Re: [PATCH] Fix OpenACC shutdown and PTX image unloading (PR65904)
On Wed, 6 May 2015 10:32:56 +0200 Thomas Schwinge tho...@codesourcery.com wrote: Hi! On Fri, 1 May 2015 10:47:19 +0100, Julian Brown jul...@codesourcery.com wrote: The patch also fixes a thinko that was revealed in image unloading in the NVPTX backend. Tested for libgomp with PTX offloading. Confirming that both nvptx (PR65904 fixed) and (emulated) intelmic (no changes) testing look fine. Thanks for testing! By the way, do we need to lock ptx_devices in libgomp/plugin/plugin-nvptx.c:nvptx_attach_host_thread_to_device and libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_openacc_create_thread_data? (ptx_dev_lock? If yes, its definition as well as instantiated_devices should be moved to a more appropriate place, probably?) Probably yes (though I'm not sure what you mean about moving the instantiated_devices and ptx_dev_lock to a more appropriate place?). Also, several accesses to instantiated_devices are not locked by ptx_dev_lock but should be, from my cursory review. I'm not sure about that. --- a/libgomp/target.c +++ b/libgomp/target.c @@ -797,32 +797,79 @@ GOMP_offload_register (void *host_table, enum offload_target_type target_type, gomp_mutex_unlock (register_lock); } -/* This function should be called from every offload image while unloading. - It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of - the target, and TARGET_DATA needed by target plugin. */ +/* DEVICEP should be locked on entry, and remains locked on exit. */ (I'm not a native speaker, but would use what I consider to be more explicit/precise language: »must be locked« instead of »should be«. I'll be happy to learn should they mean the same thing?) I've changed the wording in a couple of comments. 
-void -GOMP_offload_unregister (void *host_table, enum offload_target_type target_type, -void *target_data) +static void +gomp_deoffload_image_from_device (struct gomp_device_descr *devicep, + void *host_table, void *target_data) { +/* This function should be called from every offload image while unloading. s%from%for%, I think? (And, s%should%must%, again?) No, this really is from -- this comment wasn't actually added by my patch, just moved. I'm also not sure about should in this instance -- unloading an image is already a corner-case, and maybe there are circumstances in which it'd be impossible for some given object to call the function? + It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of + the target, and TARGET_DATA needed by target plugin. */ + +void +GOMP_offload_unregister (void *host_table, enum offload_target_type target_type, +void *target_data) +{ -/* Free address mapping tables. MM must be locked on entry, and remains locked - on return. */ +/* Free address mapping tables for an active device DEVICEP. This includes + both mapped offload functions/variables, and mapped user data regions. + To be used before shutting a device down: subsequently reinitialising the + device will repopulate the offload image mappings. */ attribute_hidden void -gomp_free_memmap (struct splay_tree_s *mem_map) +gomp_free_memmap (struct gomp_device_descr *devicep) { + int i; + struct splay_tree_s *mem_map = devicep-mem_map; + + assert (devicep-is_initialized); + + gomp_mutex_lock (devicep-lock); Need to lock before first access to *devicep? Fixed. + + /* Unmap offload images that are registered to this device. */ + for (i = 0; i num_offload_images; i++) +{ + struct offload_image_descr *image = offload_images[i]; Need to take register_lock when accessing offload_images? This too. Retested for libgomp/NVPTX. OK for trunk now? Thanks, Julian ChangeLog PR libgomp/65904 libgomp/ * libgomp.h (gomp_free_memmap): Update prototype. 
    * oacc-init.c (acc_shutdown_1): Pass device descriptor to
    gomp_free_memmap. Don't lock device around call.
    * target.c (gomp_map_vars): Initialise tgt->array to NULL before
    early exit.
    (GOMP_offload_unregister): Split out and call...
    (gomp_deoffload_image_from_device): This new function.
    (gomp_free_memmap): Call gomp_deoffload_image_from_device.
    * plugin/nvptx.c (struct ptx_image_data): Add ord, fn_descs fields.
    (nvptx_init): Tweak comment.
    (nvptx_attach_host_thread_to_device): Add locking with ptx_dev_lock
    around ptx_devices accesses.
    (GOMP_OFFLOAD_load_image): Populate new ptx_image_data fields.
    (GOMP_OFFLOAD_unload_image): Switch to ORD'th device before freeing
    images, and use fn_descs field from ptx_image_data instead of
    incorrectly using a pointer derived from target_data.
    (GOMP_OFFLOAD_openacc_create_thread_data): Add locking around
    ptx_devices accesses.

Index: libgomp/target.c
Re: OpenACC: initialization with unsupported acc_device_t (was: [PR testsuite/65205, libgomp/65993] Fix dg-shouldfail usage in OpenACC libgomp tests)
On Tue, 5 May 2015 16:09:18 +0200 Thomas Schwinge tho...@codesourcery.com wrote: Hi! On Tue, 5 May 2015 08:43:48 -0400, John David Anglin dave.ang...@bell.net wrote: On 2015-05-05 5:43 AM, Thomas Schwinge wrote: FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/lib-62.c -DACC_DEVICE_TYPE_hos t=1 -DACC_MEM_SHARED=1 output pattern test, is , should match invalid size With this one I'll need your help: please cite from libgomp.log (or, from a manual run) the actual output message that you're getting. There's no output message: # ./lib-62.exe Segmentation fault (core dumped) As this is a PA-RISC HP-UX system, I feel certain that you don't actually have nvptx offloading available (so, the nvptx libgomp plugin is not being built). However, this test case, contains an unconditional acc_init call for acc_device_nvidia, and I would then guess that this situation is not (not anymore?) correctly handled (abort with »offloading to [...] not possible«, or similar; see libgomp.oacc-c-c++-common/lib-4.c) in libgomp -- Julian, could this be due to your recent libgomp OpenACC initialization changes? (When working on this in a build that does have nvptx offloading configured, I think you should be able to simulate the situation by hiding (temporarily deleting, or similar) the nvptx libgomp plugin?) The attached patch contains (what I hope should be) a fix for this, tested by running the libgomp testsuite (with nvptx offloading), and by deleting the nvptx plugin, with the patch applied, and ensuring that lib-62.c no longer segfaults in that case. The patch also tidies up a few other error paths around resolve_device, and de-duplicates some error message reporting code. Then, I don't know why libgomp.oacc-c-c++-common/lib-62.c contains this explicit acc_init call with acc_device_nvidia -- generally, the test cases should not contain such unconditional statements. So, let's then please remove this. 
See libgomp/testsuite/libgomp.oacc-c-c++-common/lib-66.c for a very similar test case, which does this differently. I've not touched this test though -- but I have tweaked libgomp.oacc-c-c++-common/lib-4.c that should now expect a slightly different error output. OK for trunk? Thanks, Julian ChangeLog libgomp/ * oacc-init.c (resolve_device): Add FAIL_IS_ERROR argument. Update function comment. Only call gomp_fatal if new argument is true. (acc_dev_num_out_of_range): New function. (acc_init_1, acc_shutdown_1): Update call to resolve_device. Call acc_dev_num_out_of_range as appropriate. (acc_get_num_devices, acc_set_device_type, acc_get_device_type) (acc_get_device_num, acc_set_device_num): Update calls to resolve_device. * testsuite/libgomp.oacc-c-c++-common/lib-4.c: Update expected test output. commit 221b5dea47cdb7611456ca3cf28d180d3ff1156a Author: Julian Brown jul...@codesourcery.com Date: Thu May 7 08:39:16 2015 -0700 Clean up initialisation when no devices of a particular type are available. diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c index f2c60ec..cd50521 100644 --- a/libgomp/oacc-init.c +++ b/libgomp/oacc-init.c @@ -109,10 +109,12 @@ name_of_acc_device_t (enum acc_device_t type) } } -/* ACC_DEVICE_LOCK should be held before calling this function. */ +/* ACC_DEVICE_LOCK must be held before calling this function. If FAIL_IS_ERROR + is true, this function raises an error if there are no devices of type D, + otherwise it returns NULL in that case. 
*/

 static struct gomp_device_descr *
-resolve_device (acc_device_t d)
+resolve_device (acc_device_t d, bool fail_is_error)
 {
   acc_device_t d_arg = d;

@@ -130,7 +132,13 @@ resolve_device (acc_device_t d)
               dispatchers[d]->get_num_devices_func () > 0)
             goto found;

-          gomp_fatal ("device type %s not supported", goacc_device_type);
+          if (fail_is_error)
+            {
+              gomp_mutex_unlock (&acc_device_lock);
+              gomp_fatal ("device type %s not supported", goacc_device_type);
+            }
+          else
+            return NULL;
         }

       /* No default device specified, so start scanning for any non-host
@@ -149,7 +157,13 @@ resolve_device (acc_device_t d)
           d = acc_device_host;
           goto found;
         }
-      gomp_fatal ("no device found");
+      if (fail_is_error)
+        {
+          gomp_mutex_unlock (&acc_device_lock);
+          gomp_fatal ("no device found");
+        }
+      else
+        return NULL;
       break;

     case acc_device_host:
@@ -157,7 +171,12 @@ resolve_device (acc_device_t d)

     default:
       if (d > _ACC_device_hwm)
-        gomp_fatal ("device %u out of range", (unsigned)d);
+        {
+          if (fail_is_error)
+            goto unsupported_device;
+          else
+            return NULL;
+        }
       break;
     }
  found:
@@ -166,12 +185,30 @@ resolve_device (acc_device_t d)
           && d != acc_device_default
           && d != acc_device_not_host);

+  if (dispatchers[d] == NULL && fail_is_error)
+    {
+    unsupported_device:
+      gomp_mutex_unlock (&acc_device_lock);
+      gomp_fatal ("device
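The FAIL_IS_ERROR convention used by this patch can be sketched in isolation as below. This is an assumption-laden illustration, not libgomp code: `struct device`, `devices`, `device_lock`, `lookup_device` and `device_available` are all hypothetical names, and a pthread mutex stands in for libgomp's own mutex type.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical device table: a slot is non-NULL if a plugin is loaded.  */
struct device { const char *name; };
static struct device host_dev = { "host" };
static struct device *devices[2] = { &host_dev, NULL };
static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

/* DEVICE_LOCK must be held on entry.  Return the device for IDX; if none
   is available, either exit with an error (FAIL_IS_ERROR) or return NULL.
   The lock is released before the fatal exit, so the process never exits
   while still holding it.  */
static struct device *
lookup_device (size_t idx, bool fail_is_error)
{
  struct device *d = idx < 2 ? devices[idx] : NULL;

  if (d == NULL && fail_is_error)
    {
      pthread_mutex_unlock (&device_lock);
      fprintf (stderr, "device %zu not supported\n", idx);
      exit (EXIT_FAILURE);
    }
  return d;
}

/* A query-style caller tolerates a missing device instead of dying.  */
static bool
device_available (size_t idx)
{
  pthread_mutex_lock (&device_lock);
  bool ok = lookup_device (idx, false) != NULL;
  pthread_mutex_unlock (&device_lock);
  return ok;
}
```

Entry points that merely ask about devices would pass false and handle NULL themselves, while initialisation-style entry points would pass true and can assume a non-NULL result.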
[PATCH] Fix OpenACC shutdown and PTX image unloading (PR65904)
Hi,

This patch fixes PR65904, a double-free error that started occurring after recent libgomp changes to the way offload images are registered with the runtime. Offload images now map all functions/data using just two malloc'ed blocks, but the function gomp_free_memmap did not take that into account, and treated all mappings as if they had their own blocks (as they do if created by gomp_map_vars): so attempting to free the whole map at once failed when it hit mappings for an offload image.

The fix is to split offload-image freeing out of GOMP_offload_unregister into a separate function, and call that from gomp_free_memmap for the given device before freeing the rest of the memory map.

The patch also fixes a thinko that was revealed in image unloading in the NVPTX backend.

Tested for libgomp with PTX offloading. OK for trunk?

Thanks,

Julian

ChangeLog

    libgomp/
    * libgomp.h (gomp_free_memmap): Update prototype.
    * oacc-init.c (acc_shutdown_1): Pass device descriptor to
    gomp_free_memmap. Don't lock device around call.
    * target.c (gomp_map_vars): Initialise tgt->array to NULL before
    early exit.
    (GOMP_offload_unregister): Split out and call...
    (gomp_deoffload_image_from_device): This new function.
    (gomp_free_memmap): Call gomp_deoffload_image_from_device.
    * plugin/nvptx.c (struct ptx_image_data): Add ord, fn_descs fields.
    (GOMP_OFFLOAD_load_image): Populate above fields.
    (GOMP_OFFLOAD_unload_image): Switch to ORD'th device before freeing
    images, and use fn_descs field from ptx_image_data instead of
    incorrectly using a pointer derived from target_data.

commit 14e8e35a494a5a8231ab1a3cad38a2157bca7e4a
Author: Julian Brown jul...@codesourcery.com
Date: Thu Apr 30 10:19:58 2015 -0700

    Fix freeing of memory maps during acc shutdown.
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 5272f01..5e0e09c 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -777,7 +777,7 @@ extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *, extern void gomp_copy_from_async (struct target_mem_desc *); extern void gomp_unmap_vars (struct target_mem_desc *, bool); extern void gomp_init_device (struct gomp_device_descr *); -extern void gomp_free_memmap (struct splay_tree_s *); +extern void gomp_free_memmap (struct gomp_device_descr *); extern void gomp_fini_device (struct gomp_device_descr *); /* work.c */ diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c index 503f8b8..f2c60ec 100644 --- a/libgomp/oacc-init.c +++ b/libgomp/oacc-init.c @@ -245,9 +245,7 @@ acc_shutdown_1 (acc_device_t d) if (walk-dev) { - gomp_mutex_lock (walk-dev-lock); - gomp_free_memmap (walk-dev-mem_map); - gomp_mutex_unlock (walk-dev-lock); + gomp_free_memmap (walk-dev); walk-dev = NULL; walk-base_dev = NULL; diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index 583ec87..2cc0ae0 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -334,8 +334,10 @@ struct ptx_event struct ptx_image_data { + int ord; void *target_data; CUmodule module; + struct targ_fn_descriptor *fn_descs; struct ptx_image_data *next; }; @@ -1625,13 +1627,6 @@ GOMP_OFFLOAD_load_image (int ord, void *target_data, link_ptx (module, img_header[0]); - pthread_mutex_lock (ptx_image_lock); - new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data)); - new_image-target_data = target_data; - new_image-module = module; - new_image-next = ptx_images; - ptx_images = new_image; - pthread_mutex_unlock (ptx_image_lock); /* The mkoffload utility emits a table of pointers/integers at the start of each offload image: @@ -1652,8 +1647,21 @@ GOMP_OFFLOAD_load_image (int ord, void *target_data, *target_table = GOMP_PLUGIN_malloc (sizeof (struct addr_pair) * (fn_entries + var_entries)); - targ_fns = 
GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor) - * fn_entries); + if (fn_entries 0) +targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor) + * fn_entries); + else +targ_fns = NULL; + + pthread_mutex_lock (ptx_image_lock); + new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data)); + new_image-ord = ord; + new_image-target_data = target_data; + new_image-module = module; + new_image-fn_descs = targ_fns; + new_image-next = ptx_images; + ptx_images = new_image; + pthread_mutex_unlock (ptx_image_lock); for (i = 0; i fn_entries; i++) { @@ -1687,23 +1695,22 @@ GOMP_OFFLOAD_load_image (int ord, void *target_data, } void -GOMP_OFFLOAD_unload_image (int tid __attribute__((unused)), void *target_data) +GOMP_OFFLOAD_unload_image (int ord, void *target_data) { - void **img_header = (void **) target_data; - struct targ_fn_descriptor *targ_fns -= (struct targ_fn_descriptor *) img_header[0]; struct ptx_image_data *image, *prev = NULL, *newhd = NULL; - free (targ_fns
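The double free described at the top of this message, and the shape of the fix, can be illustrated with a small stand-alone sketch. It is not the libgomp implementation: `struct mapping`, its fields and `free_all_mappings` are hypothetical, and for brevity a single shared allocation stands in for the two blocks an offload image actually uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mapping record.  Mappings created for an offload image all
   point into one shared allocation; ordinary mappings own their block.  */
struct mapping
{
  void *block;
  bool from_image;
  bool live;
};

/* Release every mapping in MAP[0..N).  IMAGE_BLOCK is the allocation that
   backs all image mappings.  Returns the number of free() calls made.  */
static int
free_all_mappings (struct mapping *map, size_t n, void *image_block)
{
  int frees = 0;

  /* Step 1: drop the image mappings, then free their shared block once.
     Freeing map[i].block for each of them would be the double free.  */
  for (size_t i = 0; i < n; i++)
    if (map[i].live && map[i].from_image)
      map[i].live = false;
  if (image_block)
    {
      free (image_block);
      frees++;
    }

  /* Step 2: the remaining mappings each own their block.  */
  for (size_t i = 0; i < n; i++)
    if (map[i].live)
      {
        free (map[i].block);
        map[i].live = false;
        frees++;
      }
  return frees;
}
```

With two image mappings and one ordinary mapping, this performs two frees rather than three, which mirrors the ordering in the patch: image mappings are removed first, then the rest of the memory map.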
Re: [PATCH] Tidy up locking for libgomp OpenACC entry points
On Thu, 23 Apr 2015 18:41:34 +0200 Thomas Schwinge tho...@codesourcery.com wrote: Hi! On Wed, 22 Apr 2015 19:42:43 +0100, Julian Brown jul...@codesourcery.com wrote: This patch is an attempt to fix some potential race conditions with accesses to shared data structures from multiple concurrent threads in libgomp's OpenACC entry points. The main change is to move locking out of lookup_host and lookup_dev in oacc-mem.c and into their callers (which can then hold the locks for the whole operation that they are performing). Yeah, that makes sense I guess. Also missing locking has been added for gomp_acc_insert_pointer. Tests look OK (with offloading to NVidia PTX). How did you test to get some confidence in the locking being sufficient? Merely by running the existing tests and via inspection, sadly. I'm not sure how much value we'd get from implementing an exhaustive threading testsuite at this stage: I guess testcases will be easier to come by in the future if/when people start to use e.g. OpenMP and OpenACC together. Going further (separate patch?), a few more comments: Is it OK that oacc-init.c:cached_base_dev is accessed without locking? Generally, we have to keep in mind that the same device may be accessed in parallel through both OpenACC and OpenMP interfaces. For this, for example, in oacc-init.c, even though acc_device_lock is held, is it OK to call gomp_init_device(D) without D-lock being locked? (Compare to target.c code.) Please document what exactly oacc-init.c:acc_device_lock is to guard. I'm not sure I'm understanding this correctly. I've attached a follow-on patch that documents the purpose of acc_device_lock -- and also fixes some places that should have been holding the lock, but were not. I've also added locking (with dev-lock) when calling gomp_init_device and gomp_fini_device from the OpenACC initialisation/finalisation code. Should oacc-init.c:acc_shutdown_1 release goacc_thread_lock before any gomp_fatal calls? 
(That seems to be the general policy in libgomp.) I added this to the first patch. --- a/libgomp/oacc-mem.c +++ b/libgomp/oacc-mem.c @@ -120,25 +116,32 @@ acc_free (void *d) { splay_tree_key k; struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr-dev; if (!d) return; assert (thr thr-dev); + gomp_mutex_lock (acc_dev-lock); + /* We don't have to call lazy open here, as the ptr value must have been returned by acc_malloc. It's not permitted to pass NULL in (unless you got that null from acc_malloc). */ - if ((k = lookup_dev (thr-dev-openacc.data_environ, d, 1))) - { - void *offset; + if ((k = lookup_dev (acc_dev-openacc.data_environ, d, 1))) +{ + void *offset; + + offset = d - k-tgt-tgt_start + k-tgt_offset; - offset = d - k-tgt-tgt_start + k-tgt_offset; + gomp_mutex_unlock (acc_dev-lock); - acc_unmap_data ((void *)(k-host_start + offset)); - } + acc_unmap_data ((void *)(k-host_start + offset)); +} + else +gomp_mutex_unlock (acc_dev-lock); Does it make sense to make the unlock unconditional, and move the acc_unmap_data after it, guarded by »if (k)«? I've left this one -- just a stylistic tweak, but I think it's fine as-is. - thr-dev-free_func (thr-dev-target_id, d); + acc_dev-free_func (acc_dev-target_id, d); } void @@ -178,16 +181,24 @@ acc_deviceptr (void *h) goacc_lazy_initialize (); struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *dev = thr-dev; + + gomp_mutex_lock (dev-lock); - n = lookup_host (thr-dev, h, 1); + n = lookup_host (dev, h, 1); if (!n) -return NULL; +{ + gomp_mutex_unlock (dev-lock); + return NULL; +} offset = h - n-host_start; d = n-tgt-tgt_start + n-tgt_offset + offset; + gomp_mutex_unlock (dev-lock); + return d; } Do we need to retain the lock while working with n? If not, the unlock could be placed right after the lookup_host, unconditionally. I'm confused -- it's commonly being done (retained) in target.c code, but not in the tgt_fn lookup in target.c:GOMP_target. 
I think the difference can be explained as follows: a given mapping (splay_key_tree_s) is essentially immutable after it is created (apart from the refcounts). Thus it can be safely accessed *so long as we know it will not be deallocated*. Now, in some parts of target.c, we have an active target_mem_desc, corresponding to a set of host-target mappings. So long as we are holding that target_mem_desc (e.g. as we are in GOMP_target_data), we know that none of the associated mappings' refcounts will fall to zero: so, we can access them (read only) safely without explicitly holding the lock. But, that's *not* the case for e.g. acc_deviceptr: that can be called at any point, in particular
[PATCH] Tidy up locking for libgomp OpenACC entry points
Hi,

This patch is an attempt to fix some potential race conditions with accesses to shared data structures from multiple concurrent threads in libgomp's OpenACC entry points. The main change is to move locking out of lookup_host and lookup_dev in oacc-mem.c and into their callers (which can then hold the locks for the whole operation that they are performing). Also missing locking has been added for gomp_acc_insert_pointer.

Tests look OK (with offloading to NVidia PTX). OK? (For the gomp4 branch, maybe, if trunk's not suitable at the moment?)

Thanks,

Julian

ChangeLog

    libgomp/
    * oacc-mem.c (lookup_host): Remove locking from function. Note
    locking requirement for caller in function comment.
    (lookup_dev): Likewise.
    (acc_free, acc_deviceptr, acc_hostptr, acc_is_present)
    (acc_map_data, acc_unmap_data, present_create_copy, delete_copyout)
    (update_dev_host, gomp_acc_insert_pointer, gomp_acc_remove_pointer):
    Add locking.

commit 983e08e46be24380a52095851cd9c6eb481eb47c
Author: Julian Brown jul...@codesourcery.com
Date: Tue Apr 21 12:42:17 2015 -0700

    More locking in oacc-mem.c

diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 89ef5fc..d53af4b 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -35,7 +35,8 @@
 #include <stdint.h>
 #include <assert.h>

-/* Return block containing [H->S), or NULL if not contained.  */
+/* Return block containing [H->S), or NULL if not contained.  The device lock
+   for DEV must be locked on entry, and remains locked on exit.  */

 static splay_tree_key
 lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
@@ -46,9 +47,7 @@ lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
   node.host_start = (uintptr_t) h;
   node.host_end = (uintptr_t) h + s;

-  gomp_mutex_lock (&dev->lock);
   key = splay_tree_lookup (&dev->mem_map, &node);
-  gomp_mutex_unlock (&dev->lock);

   return key;
 }
@@ -56,7 +55,8 @@ lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
 /* Return block containing [D->S), or NULL if not contained.
The list isn't ordered by device address, so we have to iterate over the whole array. This is not expected to be a common - operation. */ + operation. The device lock associated with TGT must be locked on entry, and + remains locked on exit. */ static splay_tree_key lookup_dev (struct target_mem_desc *tgt, void *d, size_t s) @@ -67,16 +67,12 @@ lookup_dev (struct target_mem_desc *tgt, void *d, size_t s) if (!tgt) return NULL; - gomp_mutex_lock (tgt-device_descr-lock); - for (t = tgt; t != NULL; t = t-prev) { if (t-tgt_start = (uintptr_t) d t-tgt_end = (uintptr_t) d + s) break; } - gomp_mutex_unlock (tgt-device_descr-lock); - if (!t) return NULL; @@ -120,25 +116,32 @@ acc_free (void *d) { splay_tree_key k; struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr-dev; if (!d) return; assert (thr thr-dev); + gomp_mutex_lock (acc_dev-lock); + /* We don't have to call lazy open here, as the ptr value must have been returned by acc_malloc. It's not permitted to pass NULL in (unless you got that null from acc_malloc). 
*/ - if ((k = lookup_dev (thr-dev-openacc.data_environ, d, 1))) - { - void *offset; + if ((k = lookup_dev (acc_dev-openacc.data_environ, d, 1))) +{ + void *offset; + + offset = d - k-tgt-tgt_start + k-tgt_offset; - offset = d - k-tgt-tgt_start + k-tgt_offset; + gomp_mutex_unlock (acc_dev-lock); - acc_unmap_data ((void *)(k-host_start + offset)); - } + acc_unmap_data ((void *)(k-host_start + offset)); +} + else +gomp_mutex_unlock (acc_dev-lock); - thr-dev-free_func (thr-dev-target_id, d); + acc_dev-free_func (acc_dev-target_id, d); } void @@ -178,16 +181,24 @@ acc_deviceptr (void *h) goacc_lazy_initialize (); struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *dev = thr-dev; + + gomp_mutex_lock (dev-lock); - n = lookup_host (thr-dev, h, 1); + n = lookup_host (dev, h, 1); if (!n) -return NULL; +{ + gomp_mutex_unlock (dev-lock); + return NULL; +} offset = h - n-host_start; d = n-tgt-tgt_start + n-tgt_offset + offset; + gomp_mutex_unlock (dev-lock); + return d; } @@ -204,16 +215,24 @@ acc_hostptr (void *d) goacc_lazy_initialize (); struct goacc_thread *thr = goacc_thread (); + struct gomp_device_descr *acc_dev = thr-dev; - n = lookup_dev (thr-dev-openacc.data_environ, d, 1); + gomp_mutex_lock (acc_dev-lock); + + n = lookup_dev (acc_dev-openacc.data_environ, d, 1); if (!n) -return NULL; +{ + gomp_mutex_unlock (acc_dev-lock); + return NULL; +} offset = d - n-tgt-tgt_start + n-tgt_offset; h = n-host_start + offset; + gomp_mutex_unlock (acc_dev-lock); + return h; } @@ -232,6 +251,8 @@ acc_is_present (void *h, size_t s) struct goacc_thread *thr
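The main change in this patch, moving the lock out of the lookup helper and into its callers, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the oacc-mem.c code: `struct mapping`, `struct device` and `device_address` are hypothetical, a linear scan replaces the splay tree, and a pthread mutex replaces libgomp's own mutex type.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much-simplified stand-ins for libgomp's device and
   mapping structures.  */
struct mapping { uintptr_t host_start, host_end, dev_start; };
struct device
{
  pthread_mutex_t lock;
  struct mapping *maps;
  size_t nmaps;
};

/* Return the mapping containing [H, H+S), or NULL.  DEV->lock must be
   held on entry, and remains held on exit.  */
static struct mapping *
lookup_host (struct device *dev, uintptr_t h, size_t s)
{
  for (size_t i = 0; i < dev->nmaps; i++)
    if (dev->maps[i].host_start <= h && h + s <= dev->maps[i].host_end)
      return &dev->maps[i];
  return NULL;
}

/* Translate host address H to a device address, or 0 if unmapped.  The
   lock is held across both the lookup and the reads of *M, so the mapping
   cannot be removed by another thread in between.  */
static uintptr_t
device_address (struct device *dev, uintptr_t h)
{
  uintptr_t d = 0;

  pthread_mutex_lock (&dev->lock);
  struct mapping *m = lookup_host (dev, h, 1);
  if (m)
    d = m->dev_start + (h - m->host_start);
  pthread_mutex_unlock (&dev->lock);
  return d;
}
```

Had `lookup_host` taken and dropped the lock itself, the pointer it returned could already be stale by the time the caller dereferenced it; holding the lock in the caller closes that window.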
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch) (PR65742)
On Tue, 14 Apr 2015 15:15:02 +0100 Julian Brown jul...@codesourcery.com wrote: On Wed, 8 Apr 2015 17:58:56 +0300 Ilya Verbin iver...@gmail.com wrote: On Wed, Apr 08, 2015 at 15:31:42 +0100, Julian Brown wrote: This version is mostly the same as the last posted version but has a tweak in GOACC_parallel to account for the new splay tree arrangement for target functions: - tgt_fn = (void (*)) tgt_fn_key-tgt-tgt_start; + tgt_fn = (void (*)) tgt_fn_key-tgt_offset; Have there been any other changes I might have missed? No. It passes libgomp testing on NVPTX. OK? Have you tested it with disabled offloading? I see several regressions: FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test I think there may be multiple issues here. The attached patch addresses one -- acc_device_type not distinguishing between offloaded and host code with the host_nonshm plugin. The patch appears to fix the original issue after all: I've re-run tests with host==target and the failures no longer appear. Also the same has been noted by Dominique d'Humieres in PR65742. The other problem is that it appears that the ACC_DEVICE_TYPE environment variable is not getting set properly on the target for (any of) the OpenACC tests: this means a lot of the time the wrong plugin is being tested, and means that the above tests (and several others) still fail. That will apparently need some more engineering (on our part). Fixing this turns out to require more DejaGNU-fu than I have: AFAICT, setting a per-test environment variable from an .exp file can't easily be done at present. The potentially useful-looking {dg-}set-target-env-var doesn't look quite suitable for this purpose, and besides which doesn't actually seem to be implemented for host != target anyway. 
(At least, if this fragment of gcc-dg.exp is anything to go by:

    if { [info exists set_target_env_var] \
         && [llength $set_target_env_var] != 0 } {
        if { [is_remote target] } {
            return [list "unsupported" ""]
        }
        ...

).

So: OK for trunk?

Thanks,

Julian

ChangeLog

    libgomp/
    * oacc-init.c (acc_on_device): Check whether we're in an offloaded
    region for host_nonshm plugin.
    * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_parallel): Set
    nonshm_exec flag in thread-local data.
    (GOMP_OFFLOAD_openacc_create_thread_data): Allocate thread-local
    data for host_nonshm plugin.
    (GOMP_OFFLOAD_openacc_destroy_thread_data): Free thread-local data
    for host_nonshm plugin.
    * plugin/plugin-host.h: New.
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
On Wed, 8 Apr 2015 17:58:56 +0300 Ilya Verbin iver...@gmail.com wrote: On Wed, Apr 08, 2015 at 15:31:42 +0100, Julian Brown wrote: This version is mostly the same as the last posted version but has a tweak in GOACC_parallel to account for the new splay tree arrangement for target functions: - tgt_fn = (void (*)) tgt_fn_key-tgt-tgt_start; + tgt_fn = (void (*)) tgt_fn_key-tgt_offset; Have there been any other changes I might have missed? No. It passes libgomp testing on NVPTX. OK? Have you tested it with disabled offloading? I see several regressions: FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test I think there may be multiple issues here. The attached patch addresses one -- acc_device_type not distinguishing between offloaded and host code with the host_nonshm plugin. The other problem is that it appears that the ACC_DEVICE_TYPE environment variable is not getting set properly on the target for (any of) the OpenACC tests: this means a lot of the time the wrong plugin is being tested, and means that the above tests (and several others) still fail. That will apparently need some more engineering (on our part). (Not asking for review just yet, JFYI.) Julian ChangeLog libgomp/ * oacc-init.c (acc_on_device): Check whether we're in an offloaded region for host_nonshm plugin. * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_parallel): Set nonshm_exec flag in thread-local data. (GOMP_OFFLOAD_openacc_create_thread_data): Allocate thread-local data for host_nonshm plugin. (+GOMP_OFFLOAD_openacc_destroy_thread_data): Free thread-local data for host_nonshm plugin. 
* plugin/plugin-host.h: New.

Index: libgomp/oacc-init.c
===================================================================
--- libgomp/oacc-init.c (revision 221922)
+++ libgomp/oacc-init.c (working copy)
@@ -29,6 +29,7 @@
 #include "libgomp.h"
 #include "oacc-int.h"
 #include "openacc.h"
+#include "plugin/plugin-host.h"
 #include <assert.h>
 #include <stdlib.h>
 #include <strings.h>
@@ -548,7 +549,14 @@ ialias (acc_set_device_num)
 int
 acc_on_device (acc_device_t dev)
 {
-  if (acc_get_device_type () == acc_device_host_nonshm)
+  struct goacc_thread *thr = goacc_thread ();
+
+  /* We only want to appear to be the host_nonshm plugin from offloaded
+     code -- i.e. within a parallel region.  Test a flag set by the
+     openacc_parallel hook of the host_nonshm plugin to determine that.  */
+  if (acc_get_device_type () == acc_device_host_nonshm
+      && thr && thr->target_tls
+      && ((struct nonshm_thread *)thr->target_tls)->nonshm_exec)
     return dev == acc_device_host_nonshm || dev == acc_device_not_host;
 
   /* Just rely on the compiler builtin.  */
Index: libgomp/plugin/plugin-host.c
===================================================================
--- libgomp/plugin/plugin-host.c (revision 221922)
+++ libgomp/plugin/plugin-host.c (working copy)
@@ -44,6 +44,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
+#include <stdbool.h>
 
 #ifdef HOST_NONSHM_PLUGIN
 #define STATIC
@@ -55,6 +56,10 @@
 #define SELF "host: "
 #endif
 
+#ifdef HOST_NONSHM_PLUGIN
+#include "plugin-host.h"
+#endif
+
 STATIC const char *
 GOMP_OFFLOAD_get_name (void)
 {
@@ -174,7 +179,10 @@ GOMP_OFFLOAD_openacc_parallel (void (*fn
                                void *targ_mem_desc __attribute__ ((unused)))
 {
 #ifdef HOST_NONSHM_PLUGIN
+  struct nonshm_thread *thd = GOMP_PLUGIN_acc_thread ();
+  thd->nonshm_exec = true;
   fn (devaddrs);
+  thd->nonshm_exec = false;
 #else
   fn (hostaddrs);
 #endif
@@ -232,11 +240,20 @@
 STATIC void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord __attribute__ ((unused)))
 {
+#ifdef HOST_NONSHM_PLUGIN
+  struct nonshm_thread *thd
+    = GOMP_PLUGIN_malloc (sizeof (struct nonshm_thread));
+  thd->nonshm_exec = false;
+  return thd;
+#else
   return NULL;
+#endif
 }
 
 STATIC void
-GOMP_OFFLOAD_openacc_destroy_thread_data (void *tls_data
-                                          __attribute__ ((unused)))
+GOMP_OFFLOAD_openacc_destroy_thread_data (void *tls_data)
 {
+#ifdef HOST_NONSHM_PLUGIN
+  free (tls_data);
+#endif
 }
Index: libgomp/plugin/plugin-host.h
===================================================================
--- libgomp/plugin/plugin-host.h (revision 0)
+++ libgomp/plugin/plugin-host.h (revision 0)
@@ -0,0 +1,37 @@
+/* OpenACC Runtime Library: acc_device_host, acc_device_host_nonshm.
+
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   Contributed by Mentor Embedded.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version
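For readers following the host_nonshm fix, here is a minimal sketch of the idea behind the patch: a per-thread flag that is true only while an offloaded function runs, so that an acc_on_device-style query can tell host code from offloaded code. This is not libgomp code. Every name prefixed sketch_ is hypothetical, and a C11 _Thread_local variable stands in for the thread data that the real patch allocates in the create_thread_data hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the plugin's per-thread data.  */
struct sketch_nonshm_thread
{
  bool nonshm_exec;  /* True only while an offloaded function runs.  */
};

/* One instance per host thread (assumption: the real patch allocates
   this with GOMP_PLUGIN_malloc instead of using _Thread_local).  */
static _Thread_local struct sketch_nonshm_thread sketch_tls = { false };

/* Mimics the acc_on_device test: claim to be "not host" only from
   inside an offloaded region.  */
static bool
sketch_on_nonshm_device (void)
{
  return sketch_tls.nonshm_exec;
}

/* Mimics the openacc_parallel hook: raise the flag around the call.  */
static void
sketch_parallel (void (*fn) (void *), void *data)
{
  assert (fn != NULL);
  sketch_tls.nonshm_exec = true;
  fn (data);
  sketch_tls.nonshm_exec = false;
}

/* Example offloaded function: records what the query returned.  */
static void
sketch_record (void *data)
{
  *(bool *) data = sketch_on_nonshm_device ();
}
```

Calling sketch_parallel (sketch_record, &flag) stores true in flag, while the same query made before or after the call returns false. That is the distinction the failing acc_on_device-1.c test appears to depend on.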
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
On Tue, 7 Apr 2015 17:26:45 +0200 Jakub Jelinek ja...@redhat.com wrote: On Mon, Apr 06, 2015 at 03:45:57PM +0300, Ilya Verbin wrote: On Wed, Apr 01, 2015 at 15:20:25 +0200, Jakub Jelinek wrote: LGTM with proper ChangeLog entry. I've commited this patch into trunk. Julian, you probably want to update the nvptx plugin. Note that as the number of P1s without posted fixes is now zero, it is likely RC1 will be done this week, so if you want nvptx working in GCC 5, please post a fix as soon as possible. This version is mostly the same as the last posted version but has a tweak in GOACC_parallel to account for the new splay tree arrangement for target functions: - tgt_fn = (void (*)) tgt_fn_key-tgt-tgt_start; + tgt_fn = (void (*)) tgt_fn_key-tgt_offset; Have there been any other changes I might have missed? It passes libgomp testing on NVPTX. OK? Thanks, Juliancommit ac06b5e25e170061bb9855b9ea4b8e5696816bf1 Author: Julian Brown jul...@codesourcery.com Date: Tue Apr 7 09:23:58 2015 -0700 NVPTX load/unload and init-rework patch. diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c index 02c44b6..dbc68bc 100644 --- a/gcc/config/nvptx/mkoffload.c +++ b/gcc/config/nvptx/mkoffload.c @@ -839,6 +839,7 @@ process (FILE *in, FILE *out) { const char *input = read_file (in); Token *tok = tokenize (input); + unsigned int nvars = 0, nfuncs = 0; do tok = parse_file (tok); @@ -850,16 +851,17 @@ process (FILE *in, FILE *out) write_stmts (out, rev_stmts (fns)); fprintf (out, ;\n\n); fprintf (out, static const char *var_mappings[] = {\n); - for (id_map *id = var_ids; id; id = id-next) + for (id_map *id = var_ids; id; id = id-next, nvars++) fprintf (out, \t\%s\%s\n, id-ptx_name, id-next ? , : ); fprintf (out, };\n\n); fprintf (out, static const char *func_mappings[] = {\n); - for (id_map *id = func_ids; id; id = id-next) + for (id_map *id = func_ids; id; id = id-next, nfuncs++) fprintf (out, \t\%s\%s\n, id-ptx_name, id-next ? 
, : ); fprintf (out, };\n\n); fprintf (out, static const void *target_data[] = {\n); - fprintf (out, ptx_code, var_mappings, func_mappings\n); + fprintf (out, ptx_code, (void*) %u, var_mappings, (void*) %u, + func_mappings\n, nvars, nfuncs); fprintf (out, };\n\n); fprintf (out, extern void GOMP_offload_register (const void *, int, void *);\n); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a1d42c5..5272f01 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -655,9 +655,6 @@ struct target_mem_desc { /* Corresponding target device descriptor. */ struct gomp_device_descr *device_descr; - /* Memory mapping info for the thread that created this descriptor. */ - struct splay_tree_s *mem_map; - /* List of splay keys to remove (or decrease refcount) at the end of region. */ splay_tree_key list[]; @@ -691,18 +688,6 @@ typedef struct acc_dispatch_t /* This is guarded by the lock in the outer struct gomp_device_descr. */ struct target_mem_desc *data_environ; - /* Extra information required for a device instance by a given target. */ - /* This is guarded by the lock in the outer struct gomp_device_descr. */ - void *target_data; - - /* Open or close a device instance. */ - void *(*open_device_func) (int n); - int (*close_device_func) (void *h); - - /* Set or get the device number. */ - int (*get_device_num_func) (void); - void (*set_device_num_func) (int); - /* Execute. */ void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *, unsigned short *, int, int, int, int, void *); @@ -720,7 +705,7 @@ typedef struct acc_dispatch_t void (*async_set_async_func) (int); /* Create/destroy TLS data. */ - void *(*create_thread_data_func) (void *); + void *(*create_thread_data_func) (int); void (*destroy_thread_data_func) (void *); /* NVIDIA target specific routines. 
*/ diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c index 08b7c5e..1f5827e 100644 --- a/libgomp/oacc-async.c +++ b/libgomp/oacc-async.c @@ -26,7 +26,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see http://www.gnu.org/licenses/. */ - +#include assert.h #include openacc.h #include libgomp.h #include oacc-int.h @@ -37,13 +37,23 @@ acc_async_test (int async) if (async acc_async_sync) gomp_fatal (invalid async argument: %d, async); - return base_dev-openacc.async_test_func (async); + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr-dev) +gomp_fatal (no device active); + + return thr-dev-openacc.async_test_func (async); } int acc_async_test_all (void) { - return base_dev-openacc.async_test_all_func (); + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr-dev) +gomp_fatal (no device active); + + return thr-dev-openacc.async_test_all_func (); } void @@ -52,19 +62,34 @@ acc_wait (int async) if (async acc_async_sync
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
On Wed, 8 Apr 2015 17:58:56 +0300 Ilya Verbin iver...@gmail.com wrote:

Have you tested it with disabled offloading? I see several regressions:

FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/acc_on_device-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test

No -- thanks for the note. I've committed the patch now, but I'll try to get to looking at these in the next day or two (it's probably something relatively minor, I guess).

Julian
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
On Mon, 30 Mar 2015 18:42:02 +0200 Jakub Jelinek ja...@redhat.com wrote: On Thu, Mar 26, 2015 at 11:41:30PM +0300, Ilya Verbin wrote: Here is the latest patch for libgomp and mic plugin. make check-target-libgomp using intelmic emul passed. Also I used a testcase from the attachment. This applies cleanly. Latest ptx part is here, I guess: https://gcc.gnu.org/ml/gcc-patches/2015-02/msg01407.html But the one Julian posted doesn't apply on top of your patch. If there is any interdiff needed on top of your patch, can it be posted against trunk + your patch? Here's a version of my patch against trunk and Ilya's latest patch (hopefully!). Tests look OK (libgomp + PTX). HTH, Juliancommit f203634ace786b5bb2fdce56f123f3fba236dda3 Author: Julian Brown jul...@codesourcery.com Date: Mon Mar 30 14:37:53 2015 -0700 nvptx load/unload support, init rework diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c index 02c44b6..dbc68bc 100644 --- a/gcc/config/nvptx/mkoffload.c +++ b/gcc/config/nvptx/mkoffload.c @@ -839,6 +839,7 @@ process (FILE *in, FILE *out) { const char *input = read_file (in); Token *tok = tokenize (input); + unsigned int nvars = 0, nfuncs = 0; do tok = parse_file (tok); @@ -850,16 +851,17 @@ process (FILE *in, FILE *out) write_stmts (out, rev_stmts (fns)); fprintf (out, ;\n\n); fprintf (out, static const char *var_mappings[] = {\n); - for (id_map *id = var_ids; id; id = id-next) + for (id_map *id = var_ids; id; id = id-next, nvars++) fprintf (out, \t\%s\%s\n, id-ptx_name, id-next ? , : ); fprintf (out, };\n\n); fprintf (out, static const char *func_mappings[] = {\n); - for (id_map *id = func_ids; id; id = id-next) + for (id_map *id = func_ids; id; id = id-next, nfuncs++) fprintf (out, \t\%s\%s\n, id-ptx_name, id-next ? 
, : ); fprintf (out, };\n\n); fprintf (out, static const void *target_data[] = {\n); - fprintf (out, ptx_code, var_mappings, func_mappings\n); + fprintf (out, ptx_code, (void*) %u, var_mappings, (void*) %u, + func_mappings\n, nvars, nfuncs); fprintf (out, };\n\n); fprintf (out, extern void GOMP_offload_register (const void *, int, void *);\n); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a1d42c5..5272f01 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -655,9 +655,6 @@ struct target_mem_desc { /* Corresponding target device descriptor. */ struct gomp_device_descr *device_descr; - /* Memory mapping info for the thread that created this descriptor. */ - struct splay_tree_s *mem_map; - /* List of splay keys to remove (or decrease refcount) at the end of region. */ splay_tree_key list[]; @@ -691,18 +688,6 @@ typedef struct acc_dispatch_t /* This is guarded by the lock in the outer struct gomp_device_descr. */ struct target_mem_desc *data_environ; - /* Extra information required for a device instance by a given target. */ - /* This is guarded by the lock in the outer struct gomp_device_descr. */ - void *target_data; - - /* Open or close a device instance. */ - void *(*open_device_func) (int n); - int (*close_device_func) (void *h); - - /* Set or get the device number. */ - int (*get_device_num_func) (void); - void (*set_device_num_func) (int); - /* Execute. */ void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *, unsigned short *, int, int, int, int, void *); @@ -720,7 +705,7 @@ typedef struct acc_dispatch_t void (*async_set_async_func) (int); /* Create/destroy TLS data. */ - void *(*create_thread_data_func) (void *); + void *(*create_thread_data_func) (int); void (*destroy_thread_data_func) (void *); /* NVIDIA target specific routines. 
*/ diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c index 08b7c5e..1f5827e 100644 --- a/libgomp/oacc-async.c +++ b/libgomp/oacc-async.c @@ -26,7 +26,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see http://www.gnu.org/licenses/. */ - +#include assert.h #include openacc.h #include libgomp.h #include oacc-int.h @@ -37,13 +37,23 @@ acc_async_test (int async) if (async acc_async_sync) gomp_fatal (invalid async argument: %d, async); - return base_dev-openacc.async_test_func (async); + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr-dev) +gomp_fatal (no device active); + + return thr-dev-openacc.async_test_func (async); } int acc_async_test_all (void) { - return base_dev-openacc.async_test_all_func (); + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr-dev) +gomp_fatal (no device active); + + return thr-dev-openacc.async_test_all_func (); } void @@ -52,19 +62,34 @@ acc_wait (int async) if (async acc_async_sync) gomp_fatal (invalid async argument: %d, async); - base_dev-openacc.async_wait_func (async); + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr-dev) +gomp_fatal (no device
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
, present_create_copy) (delete_copyout, update_dev_host, gomp_acc_remove_pointer): Tweak lookup_host calls. * oacc-parallel.c (select_acc_device): Remove. Replace calls with goacc_lazy_initialize (throughout). (GOACC_parallel): Use lock and splay tree from gomp_device_descr not gomp_memory_mapping. * target.c (gomp_map_vars, gomp_copy_from_async, gomp_unmap_vars) (gomp_splay_tree_insert_mapping, GOMP_offload_unregister) (GOMP_target): Use splay tree and lock directly in gomp_device_descr, not gomp_memory_mapping. (gomp_update): Remove mm argument. Use splay tree and lock directly in gomp_device_descr. (gomp_free_memmap): Change argument to struct splay_tree_s. (gomp_load_plugin_for_device): Don't initialise openacc open_device, close_device, get_device_num or set_device_num hooks. Don't initialise target_data or deleted mem_map is_initialized, splay_tree.root fields. * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_open_device) (GOMP_OFFLOAD_openacc_close_device) (GOMP_OFFLOAD_openacc_get_device_num) (GOMP_OFFLOAD_openacc_set_device_num): Remove. (GOMP_OFFLOAD_openacc_create_thread_data): Change (unused) argument to int. * plugin/plugin-nvptx.c (pthread.h): Include. (ptx_inited): Remove. (instantiated_devices, ptx_dev_lock): New. (struct ptx_image_data): New. (ptx_devices, ptx_images, ptx_image_lock): New. (fini_streams_for_device): Reorder cuStreamDestroy call. (nvptx_get_num_devices): Remove forward declaration. (nvptx_init): Change return type to bool. (nvptx_fini): Remove. (nvptx_attach_host_thread_to_device): New. (nvptx_open_device): Return struct ptx_device* instead of void*. (nvptx_close_device): Change argument type to struct ptx_device*, return type to void. (nvptx_get_num_devices): Use instantiated_devices not ptx_inited. (kernel_target_data, kernel_host_table): Remove static globals. (GOMP_OFFLOAD_register_image, GOMP_OFFLOAD_get_table): Remove. (GOMP_OFFLOAD_init_device): Reimplement. (GOMP_OFFLOAD_fini_device): Likewise. 
(GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_unload_image): New. (GOMP_OFFLOAD_alloc, GOMP_OFFLOAD_free, GOMP_OFFLOAD_dev2host) (GOMP_OFFLOAD_host2dev): Use ORD argument. (GOMP_OFFLOAD_openacc_open_device) (GOMP_OFFLOAD_openacc_close_device) (GOMP_OFFLOAD_openacc_set_device_num) (GOMP_OFFLOAD_openacc_get_device_num): Remove. (GOMP_OFFLOAD_openacc_create_thread_data): Change argument to int (device number). libgomp/testsuite/ * libgomp.oacc-c-c++-common/lib-9.c: Fix devnum check in test.commit 63091061f227f124d8d496fd3064982935178f3a Author: Julian Brown jul...@codesourcery.com Date: Mon Feb 23 11:55:41 2015 -0800 nvptx load/unload image support, init rework fix multi-device tests more load/unload patch cleanups misc fixes diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c index 02c44b6..dbc68bc 100644 --- a/gcc/config/nvptx/mkoffload.c +++ b/gcc/config/nvptx/mkoffload.c @@ -839,6 +839,7 @@ process (FILE *in, FILE *out) { const char *input = read_file (in); Token *tok = tokenize (input); + unsigned int nvars = 0, nfuncs = 0; do tok = parse_file (tok); @@ -850,16 +851,17 @@ process (FILE *in, FILE *out) write_stmts (out, rev_stmts (fns)); fprintf (out, ;\n\n); fprintf (out, static const char *var_mappings[] = {\n); - for (id_map *id = var_ids; id; id = id-next) + for (id_map *id = var_ids; id; id = id-next, nvars++) fprintf (out, \t\%s\%s\n, id-ptx_name, id-next ? , : ); fprintf (out, };\n\n); fprintf (out, static const char *func_mappings[] = {\n); - for (id_map *id = func_ids; id; id = id-next) + for (id_map *id = func_ids; id; id = id-next, nfuncs++) fprintf (out, \t\%s\%s\n, id-ptx_name, id-next ? 
, : ); fprintf (out, };\n\n); fprintf (out, static const void *target_data[] = {\n); - fprintf (out, ptx_code, var_mappings, func_mappings\n); + fprintf (out, ptx_code, (void*) %u, var_mappings, (void*) %u, + func_mappings\n, nvars, nfuncs); fprintf (out, };\n\n); fprintf (out, extern void GOMP_offload_register (const void *, int, void *);\n); diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 3fc9aa9..822d2fe 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -656,9 +656,6 @@ struct target_mem_desc { /* Corresponding target device descriptor. */ struct gomp_device_descr *device_descr; - /* Memory mapping info for the thread that created this descriptor. */ - struct gomp_memory_mapping *mem_map; - /* List of splay keys to remove (or decrease refcount) at the end of region. */ splay_tree_key list[]; @@ -683,20 +680,6 @@ struct splay_tree_key_s { #include splay-tree.h -/* Information about mapped memory regions (per device/context). */ - -struct gomp_memory_mapping -{ - /* Mutex for operating with the splay tree and other shared
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
On Fri, 6 Mar 2015 17:01:13 +0300 Ilya Verbin iver...@gmail.com wrote:

On Thu, Feb 26, 2015 at 20:25:11 +0300, Ilya Verbin wrote:

On Wed, Feb 25, 2015 at 10:36:08 +0100, Thomas Schwinge wrote:

Julian Brown jul...@codesourcery.com wrote:

This is a version of the previously-posted patch to rework initialisation and support the proposed load/unload hooks, merged to gomp4 branch and tested alongside the two patches (from

Currently the 'struct gomp_memory_mapping' contains 'lock' and 'is_initialized'. Do you still need them? Or can we use gomp_device_descr::lock and is_initialized instead? If yes, then we can replace the gomp_memory_mapping structure with a splay_tree, as it was before the OpenACC merge.

Ping?

Apologies, I've been distracted with travel and other things. I suspect, as you suggest, that the gomp_memory_mapping lock/is_initialized fields may no longer be required. I haven't yet had time to address that nor all of Thomas's comments on the patch (mostly breakage with multiple devices), and I'm unlikely to have time this week either due to vacation...

Thanks,

Julian
Re: libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)
On Wed, 25 Feb 2015 10:36:08 +0100 Thomas Schwinge tho...@codesourcery.com wrote:

Hi!

On Tue, 24 Feb 2015 11:29:51 +, Julian Brown jul...@codesourcery.com wrote:

Test results look OK, barring a suspected harness issue (lib-83 failing with a timeout for nvptx

However, I'm seeing a class of testsuite regressions: all variants of libgomp.oacc-fortran/lib-5.f90 and libgomp.oacc-fortran/lib-7.f90 FAIL: »libgomp: cuMemFreeHost error: invalid value«. I see these two test cases contain a lot of acc_get_num_devices and similar calls -- I've been testing this on our nvidiak20-2 system, which contains two Nvidia K20 cards, so maybe there's something wrong in that regard. (But why is this failing only for Fortran -- are we missing C/C++ tests in that area?) Can you have a look, or want me to?

I can have a look at that.

--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -850,16 +851,17 @@ process (FILE *in, FILE *out)
   fprintf (out, "static const void *target_data[] = {\n");
-  fprintf (out, "ptx_code, var_mappings, func_mappings\n");
+  fprintf (out, "ptx_code, (void*) %u, var_mappings, (void*) %u, "
+           "func_mappings\n", nvars, nfuncs);
   fprintf (out, "};\n\n");

I wondered if it's maybe more elegant to just separate those by NULL delimiters instead of the size integers casted to void * (spaces missing)? But then, that'd need double scanning in the consumer, libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_load_image, because we need to allocate an appropriately sized array, so maybe your more expressive approach is better indeed.

Yeah, I considered both: there's probably not much to choose between the approaches. They use the same amount of space.
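To make the trade-off discussed above concrete, the following sketch contrasts the two table layouts: a count stored in a pointer-sized slot ahead of the table (the approach in the patch) versus a NULL-terminated table that the consumer has to scan once before it can allocate anything. The names sketch_vars and sketch_target_data, and both helper functions, are hypothetical; the string standing in for the PTX code is a placeholder, not real mkoffload output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative data only: a NULL-terminated name table, and a
   target_data-like array that stores the count just before it.  */
static const char *const sketch_vars[] = { "var_a", "var_b", NULL };
static const void *const sketch_target_data[] = {
  "ptx code placeholder", (void *) 2, sketch_vars
};

/* Count-prefixed layout: the size is read directly from the slot
   that precedes the table, so no scan is needed.  */
static size_t
count_from_prefix (const void *const *slot)
{
  assert (slot != NULL);
  return (size_t) (uintptr_t) slot[0];
}

/* NULL-delimited layout: the consumer must walk the table once to
   learn its size before allocating an array for it.  */
static size_t
count_by_scanning (const char *const *table)
{
  size_t n = 0;
  while (table[n] != NULL)
    n++;
  return n;
}
```

Both layouts cost one extra pointer-sized slot per table (the count, or the terminator), which matches the remark that they use the same amount of space; the prefix form simply avoids the extra pass.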
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -34,44 +34,68 @@
 int
 acc_async_test (int async)
 {
+  struct goacc_thread *thr = goacc_thread ();
+
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  return base_dev->openacc.async_test_func (async);
+  assert (thr->dev);
+
+  return thr->dev->openacc.async_test_func (async);
 }

Here, and in several other places: is this code conforming to the OpenACC specification? Do we need to (lazily) initialize in all these places, or in goacc_thread, or gracefully fail (see below) if not initialized (basically in all places where you currently assert (thr->dev))?

#include <openacc.h>

int main(int argc, char *argv[])
{
  return acc_async_test(0);
}

[sigsegv]

Whether it conforms to the spec or not is a hard question to answer, because a lot of behaviour is left undefined. But here are two possibly-useful made-up guidelines:

1. Does the program work the same with OpenACC disabled?

2. Does some strange use of OpenACC functionality (including library calls, etc.) probably indicate user error?

Much of the lazy initialisation code is there so that (1) can be true -- i.e., a program can use OpenACC directives without making an explicit call to acc_init or other API-specific initialisation code. But this case is an explicit call to the OpenACC runtime library, so the program can't work without -fopenacc enabled, so we can follow guideline (2) instead. And in this case, it's meaningless to test for completion of async operation when no device is active. Of course though, this should be an actual error rather than a crash. But, I don't think we want to lazily-initialise here.

Also, I'm not sure what the expected outcome of this code sequence is:

acc_init(acc_device_nvidia);
acc_shutdown(acc_device_nvidia);
acc_async_test(0);

a.out: [...]/source-gcc/libgomp/oacc-async.c:42: acc_async_test: Assertion `thr->dev' failed.
Aborted (core dumped)

If the OpenACC specification can be read such that all this indeed is undefined behavior, then aborting/crashing is OK, of course.

Again, this would probably indicate user error in a real program, so it should raise a (real) error message.

--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -34,51 +34,53 @@
 void *
 acc_get_current_cuda_device (void)
 {
-  void *p = NULL;
+  struct goacc_thread *thr = goacc_thread ();
 
-  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
-    p = base_dev->openacc.cuda.get_current_device_func ();
+  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
+    return thr->dev->openacc.cuda.get_current_device_func ();
 
-  return p;
+  return NULL;
 }

Here, and in other places, it looks as if we'd fail gracefully.

Not sure about this (maybe it should be an error too?), but...

 int
 acc_set_cuda_stream (int async, void *stream)
 {
-  int s = -1;
+  struct goacc_thread *thr;
 
   if (async < 0 || stream == NULL)
     return 0;
 
   goacc_lazy_initialize ();
 
-  if (base_dev && base_dev->openacc.cuda.set_stream_func)
-    s = base_dev->openacc.cuda.set_stream_func
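As an illustration of the "real error rather than a crash" point made in this message, here is a small sketch of an explicit no-active-device check placed before the dispatch. It is a simplified model, not libgomp code: the sketch_ types and functions are hypothetical, and a status code is returned where the later revisions of the patch in this thread call gomp_fatal with the message "no device active".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down versions of the device and thread records.  */
struct sketch_device
{
  int (*async_test_func) (int);
};

struct sketch_thread
{
  struct sketch_device *dev;  /* NULL when no device is active.  */
};

enum sketch_status { SKETCH_OK, SKETCH_NO_DEVICE };

/* Report a missing thread record or device as an error instead of
   dereferencing a null pointer or tripping an assertion.  */
static enum sketch_status
sketch_async_test (const struct sketch_thread *thr, int async, int *result)
{
  assert (result != NULL);

  if (thr == NULL || thr->dev == NULL)
    return SKETCH_NO_DEVICE;

  *result = thr->dev->async_test_func (async);
  return SKETCH_OK;
}

/* Stub device hook for demonstration: every queue is "complete".  */
static int
sketch_always_done (int async)
{
  (void) async;
  return 1;
}
```

The same check covers both scenarios raised above: a call made before any initialisation (no thread record) and a call made after acc_shutdown (thread record present, device pointer cleared).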
Re: Merge current set of OpenACC changes from gomp-4_0-branch
Hi,

On Wed, 4 Feb 2015 15:05:45 + Julian Brown jul...@codesourcery.com wrote:

The major changes are:

* The removal of the OpenACC-specific plugin hooks open_device, close_device, set_device_num and get_device_num. The functionality has been moved into the init/fini hooks (for the first two) or moved into the target-independent OpenACC parts, respectively.

* The PTX mkoffload utility has been extended to support variables as well as function mapping, to fill out support for the load/unload image hooks. (Not really tested so far!)

* The plugin hooks that are shared between OpenMP and OpenACC now support the device number argument properly: that should help with (eventually) unifying the plugin interface for the two APIs. (With set_device_num and get_device_num removed, the plugin is stateless with respect to which device is currently active. The rest of the OpenACC hooks -- async functions, etc. -- should probably be changed to take a device number argument too, but that could be a follow-on patch.)

* The limitation of having only one type of device active simultaneously in the OpenACC runtime has (theoretically!) been removed.

This is a version of the previously-posted patch to rework initialisation and support the proposed load/unload hooks, merged to gomp4 branch and tested alongside the two patches (from https://gcc.gnu.org/wiki/Offloading#nvptx_Offloading):

http://news.gmane.org/find-root.php?message_id=%3C20150218100035.GF1746%40tucnak.redhat.com%3E
http://news.gmane.org/find-root.php?message_id=%3C546CF508.9010807%40codesourcery.com%3E

As well as Ilya Verbin's patch:

https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01605.html

Test results look OK, barring a suspected harness issue (lib-83 failing with a timeout for nvptx, though it works fine from the command line).

OK for gomp4 branch? I could commit Ilya's patch there too if so.

Thanks,

Julian

ChangeLog

gcc/

* config/nvptx/mkoffload.c (process): Support variable mapping.
libgomp/ * libgomp.h (acc_dispatch_t): Remove open_device_func, close_device_func, get_device_num_func, set_device_num_func, target_data members. Change create_thread_data_func argument to device number instead of generic pointer. * oacc-async.c (assert.h): Include. (acc_async_test, acc_async_test_all, acc_wait, acc_wait_async) (acc_wait_all, acc_wait_all_async): Use current host thread's active device, not base_dev. * oacc-cuda.c (acc_get_current_cuda_device) (acc_get_current_cuda_context, acc_get_cuda_stream) (acc_set_cuda_stream): Likewise. * oacc-host.c (host_dispatch): Don't set open_device_func, close_device_func, get_device_num_func or set_device_num_func. * oacc-init.c (base_dev, init_key): Remove. (cached_base_dev): New. (name_of_acc_device_t): New. (acc_init_1): Initialise default-numbered device, not zeroth. (acc_shutdown_1): Close all devices of a given type. (goacc_destroy_thread): Don't use base_dev. (lazy_open, lazy_init, lazy_init_and_open): Remove. (goacc_attach_host_thread_to_device): New. (acc_init): Reimplement with goacc_attach_host_thread_to_device. (acc_get_num_devices): Don't use base_dev. (acc_set_device_type): Reimplement. (acc_get_device_type): Don't use base_dev. (acc_get_device_num): Tweak logic. (acc_set_device_num): Likewise. (goacc_runtime_initialize): Initialize cached_base_dev not base_dev. (goacc_lazy_initialize): Reimplement with acc_init and goacc_attach_host_thread_to_device. * oacc-int.h (goacc_thread): Add base_dev field. (base_dev): Remove extern declaration. (goacc_attach_host_thread_to_device): Add prototype. * oacc-mem.c (acc_malloc): Use current thread's device instead of base_dev. (acc_free): Likewise. (acc_memcpy_to_device): Likewise. (acc_memcpy_from_device): Likewise. * oacc-parallel.c (select_acc_device): Remove. Replace calls with goacc_lazy_initialize (throughout). * target.c (gomp_load_plugin_for_device): Don't initialise openacc open_device, close_device, get_device_num or set_device_num hooks. 
Don't initialise target_data. * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_open_device) (GOMP_OFFLOAD_openacc_close_device) (GOMP_OFFLOAD_openacc_get_device_num) (GOMP_OFFLOAD_openacc_set_device_num): Remove. (GOMP_OFFLOAD_openacc_create_thread_data): Change (unused) argument to int. * plugin/plugin-nvptx.c (pthread.h): Include. (ptx_inited): Remove. (instantiated_devices, ptx_dev_lock): New. (struct ptx_image_data): New. (ptx_devices, ptx_images, ptx_image_lock): New. (nvptx_get_num_devices): Remove forward declaration. (nvptx_init): Change return type to bool. (nvptx_fini): Remove. (nvptx_attach_host_thread_to_device): New. (nvptx_open_device): Return struct ptx_device* instead of void*. (nvptx_close_device): Change argument
Re: Merge current set of OpenACC changes from gomp-4_0-branch
On Tue, 3 Feb 2015 23:01:04 +0300 Ilya Verbin iver...@gmail.com wrote:

On 03 Feb 13:00, Julian Brown wrote:

On Tue, 3 Feb 2015 14:28:44 +0300 Ilya Verbin iver...@gmail.com wrote:

On 27 Jan 14:07, Julian Brown wrote:

On Mon, 26 Jan 2015 17:34:26 +0300 Ilya Verbin iver...@gmail.com wrote:

Here is my current patch, it works for OpenMP-MIC, but obviously will not work for PTX, since it requires symmetrical changes in the plugin. Could you please take a look, whether it is possible to support this new interface in PTX plugin?

I think it can probably be made to work. I'll have a look in more detail.

Do you have any progress on this?

I'm still working on a patch to update OpenACC support and the PTX backend to use load/unload_image and to unify initialisation/opening. So far I think the answer is basically yes, the new interface can be supported, though I might request a minor tweak -- e.g. that load_image takes an extra void ** argument so that a libgomp backend can allocate a block of generic metadata relating to the image, then that same block would be passed (void *) to the unload hook so the backend can use it there and deallocate it when it's finished with. Would that be possible? (It'd mostly be for a CUmodule handle: this could be stashed away somewhere within the nvptx backend, but it might be neater to put it in generic code since it'll probably be useful for other backends anyway.)

An extra argument is not a problem, however I don't quite get the idea. PTX plugin allocates some data while loading, and needs this data while unloading? Then why not to create a hash table with image_ptr -> metadata mapping inside the plugin?

[...]

Right -- that's what I meant by "could be stashed away somewhere within the nvptx backend". I just thought that retaining a generic chunk of state for each (JIT-compiled, in this case) block of code might be something that would be useful for other targets too.
I've kept the required information (for now at least) within the nvptx backend as an associative list.

This (WIP) patch is based on top of a version of your patch that I merged to our internal branch: that's still the easiest way for me to test the PTX backend (with unloading support) at present, and it passes libgomp testing that way. Trunk should be fairly close, but I haven't tried applying it there yet.

The major changes are:

* The removal of the OpenACC-specific plugin hooks open_device, close_device, set_device_num and get_device_num. The functionality has been moved into the init/fini hooks (for the first two) or moved into the target-independent OpenACC parts, respectively.

* The PTX mkoffload utility has been extended to support variables as well as function mapping, to fill out support for the load/unload image hooks. (Not really tested so far!)

* The plugin hooks that are shared between OpenMP and OpenACC now support the device number argument properly: that should help with (eventually) unifying the plugin interface for the two APIs. (With set_device_num and get_device_num removed, the plugin is stateless with respect to which device is currently active. The rest of the OpenACC hooks -- async functions, etc. -- should probably be changed to take a device number argument too, but that could be a follow-on patch.)

* The limitation of having only one type of device active simultaneously in the OpenACC runtime has (theoretically!) been removed.

Thoughts?

Thanks,

Julian

ChangeLog

gcc/

* config/nvptx/mkoffload.c (process): Support variable mapping.

libgomp/

* libgomp.h (acc_dispatch_t): Remove open_device_func, close_device_func, get_device_num_func, set_device_num_func, target_data members. Change create_thread_data_func argument to device number instead of generic pointer.
* oacc-async.c (assert.h): Include.
(acc_async_test, acc_async_test_all, acc_wait, acc_wait_async) (acc_wait_all, acc_wait_all_async): Use current host thread's active device, not base_dev. * oacc-cuda.c (acc_get_current_cuda_device) (acc_get_current_cuda_context, acc_get_cuda_stream) (acc_set_cuda_stream): Likewise. * oacc-host.c (host_dispatch): Don't set open_device_func, close_device_func, get_device_num_func or set_device_num_func. * oacc-init.c (base_dev, init_key): Remove. (cached_base_dev): New. (name_of_acc_device_t): New. (acc_init_1): Initialise default-numbered device, not zeroth. (acc_shutdown_1): Close all devices of a given type. (goacc_destroy_thread): Don't use base_dev. (lazy_open, lazy_init, lazy_init_and_open): Remove. (goacc_attach_host_thread_to_device): New. (acc_init): Reimplement with goacc_attach_host_thread_to_device. (acc_get_num_devices): Don't use base_dev. (acc_set_device_type): Reimplement. (acc_get_device_type): Don't use base_dev. (acc_get_device_num): Tweak
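The message above mentions keeping per-image state inside the nvptx backend as an associative list. A minimal sketch of such a list, keyed by the host-side image pointer, might look like the following. The sketch_ names are hypothetical and the metadata is an opaque pointer; the ChangeLog suggests the real plugin guards its list with a lock (ptx_image_lock), which is omitted here for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* One list node per loaded image.  */
struct sketch_image_entry
{
  const void *image;                /* Key: host-side image pointer.  */
  void *metadata;                   /* Value: e.g. a module handle.  */
  struct sketch_image_entry *next;
};

/* Head of the list (assumption: not thread-safe in this sketch).  */
static struct sketch_image_entry *sketch_images;

/* Record metadata for IMAGE.  Returns 0 on success, -1 if out of
   memory.  */
static int
sketch_image_add (const void *image, void *metadata)
{
  assert (image != NULL);

  struct sketch_image_entry *e = malloc (sizeof *e);
  if (e == NULL)
    return -1;

  e->image = image;
  e->metadata = metadata;
  e->next = sketch_images;
  sketch_images = e;
  return 0;
}

/* Unlink the entry for IMAGE and return its metadata, or NULL if the
   image was never registered.  */
static void *
sketch_image_remove (const void *image)
{
  for (struct sketch_image_entry **p = &sketch_images; *p; p = &(*p)->next)
    if ((*p)->image == image)
      {
        struct sketch_image_entry *e = *p;
        void *metadata = e->metadata;
        *p = e->next;
        free (e);
        return metadata;
      }
  return NULL;
}
```

A load hook would call sketch_image_add after creating its module, and the matching unload hook would call sketch_image_remove to get the handle back. A linear list is likely adequate when only a handful of images are registered; the hash table suggested earlier in the thread would scale better.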
Re: Merge current set of OpenACC changes from gomp-4_0-branch
On Tue, 3 Feb 2015 14:28:44 +0300 Ilya Verbin iver...@gmail.com wrote:

Hi Julian!

On 27 Jan 14:07, Julian Brown wrote:

On Mon, 26 Jan 2015 17:34:26 +0300 Ilya Verbin iver...@gmail.com wrote:

Here is my current patch, it works for OpenMP-MIC, but obviously will not work for PTX, since it requires symmetrical changes in the plugin. Could you please take a look, whether it is possible to support this new interface in PTX plugin?

I think it can probably be made to work. I'll have a look in more detail.

Do you have any progress on this?

I'm still working on a patch to update OpenACC support and the PTX backend to use load/unload_image and to unify initialisation/opening. So far I think the answer is basically yes, the new interface can be supported, though I might request a minor tweak -- e.g. that load_image takes an extra void ** argument so that a libgomp backend can allocate a block of generic metadata relating to the image, then that same block would be passed (void *) to the unload hook so the backend can use it there and deallocate it when it's finished with. Would that be possible? (It'd mostly be for a CUmodule handle: this could be stashed away somewhere within the nvptx backend, but it might be neater to put it in generic code since it'll probably be useful for other backends anyway.)

Thanks,

Julian
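The tweak requested above (a load hook that hands an opaque metadata block back through a void ** argument, and an unload hook that receives the same block) can be sketched as follows. The hook names, signatures and the sketch_module record are hypothetical illustrations of the proposal, not the interface that was eventually adopted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical backend-private state for one loaded image; a real
   nvptx backend would keep something like a CUmodule handle here.  */
struct sketch_module
{
  const void *image;
  int loaded;
};

/* Load hook: allocate the backend state and hand it to the generic
   caller through METADATA.  Returns 0 on success, -1 on failure.  */
static int
sketch_load_image (const void *image, void **metadata)
{
  struct sketch_module *m = malloc (sizeof *m);
  if (m == NULL)
    return -1;

  m->image = image;
  m->loaded = 1;
  *metadata = m;
  return 0;
}

/* Unload hook: receives the block produced by the load hook, uses it
   and then releases it.  */
static void
sketch_unload_image (void *metadata)
{
  struct sketch_module *m = metadata;
  assert (m != NULL && m->loaded);
  free (m);
}
```

With this shape the generic code only stores and returns a void *, so the backend needs no lookup table of its own; the cost is one extra pointer per image held in generic code.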
Re: Merge current set of OpenACC changes from gomp-4_0-branch
On Mon, 26 Jan 2015 17:34:26 +0300 Ilya Verbin iver...@gmail.com wrote:
> Here is my current patch, it works for OpenMP-MIC, but obviously will not work for PTX, since it requires symmetrical changes in the plugin. Could you please take a look, whether it is possible to support this new interface in PTX plugin?

I think it can probably be made to work. I'll have a look in more detail.

Thanks,

Julian
Re: Merge current set of OpenACC changes from gomp-4_0-branch
On Mon, 26 Jan 2015 14:44:19 +0100 Thomas Schwinge tho...@codesourcery.com wrote:
> On 17 Jan 02:16, Ilya Verbin wrote:
> > Unfortunately, it broke offloading from shared libraries (I mean common libs with NEEDED entries, not dlopened).
>
> Sorry for that!
>
> > Such things are not covered by the testsuite, that's why you missed this issue. Here is a simple testcase: http://news.gmane.org/find-root.php?message_id=%3C20150116231632.GB48380%40msticlxl57.ims.intel.com%3E
>
> Probably a good motivation for adding such a test case. ;-)
>
> > So, you don't assume that a device can have multiple images from multiple libs?
> >
> > Ping?
>
> This probably is just a bug that we introduced with our changes? (Julian?)

AFAICR, we haven't yet figured out how to make (shared) libraries work with PTX. Actually I'm not entirely sure if static libraries containing PTX code will work either. But, multiple images (e.g. from different object files) are supported, via the loop in gomp_target_init. (The semantics of gomp_register_image_for_device were changed, but not -- intentionally! -- to limit the number of offloaded images to one.)

> > Also, could you please explain, why did you divide a device initialization into two functions -- gomp_init_device and gomp_init_tables?
>
> As I understand it (again, Julian, please correct me if I got that wrong), the reason is that for OpenACC support, we need these as two separate (independent) actions. Is this causing problems for OpenMP offloading?

This was certainly necessary at some point, when the support for multiple devices of the same type in the OpenACC runtime was delegated entirely to target-dependent code. Later (after one round of refactoring), the gomp_device_descr and the memory map were still separate, with the former possibly representing a number of devices, and the latter having independent copies for each instance of a device.

That's largely been refactored (again) away now though -- a gomp_device_descr and its memory map are stored together, per-device instance. So this separation of their initialisation can probably go away, although some (somewhat delicate) code in oacc-init.c would need to be tweaked.

Julian
Re: [PATCH 4/5] OpenACC 2.0 support for libgomp - new tests (repost)
On Sat, 15 Nov 2014 00:58:56 + Julian Brown jul...@codesourcery.com wrote:
> On Thu, 13 Nov 2014 11:15:18 +0100 Jakub Jelinek ja...@redhat.com wrote:
> > > +# Turn on OpenACC.
> > > +# XXX (TEMPORARY): Remove the -flto once that's properly integrated.
> > > +lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
> >
> > Do you still need that?
>
> I'm not sure -- I can't easily check on trunk without the middle-end bits, and I haven't tried to incorporate those in my testing yet. I'll try to check this on e.g. the gomp4 branch soon.

It seems that -flto *is* still needed at present -- I'm not sure what the plan was for integrating it properly. Making -fopenacc imply -flto via specs or similar?

Thanks,

Julian
Re: [PATCH 3/5] OpenACC 2.0 support for libgomp - outline documentation (repost)
On Thu, 13 Nov 2014 11:05:10 +0100 Tobias Burnus tobias.bur...@physik.fu-berlin.de wrote: Jakub Jelinek wrote: -* libgomp: (libgomp).GNU OpenMP runtime library +* libgomp: (libgomp).GNU OpenACC and OpenMP runtime library @end direntry See Dave Malcolm's patch, please integrate it into your patchset. Namely, https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01317.html However, a grep shows also the following spots which have to be updated: gcc/fortran/gfortran.texi-@option{-fopenmp}. This also arranges for automatic linking of the gcc/fortran/gfortran.texi:GNU OpenMP runtime library @ref{Top,,libgomp,libgomp,GNU OpenMP gcc/fortran/gfortran.texi-runtime library}. -- gcc/fortran/intrinsic.texi-@file{omp_lib.h}. The procedures provided by @code{OMP_LIB} can be found gcc/fortran/intrinsic.texi:in the @ref{Top,,Introduction,libgomp,GNU OpenMP runtime library} manual, gcc/fortran/intrinsic.texi-the named constants defined in the modules are listed -- gcc/doc/sourcebuild.texi-@item libgomp gcc/doc/sourcebuild.texi:The GNU OpenMP runtime library. gcc/doc/sourcebuild.texi- Thanks -- here's a new version of the patch, which incorporates David Malcolm's new backronym for libgomp, and edits the above files also. Juliancommit 06fc24fb9ffcf70aa49158f12db3f592bca5c3ff Author: Julian Brown jul...@codesourcery.com Date: Thu Nov 13 04:21:16 2014 -0800 OpenACC documentation. -xx-xx Thomas Schwinge tho...@codesourcery.com James Norris jnor...@codesourcery.com David Malcolm dmalc...@redhat.com Julian Brown jul...@codesourcery.com libgomp/ * libgomp.texi: Outline documentation for OpenACC. diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi index 20a206d..373dbb6 100644 --- a/gcc/doc/sourcebuild.texi +++ b/gcc/doc/sourcebuild.texi @@ -89,7 +89,7 @@ The Go runtime library. The bulk of this library is mirrored from the @uref{http://code.google.com/@/p/@/go/, master Go repository}. @item libgomp -The GNU OpenMP runtime library. +The GNU Offloading and Multi Processing library. 
@item libiberty The @code{libiberty} library, used for portability and for some diff --git a/gcc/fortran/intrinsic.texi b/gcc/fortran/intrinsic.texi index 90c9a3a..52db989 100644 --- a/gcc/fortran/intrinsic.texi +++ b/gcc/fortran/intrinsic.texi @@ -14030,8 +14030,8 @@ The OpenMP Fortran runtime library routines are provided both in a form of two Fortran 90 modules, named @code{OMP_LIB} and @code{OMP_LIB_KINDS}, and in a form of a Fortran @code{include} file named @file{omp_lib.h}. The procedures provided by @code{OMP_LIB} can be found -in the @ref{Top,,Introduction,libgomp,GNU OpenMP runtime library} manual, -the named constants defined in the modules are listed +in the @ref{Top,,Introduction,libgomp,GNU Offloading and Multi Processing +library} manual, the named constants defined in the modules are listed below. For details refer to the actual diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index 254be57..4bd7ab8 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -31,11 +31,14 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) @ifinfo @dircategory GNU Libraries @direntry -* libgomp: (libgomp).GNU OpenMP runtime library +* libgomp: (libgomp). GNU Offloading and Multi Processing Runtime library @end direntry -This manual documents the GNU implementation of the OpenMP API for -multi-platform shared-memory parallel programming in C/C++ and Fortran. +This manual documents libgomp, the GNU Offloading and Multi +Processing Runtime library. This is the GNU implementation of the OpenMP +API for multi-platform shared-memory parallel programming in C/C++ and +Fortran and of the OpenACC and OpenMP APIs for offloading of code to accelerator +devices from the same languages. 
Published by the Free Software Foundation 51 Franklin Street, Fifth Floor @@ -48,7 +51,7 @@ Boston, MA 02110-1301 USA @setchapternewpage odd @titlepage -@title The GNU OpenMP Implementation +@title The GNU OpenACC and OpenMP Implementation @page @vskip 0pt plus 1filll @comment For the @value{version-GCC} Version* @@ -69,7 +72,11 @@ Boston, MA 02110-1301, USA@* @top Introduction @cindex Introduction -This manual documents the usage of libgomp, the GNU implementation of the +This manual documents the usage of libgomp, the GNU Offloading and Multi +Processing Runtime library. This is the GNU implementation of the +@uref{http://www.openacc.org/, OpenACC} Application Programming Interface (API) +for offloading of code to accelerator devices in C/C++ and Fortran, and +the GNU implementation of the @uref{http://www.openmp.org, OpenMP} Application Programming Interface (API) for multi-platform shared-memory parallel programming in C/C++ and Fortran. @@ -81,23 +88,617 @@ for multi-platform shared-memory parallel programming in C/C
Re: [PATCH 1/5] OpenACC 2.0 support for libgomp - OpenACC runtime, NVidia PTX/CUDA plugin (repost)
On Wed, 12 Nov 2014 11:06:26 +0100 Jakub Jelinek ja...@redhat.com wrote:
> On Tue, Nov 11, 2014 at 01:53:23PM +, Julian Brown wrote:
> > A few OpenMP tests fail with the new host_nonshm plugin (with failures of the form libgomp: Trying to update [0x605820..0x605824) object that is not mapped), probably because of middle-end bugs. I haven't investigated those in detail.
>
> Depends how exactly your host_nonshm plugin works. A few tests in the testsuite use #pragma omp declare target variables, so if host_nonshm plugin is something like I had on the gomp-4_0-branch initially as hackish device 257, where code is run on the host, and map directives simply malloc/free host memory and memcpy stuff around, then without extra work the #pragma omp declare target variables indeed can't work. You'd either need to support a strange partially shared memory model, where #pragma omp declare target variables would be shared (you'd still need to populate the mapping data structures with those vars and identity map them), or not so conforming model where you'd map them on entering the target regions if they aren't mapped yet (the thing is that then if the variables are changed on the host in between the start of the program and the target region, you'd use the changed values instead the values they were originally assigned), or map them in some constructor (but, how would you know if a host_nonshm plugin is going to be used in the future).

Thanks for the review! I'll work on addressing your comments. Your characterization of the host_nonshm plugin sounds accurate, but OOI, what does the Intel MIC plugin do differently that means it is not subject to the same problem with target variables?

> One can always use the intelmicemul plugin to test nonshared-memory stuff without any HW (provided the host is x86_64/i686), so do we really need host_nonshm plugin?

It might still be useful for testing (non-shm) OpenACC without hardware, I guess (or for pedagogical purposes) -- perhaps we could remove the TARGET_CAP_OPENMP_400 flag, if that's not expected to work.

Julian
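The behaviour being discussed (code runs on the host, mappings are plain malloc/memcpy copies, and an update of an object that was never mapped fails) can be pictured with a small sketch in C. This is an illustration under stated assumptions, not the host_nonshm plugin: toy_map, toy_update_to_device, toy_unmap and the fixed-size table are all hypothetical names invented for the example, and the failure is reported with a return value of -1 rather than a fatal error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_MAPPINGS 16

/* One host object and its separately allocated "device" copy.  */
struct toy_mapping
{
  void *host;
  void *dev;
  size_t size;
};

static struct toy_mapping toy_table[TOY_MAX_MAPPINGS];

/* Return the mapping for HOST, or NULL if HOST is not mapped.  */
static struct toy_mapping *
toy_lookup (const void *host)
{
  for (int i = 0; i < TOY_MAX_MAPPINGS; i++)
    if (toy_table[i].dev != NULL && toy_table[i].host == host)
      return &toy_table[i];
  return NULL;
}

/* Map HOST: allocate a copy and initialise it from the host data.
   Returns the copy, or NULL on failure or if HOST is already mapped.  */
void *
toy_map (void *host, size_t size)
{
  if (size == 0 || toy_lookup (host) != NULL)
    return NULL;
  for (int i = 0; i < TOY_MAX_MAPPINGS; i++)
    if (toy_table[i].dev == NULL)
      {
        void *dev = malloc (size);
        if (dev == NULL)
          return NULL;
        memcpy (dev, host, size);
        toy_table[i] = (struct toy_mapping) { host, dev, size };
        return dev;
      }
  return NULL;
}

/* Refresh the copy from the host data.  Returns -1 if HOST is not mapped,
   which is the situation behind the "not mapped" failures above.  */
int
toy_update_to_device (const void *host)
{
  struct toy_mapping *m = toy_lookup (host);
  if (m == NULL)
    return -1;
  memcpy (m->dev, m->host, m->size);
  return 0;
}

/* Drop the mapping for HOST and free its copy.  Returns -1 if not mapped.  */
int
toy_unmap (const void *host)
{
  struct toy_mapping *m = toy_lookup (host);
  if (m == NULL)
    return -1;
  free (m->dev);
  m->host = NULL;
  m->dev = NULL;
  m->size = 0;
  return 0;
}
```

In this picture a declare target variable corresponds to an object that nothing ever passed to toy_map, so the first update on it fails unless the runtime maps it up front by one of the routes Jakub lists.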
[PATCH 2/5] OpenACC 2.0 support for libgomp - temporarily work around missing __builtin_acc_on_device (repost)
On Tue, 23 Sep 2014 19:19:55 +0100 Julian Brown jul...@codesourcery.com wrote:
> The patches implementing __builtin_acc_on_device are still in processing. For the time being this patch removes the dependency on that builtin in the OpenACC runtime.
>
> Julian
>
> -xx-xx  Julian Brown  jul...@codesourcery.com
>
>     libgomp/
>     * oacc-init.c (acc_on_device): Temporarily hard-code for host
>     instead of using __builtin_acc_on_device.

This patch remains unchanged from the last posting.

OK to apply?

Julian

From 99e76023ff0759925403b43e19612fb859c3759e Mon Sep 17 00:00:00 2001
From: Julian Brown jul...@codesourcery.com
Date: Fri, 19 Sep 2014 11:28:11 -0700
Subject: [PATCH 2/5] Work around lack of __builtin_acc_on_device for now

-xx-xx  Julian Brown  jul...@codesourcery.com

    libgomp/
    * oacc-init.c (acc_on_device): Temporarily hard-code for host
    instead of using __builtin_acc_on_device.
---
 libgomp/oacc-init.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 8c91ea7..1cbb4d7 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -545,8 +545,20 @@ acc_on_device (acc_device_t dev)
       acc_device_type (thr->dev->type) == acc_device_host_nonshm)
     return dev == acc_device_host_nonshm || dev == acc_device_not_host;
 
+#if 1
+  /* Support for __builtin_acc_on_device comes in later patches.  */
+  switch (dev)
+    {
+    case acc_device_none:
+    case acc_device_host:
+      return 1;
+    default:
+      return 0;
+    }
+#else
   /* Just rely on the compiler builtin.  */
   return __builtin_acc_on_device (dev);
+#endif
 }
 
 ialias (acc_on_device)
-- 
1.7.10.4
[PATCH 3/5] OpenACC 2.0 support for libgomp - outline documentation (repost)
On Tue, 23 Sep 2014 19:20:14 +0100 Julian Brown jul...@codesourcery.com wrote: This patch provides some documentation for the new OpenACC bits in libgomp. Julian -xx-xx Thomas Schwinge tho...@codesourcery.com James Norris jnor...@codesourcery.com libgomp/ * libgomp.texi: Outline documentation for OpenACC. This patch also remains unchanged from the last posting. OK to apply? JulianFrom 1f17beb70b5607d1884fad1cb4734857f0e7846f Mon Sep 17 00:00:00 2001 From: Julian Brown jul...@codesourcery.com Date: Mon, 22 Sep 2014 02:45:29 -0700 Subject: [PATCH 3/5] OpenACC documentation. -xx-xx Thomas Schwinge tho...@codesourcery.com James Norris jnor...@codesourcery.com libgomp/ * libgomp.texi: Outline documentation for OpenACC. --- libgomp/libgomp.texi | 661 -- 1 file changed, 636 insertions(+), 25 deletions(-) diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index 254be57..9530a2b 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -31,10 +31,12 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) @ifinfo @dircategory GNU Libraries @direntry -* libgomp: (libgomp).GNU OpenMP runtime library +* libgomp: (libgomp).GNU OpenACC and OpenMP runtime library @end direntry -This manual documents the GNU implementation of the OpenMP API for +This manual documents the GNU implementation of the OpenACC API for +offloading of code to accelerator devices in C/C++ and Fortran and +the GNU implementation of the OpenMP API for multi-platform shared-memory parallel programming in C/C++ and Fortran. 
Published by the Free Software Foundation @@ -48,7 +50,7 @@ Boston, MA 02110-1301 USA @setchapternewpage odd @titlepage -@title The GNU OpenMP Implementation +@title The GNU OpenACC and OpenMP Implementation @page @vskip 0pt plus 1filll @comment For the @value{version-GCC} Version* @@ -69,7 +71,10 @@ Boston, MA 02110-1301, USA@* @top Introduction @cindex Introduction -This manual documents the usage of libgomp, the GNU implementation of the +This manual documents the usage of libgomp, the GNU implementation of the +@uref{http://www.openacc.org/, OpenACC} Application Programming Interface (API) +for offloading of code to accelerator devices in C/C++ and Fortran, and +the GNU implementation of the @uref{http://www.openmp.org, OpenMP} Application Programming Interface (API) for multi-platform shared-memory parallel programming in C/C++ and Fortran. @@ -81,23 +86,619 @@ for multi-platform shared-memory parallel programming in C/C++ and Fortran. @comment better formatting. @comment @menu -* Enabling OpenMP::How to enable OpenMP for your applications. -* Runtime Library Routines:: The OpenMP runtime application programming - interface. -* Environment Variables:: Influencing runtime behavior with environment - variables. -* The libgomp ABI::Notes on the external ABI presented by libgomp. -* Reporting Bugs:: How to report bugs in GNU OpenMP. -* Copying::GNU general public license says - how you can copy and share libgomp. -* GNU Free Documentation License:: - How you can copy and share this manual. -* Funding::How to help assure continued work for free - software. -* Library Index:: Index of this documentation. +* Enabling OpenACC:: How to enable OpenACC for your + applications. +* OpenACC Runtime Library Routines:: The OpenACC runtime application + programming interface. +* OpenACC Environment Variables::Influencing OpenACC runtime behavior with + environment variables. 
+* OpenACC Library Interoperability:: OpenACC library interoperability with the + NVIDIA CUBLAS library. +* Enabling OpenMP:: How to enable OpenMP for your + applications. +* OpenMP Runtime Library Routines: Runtime Library Routines. + The OpenMP runtime application programming + interface. +* OpenMP Environment Variables: Environment Variables. + Influencing OpenMP runtime behavior with + environment variables. +* The libgomp ABI:: Notes on the external libgomp ABI. +* Reporting Bugs:: How to report bugs. +* Copying:: GNU general public license says how you + can copy and share libgomp. +* GNU Free Documentation License:: How you can copy and share
[PATCH 5/5] OpenACC 2.0 support for libgomp - temporary test harness tweaks
Hi,

As mentioned in the previous mail in this series, testing the OpenACC runtime support in libgomp is going to be awkward until the associated middle-end pieces are ready. This stop-gap patch helps to allow tests (that don't use any of the pragmas, only calling the run-time library directly) to run successfully.

OK to apply?

Thanks,

Julian

ChangeLog

    libgomp/
    * testsuite/libgomp.oacc-c++/c++.exp (ALWAYS_CFLAGS): Temporarily
    replace -fopenacc with -lgomp -lpthread, until -fopenacc support
    lands upstream.
    * testsuite/libgomp.oacc-c/c.exp (ALWAYS_CFLAGS): Likewise.
    * testsuite/libgomp.oacc-fortran/fortran.exp (ALWAYS_CFLAGS):
    Similar, but without -lpthread.

From c70f2aca94bc306e4600282aa81bc1a758ad81fa Mon Sep 17 00:00:00 2001
From: Julian Brown jul...@codesourcery.com
Date: Tue, 11 Nov 2014 02:54:09 -0800
Subject: [PATCH 5/5] Temporary testing tweaks

    libgomp/
    * testsuite/libgomp.oacc-c++/c++.exp (ALWAYS_CFLAGS): Temporarily
    replace -fopenacc with -lgomp -lpthread, until -fopenacc support
    lands upstream.
    * testsuite/libgomp.oacc-c/c.exp (ALWAYS_CFLAGS): Likewise.
    * testsuite/libgomp.oacc-fortran/fortran.exp (ALWAYS_CFLAGS):
    Similar, but without -lpthread.
---
 libgomp/testsuite/libgomp.oacc-c++/c++.exp         |    4 +++-
 libgomp/testsuite/libgomp.oacc-c/c.exp             |    4 +++-
 libgomp/testsuite/libgomp.oacc-fortran/fortran.exp |    4 +++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c++/c++.exp b/libgomp/testsuite/libgomp.oacc-c++/c++.exp
index b8b3e85..1060344 100644
--- a/libgomp/testsuite/libgomp.oacc-c++/c++.exp
+++ b/libgomp/testsuite/libgomp.oacc-c++/c++.exp
@@ -23,7 +23,9 @@ dg-init
 
 # Turn on OpenACC.
 # XXX (TEMPORARY): Remove the -flto once that's properly integrated.
-lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
+#lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
+# TODO: Revert this temporary hack when OpenACC middle-end pieces are submitted.
+lappend ALWAYS_CFLAGS additional_flags=-lgomp -flto -lpthread
 
 set blddir [lookfor_file [get_multilibs] libgomp]
 
diff --git a/libgomp/testsuite/libgomp.oacc-c/c.exp b/libgomp/testsuite/libgomp.oacc-c/c.exp
index 5558ec8..85528aa 100644
--- a/libgomp/testsuite/libgomp.oacc-c/c.exp
+++ b/libgomp/testsuite/libgomp.oacc-c/c.exp
@@ -28,7 +28,9 @@ dg-init
 
 # Turn on OpenACC.
 # XXX (TEMPORARY): Remove the -flto once that's properly integrated.
-lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
+#lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
+# TODO: Revert temporary hack when OpenACC middle-end pieces are submitted.
+lappend ALWAYS_CFLAGS additional_flags=-lgomp -flto -lpthread
 
 lappend libgomp_compile_options compiler=$GCC_UNDER_TEST
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp b/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp
index 0ada038..27cf4d5 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp
+++ b/libgomp/testsuite/libgomp.oacc-fortran/fortran.exp
@@ -23,7 +23,9 @@ dg-init
 
 # Turn on OpenACC.
 # XXX (TEMPORARY): Remove the -flto once that's properly integrated.
-lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
+#lappend ALWAYS_CFLAGS additional_flags=-fopenacc -flto
+# TODO: Revert this temporary hack when OpenACC middle-end pieces are submitted.
+lappend ALWAYS_CFLAGS additional_flags=-lgomp -flto
 
 if { $blddir != "" } {
     set lang_source_re {^.*\.[fF](|90|95|03|08)$}
-- 
1.7.10.4
Re: [patch] OpenACC fortran front end
On Tue, 11 Nov 2014 08:10:29 +0100 Jakub Jelinek ja...@redhat.com wrote:

On Mon, Nov 10, 2014 at 02:43:38PM -0800, Cesar Philippidis wrote:

I'll post a separate patch with the fortran tests later. If anyone wants to test this patch, please use gomp-4_0-branch instead. You don't need a CUDA accelerator to use OpenACC, and some of the runtime tests will fail because that branch doesn't include the nvptx backend.

Now that the first series of PTX target patches have been committed:

I assume it is still true that nvptx doesn't work because the libgomp bits aren't in yet, isn't it?

That's correct. The nvptx backend also depends on the offloading changes that a team from Intel is working on for the MIC target. But Julian should be posting the libgomp patches tomorrow, I think, since his changes are somewhat self-contained.

For the middle-end and libgomp changes, can you talk to the Intel folks to update their git branch to latest trunk (so that you have the nvptx bits in there) and send middle-end and libgomp diffs against that?

As far as I remember, most of the changes from the branch are now approved, they are just waiting for review of the LTO related changes in the middle-end (please, correct me if I've missed something).

We've been preparing new patches against trunk for the libgomp and middle-end bits: I've now posted the former, and the latter are on their way soon, I believe. The middle-end bits are also present on the gomp-4_0-branch SVN branch (likewise, the libgomp pieces), and I believe we're planning to merge the PTX bits there also now they've been committed to trunk. Is it really worthwhile merging our patches to yet another branch at this stage?

Thanks,

Julian
Re: [patch] OpenACC fortran front end
On Tue, 11 Nov 2014 17:51:01 +0100 Jakub Jelinek ja...@redhat.com wrote:
> On Tue, Nov 11, 2014 at 02:52:20PM +, Julian Brown wrote:
> > On Tue, 11 Nov 2014 08:10:29 +0100 Jakub Jelinek ja...@redhat.com wrote:
> >
> > We've been preparing new patches against trunk for the libgomp and middle-end bits: I've now posted the former, and the latter are on their way soon, I believe. The middle-end bits are also present on the gomp-4_0-branch SVN branch (likewise, the libgomp pieces), and I believe we're planning to merge the PTX bits there also now they've been committed to trunk. Is it really worthwhile merging our patches to yet another branch at this stage?
>
> The point is that the kyukhin/gomp4-offload branch is mostly reviewed now (waiting for Richard and/or Honza now to review the last LTO bits) and your patches have huge overlap with that, so sending patches against trunk that implement the same thing would mean reviewing the same bits again, and worse if there are conflicts between the two patchsets, if both patchsets were to be approved, one couldn't be committed anyway.

Yeah, understood, and apologies for not making that clearer: as Cesar mentions, my patches are meant to apply (as well as I could manage) on top of Ilya's ones that have mostly been approved, and there should be no overlap in functionality (Ilya's patches subsume patches 1-6 in my previously-posted series).

Our approach to branch management perhaps hasn't been perfect here -- it didn't dawn on me until quite late in the submission process that Intel had been working on their own branch rather than the gomp-4_0-branch, and that the patches they would be posting would be based on the former rather than the latter. But, we've tried hard to accommodate the differences that have arisen in the meantime.

Thanks,

Julian
Re: [gomp4] Move libgomp plugins into subdirectory
On Thu, 6 Nov 2014 10:06:00 +0100 Thomas Schwinge tho...@codesourcery.com wrote:
> Hi Julian!
>
> On Wed, 5 Nov 2014 17:57:10 +, Julian Brown jul...@codesourcery.com wrote:
> > This patch moves plugin-nvptx.c and plugin-host.c (from oacc-host.c) into a new plugin subdirectory, as requested by Jakub, and to match more closely the layout of the Intel MIC pieces. This also moves the autotools bits to enable the NVPTX plugin and locate CUDA libraries into the plugin directory's (new) configury bits.
>
> Hmm. And then we cross-include files in libgomp/ from libgomp/plugin/ as well as the other way round (libgomp/oacc-host.c including libgomp/plugin/plugin-host.c, for example) -- whilst these two regimes are configured by two separate Autoconf instances? Is this really the intended scheme, or should we maybe rather have a top-level libgomp Autoconf/Automake system (as before), which is amended by libgomp/plugin/configfrag.ac and libgomp/plugin/Makefrag.am files that are included from libgomp/configure.ac and libgomp/Makefile.am?

I don't know -- I was trying to follow existing practice (or how I imagine that to be) with regard to recursive autotools invocations (e.g. libjava/libltdl), and I have some FUD, probably misplaced, about how well non-recursive autotools works. A couple of the header files (oacc-plugin.h, libgomp-plugin.h) might be better placed within the plugin directory, but plugins will generally still need to include some headers direct from libgomp/. Maybe this reorg is just a bad idea?

> > Test results look reasonable with my (patched for PTX support) version of the gomp4 branch. I'll apply it there shortly.
>
> Mid-air collision with my yesterday's libgomp changes -- with your patch in (r217162), gomp-4_0-branch doesn't even build; the files added/moved to libgomp/plugins/ are missing some of my changes. (I didn't look/compare in more detail.)

Apologies, I thought I'd fixed those up, but it looks like I missed a bit.

> >     libgomp/
> >     * Makefile.am (SUBDIRS): Add plugin.
> >     (DIST_SUBDIRS): Define.
> >     (libgomp_plugin_nvptx_*): Remove nvptx support from here.
> >     (libgomp_plugin_host_nonshm_*): Likewise.
> >     * Makefile.in: Regenerate.
> >     * configure: Regenerate.
> >     * oacc-host.c: Replace with #include of plugin/plugin-host.c code,
> >     move implementation to the latter.
> >     * plugin/plugin-host.c: New file.
> >     * plugin-nvptx.c: Move to...
> >     * plugin/plugin-nvptx.c: New file.
> >     * plugin/Makefile.am: New.
> >     * plugin/Makefile.in: Regenerate.
> >     * plugin/aclocal.m4: Regenerate.
> >     * plugin/configure: Regenerate.
>
> Please check in the regenerated libgomp/config.h.in, update contrib/gcc_update, and make generation of libgomp/testsuite/libgomp-test-support.exp work again, that is, substitution of @CUDA_DRIVER_INCLUDE@ and @CUDA_DRIVER_LIB@ (perhaps move instantiation from libgomp/configure.ac to libgomp/plugin/configure.ac).

I'll fix this.

Thanks,

Julian
Re: [gomp4] Move libgomp plugins into subdirectory
On Thu, 6 Nov 2014 11:11:42 +0100 Jakub Jelinek ja...@redhat.com wrote: On Thu, Nov 06, 2014 at 10:06:00AM +0100, Thomas Schwinge wrote: Hi Julian! On Wed, 5 Nov 2014 17:57:10 +, Julian Brown jul...@codesourcery.com wrote: This patch moves plugin-nvptx.c and plugin-host.c (from oacc-host.c) into a new plugin subdirectory, as requested by Jakub, and to match more closely the layout of the Intel MIC pieces. This also moves the autotools bits to enable the NVPTX plugin and locate CUDA libraries into the plugin directory's (new) configury bits. Hmm. And then we cross-include files in libgomp/ from libgomp/plugin/ as well as the other way round (libgomp/oacc-host.c including libgomp/plugin/plugin-host.c, for example) -- whilst these two regimes are configured by two separate Autoconf instances? Is this really the intended scheme, or should we maybe rather have a top-level libgomp Autoconf/Automake system (as before), which is amended by libgomp/plugin/configfrag.ac and libgomp/plugin/Makefrag.am files that are included from libgomp/configure.ac and libgomp/Makefile.am? I'll apply the attached fixes for now in case anyone's blocked on the broken libgomp build, and then... I agree a plugin fragment into libgomp/configure.ac and/or libgomp/Makefile* is better. work on refactoring those configury bits (which will revert some of the attached, including moving libgomp-test-support.exp.in back to its previous location, but never mind). Thanks, Julian ChangeLog * contrib/gcc_update (libgomp/plugin/aclocal.m4) (libgomp/plugin/Makefile.in, libgomp/plugin/configure) (libgomp/plugin/config.h.in): Add. libgomp/ * oacc-init.c (resolve_device, _acc_init): Fix init_device_func hook naming. * plugin/plugin-host.c (GOMP_OFFLOAD_openacc_avail): Remove. (host_dispatch): Don't set avail_func hook. * plugin/configure.ac (libgomp-test-support.exp): Add to AC_CONFIG_FILES. * plugin/configure: Regenerate. * testsuite/libgomp-test-support.exp.in: Move from here... 
* plugin/libgomp-test-support.exp.in: ...to here. * plugin/Makefile.in: Regenerate. * testsuite/lib/libgomp.exp (libgomp-test-support.exp): Find in plugin dir, for now. * testsuite/Makefile.in: Regenerate. * configure.ac (testsuite/libgomp-test-support.exp): Remove from AC_CONFIG_FILES. * config.h.in: Regenerate. * configure: Regenerate. Index: libgomp/oacc-init.c === --- libgomp/oacc-init.c (revision 217192) +++ libgomp/oacc-init.c (working copy) @@ -97,7 +97,7 @@ resolve_device (acc_device_t d) while (++d != _ACC_device_hwm) if (dispatchers[d] !strcasecmp (goacc_device_type, dispatchers[d]-name) - dispatchers[d]-device_init_func () 0) + dispatchers[d]-init_device_func () 0) goto found; gomp_fatal (device type %s not supported, goacc_device_type); @@ -112,7 +112,7 @@ resolve_device (acc_device_t d) case acc_device_not_host: /* Find the first available device after acc_device_not_host. */ while (++d != _ACC_device_hwm) - if (dispatchers[d] dispatchers[d]-device_init_func () 0) + if (dispatchers[d] dispatchers[d]-init_device_func () 0) goto found; if (d_arg == acc_device_default) { @@ -140,7 +140,7 @@ resolve_device (acc_device_t d) } /* This is called when plugins have been initialized, and serves to call - (indirectly) the target's device_init hook. Calling multiple times without + (indirectly) the target's init_device hook. Calling multiple times without an intervening _acc_shutdown call is an error. 
*/ static struct gomp_device_descr const * @@ -150,7 +150,7 @@ _acc_init (acc_device_t d) acc_dev = resolve_device (d); - if (!acc_dev || acc_dev-device_init_func () = 0) + if (!acc_dev || acc_dev-init_device_func () = 0) gomp_fatal (device %u not supported, (unsigned)d); if (acc_dev-is_initialized) Index: libgomp/plugin/plugin-host.c === --- libgomp/plugin/plugin-host.c (revision 217192) +++ libgomp/plugin/plugin-host.c (working copy) @@ -153,16 +153,6 @@ GOMP_OFFLOAD_get_table (struct mapping_t return 0; } -STATIC bool -GOMP_OFFLOAD_openacc_avail (void) -{ -#ifdef DEBUG - fprintf (stderr, SELF %s:%s\n, __FILE__, __FUNCTION__); -#endif - - return 1; -} - STATIC void * GOMP_OFFLOAD_openacc_open_device (int n) { @@ -415,9 +405,6 @@ static struct gomp_device_descr host_dis .get_device_num_func = GOMP_OFFLOAD_openacc_get_device_num, .set_device_num_func = GOMP_OFFLOAD_openacc_set_device_num, - /* Device available. */ - .avail_func = GOMP_OFFLOAD_openacc_avail, - .exec_func = GOMP_OFFLOAD_openacc_parallel, .register_async_cleanup_func Index: libgomp/plugin/configure.ac
Re: [gomp4] Move libgomp plugins into subdirectory
On Thu, 6 Nov 2014 15:37:42 + Julian Brown jul...@codesourcery.com wrote:

> On Thu, 6 Nov 2014 11:11:42 +0100 Jakub Jelinek ja...@redhat.com wrote:
>
> > On Thu, Nov 06, 2014 at 10:06:00AM +0100, Thomas Schwinge wrote:
> > > Hmm. And then we cross-include files in libgomp/ from
> > > libgomp/plugin/ as well as the other way round (libgomp/oacc-host.c
> > > including libgomp/plugin/plugin-host.c, for example) -- whilst
> > > these two regimes are configured by two separate Autoconf
> > > instances? Is this really the intended scheme, or should we maybe
> > > rather have a top-level libgomp Autoconf/Automake system (as
> > > before), which is amended by libgomp/plugin/configfrag.ac and
> > > libgomp/plugin/Makefrag.am files that are included from
> > > libgomp/configure.ac and libgomp/Makefile.am?
> >
> > I agree a plugin fragment into libgomp/configure.ac and/or
> > libgomp/Makefile* is better.
>
> [...] work on refactoring those configury bits (which will revert some
> of the attached, including moving libgomp-test-support.exp.in back to
> its previous location, but never mind).

Does this look like what you had in mind? (I think liboffloadmic uses a
similar recursive autotools invocation for its libgomp plugin -- maybe
that wants refactoring too?).

Thanks,

Julian

ChangeLog

    * contrib/gcc_update (libgomp/aclocal.m4, libgomp/Makefile.in)
    (libgomp/configure, libgomp/config.h.in): Add depends for plugin
    config fragments.
    (libgomp/plugin/aclocal.m4, libgomp/plugin/Makefile.in)
    (libgomp/plugin/configure, libgomp/plugin/config.h.in): Remove.

    libgomp/
    * Makefile.am (SUBDIRS): Remove plugin subdir.
    (DIST_SUBDIRS): Delete.
    (search_path): Add $(top_srcdir)/../include.
    (AM_CPPFLAGS): Remove -I$(top_srcdir)/../include.
    (plugin/Makefrag.in): Include.
    * Makefile.in: Regenerate.
    * configure.ac (plugin): Remove from AC_CONFIG_SUBDIRS.
    (plugin/configfrag.ac): Include.
    (testsuite/libgomp-test-support.exp): Add to AC_CONFIG_FILES.
    * configure: Regenerate.
    * plugin/Makefile.am: Remove, refactor into...
    * plugin/Makefrag.am: ...this.  New.
    * plugin/aclocal.m4: Remove.
    * plugin/config.h.in: Remove.
    * plugin/configure: Remove.
    * plugin/configure.ac: Remove, refactor into...
    * plugin/configfrag.ac: ...this.  New.
    * plugin/libgomp-test-support-exp.in: Move back to...
    * testsuite/libgomp-test-support-exp.in: Here.
    * testsuite/lib/libgomp.exp (libgomp-test-support.exp): Include from
    current directory, not plugin dir.

commit ea1335fc5a4aed75ad0f299969520f10e2f27435
Author: Julian Brown jul...@codesourcery.com
Date:   Thu Nov 6 11:54:25 2014 -0800

    Don't use recursive autoconf/automake for libgomp plugins

diff --git a/contrib/gcc_update b/contrib/gcc_update
index a50dc8c..2903d7a 100755
--- a/contrib/gcc_update
+++ b/contrib/gcc_update
@@ -138,15 +138,11 @@ libjava/libltdl/config-h.in: libjava/libltdl/configure.ac libjava/libltdl/acloca
 libcpp/aclocal.m4: libcpp/configure.ac
 libcpp/Makefile.in: libcpp/configure.ac libcpp/aclocal.m4
 libcpp/configure: libcpp/configure.ac libcpp/aclocal.m4
-libgomp/aclocal.m4: libgomp/configure.ac libgomp/acinclude.m4
-libgomp/Makefile.in: libgomp/Makefile.am libgomp/aclocal.m4
+libgomp/aclocal.m4: libgomp/configure.ac libgomp/acinclude.m4 libgomp/plugin/configfrag.ac
+libgomp/Makefile.in: libgomp/Makefile.am libgomp/aclocal.m4 libgomp/plugin/Makefrag.am
 libgomp/testsuite/Makefile.in: libgomp/testsuite/Makefile.am libgomp/aclocal.m4
-libgomp/configure: libgomp/configure.ac libgomp/aclocal.m4
-libgomp/config.h.in: libgomp/configure.ac libgomp/aclocal.m4
-libgomp/plugin/aclocal.m4: libgomp/plugin/configure.ac
-libgomp/plugin/Makefile.in: libgomp/plugin/Makefile.am libgomp/plugin/aclocal.m4
-libgomp/plugin/configure: libgomp/plugin/configure.ac libgomp/plugin/aclocal.m4
-libgomp/plugin/config.h.in: libgomp/plugin/configure.ac libgomp/plugin/aclocal.m4
+libgomp/configure: libgomp/configure.ac libgomp/aclocal.m4 libgomp/plugin/configfrag.ac
+libgomp/config.h.in: libgomp/configure.ac libgomp/aclocal.m4 libgomp/plugin/configfrag.ac
 libitm/aclocal.m4: libitm/configure.ac libitm/acinclude.m4
 libitm/Makefile.in: libitm/Makefile.am libitm/aclocal.m4
 libitm/testsuite/Makefile.in: libitm/testsuite/Makefile.am libitm/aclocal.m4
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index f265c5d..dc2f88a 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -1,21 +1,21 @@
 ## Process this file with automake to produce Makefile.in

 ACLOCAL_AMFLAGS = -I .. -I ../config
-SUBDIRS = testsuite plugin
-DIST_SUBDIRS = plugin
+SUBDIRS = testsuite

 ## May be used by toolexeclibdir.
 gcc_version := $(shell cat $(top_srcdir)/../gcc/BASE-VER)

 config_path = @config_path@
-search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir)
+search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
+	$(top_srcdir)/../include

 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version
[gomp4] Use GOMP_OFFLOAD_ prefix for (OpenACC) plugin hooks
Hi,

Mirroring changes in Ilya Verbin's libgomp offloading pieces posted to
trunk, this patch adds a prefix of GOMP_OFFLOAD_ to the OpenACC plugin
hooks. Some of these bits will not be needed for a trunk version of the
patch once Ilya's patch is approved (I'm hoping other incompatibilities
haven't crept in other than the renaming!).

I will apply to the gomp4 branch shortly.

Thanks,

Julian

ChangeLog

    libgomp/
    * oacc-host.c: Add GOMP_OFFLOAD_ prefix for plugin hooks.  Rename
    device_init to init_device, device_fini to fini_device,
    offload_register to register_image and remove extraneous device_
    from device_alloc, device_free, device_dev2host, device_host2dev and
    device_run.
    (host_dispatch): Use new names for hooks.
    * oacc-init.c: Use new names for hooks, throughout.
    * plugin-nvptx.c: Likewise.
    * target.c: Likewise.
    (gomp_load_plugin_for_device): Likewise.  Look for new hook names.
    * target.h (gomp_device_descr): Use new hook names.

commit 4e1b71a5e0d15de4c6e89ab5139964e32b563d68
Author: Julian Brown jul...@codesourcery.com
Date:   Wed Nov 5 02:34:22 2014 -0800

    Use GOMP_OFFLOAD_ prefix for plugin hooks.

diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index fc3e77c..02794bb 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -60,7 +60,7 @@ static struct gomp_device_descr host_dispatch;
 #endif

 STATIC const char *
-get_name (void)
+GOMP_OFFLOAD_get_name (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
@@ -74,7 +74,7 @@ get_name (void)
 }

 STATIC int
-get_type (void)
+GOMP_OFFLOAD_get_type (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
@@ -88,7 +88,7 @@ get_type (void)
 }

 STATIC unsigned int
-get_caps (void)
+GOMP_OFFLOAD_get_caps (void)
 {
   unsigned int caps = TARGET_CAP_OPENACC_200 | TARGET_CAP_OPENMP_400
     | TARGET_CAP_NATIVE_EXEC;
@@ -105,7 +105,7 @@ get_caps (void)
 }

 STATIC int
-get_num_devices (void)
+GOMP_OFFLOAD_get_num_devices (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
@@ -115,7 +115,7 @@ get_num_devices (void)
 }

 STATIC void
-offload_register (void *host_table, void *target_data)
+GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%p, %p)\n", __FILE__, __FUNCTION__, host_table,
@@ -124,17 +124,17 @@ offload_register (void *host_table, void *target_data)
 }

 STATIC int
-device_init (void)
+GOMP_OFFLOAD_init_device (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
 #endif

-  return get_num_devices ();
+  return GOMP_OFFLOAD_get_num_devices ();
 }

 STATIC int
-device_fini (void)
+GOMP_OFFLOAD_fini_device (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
@@ -144,7 +144,7 @@ device_fini (void)
 }

 STATIC int
-device_get_table (struct mapping_table **table)
+GOMP_OFFLOAD_get_table (struct mapping_table **table)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%p)\n", __FILE__, __FUNCTION__, table);
@@ -154,7 +154,7 @@ device_get_table (struct mapping_table **table)
 }

 STATIC bool
-openacc_avail (void)
+GOMP_OFFLOAD_openacc_avail (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
@@ -164,7 +164,7 @@ openacc_avail (void)
 }

 STATIC void *
-openacc_open_device (int n)
+GOMP_OFFLOAD_openacc_open_device (int n)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%u)\n", __FILE__, __FUNCTION__, n);
@@ -174,7 +174,7 @@ openacc_open_device (int n)
 }

 STATIC int
-openacc_close_device (void *hnd)
+GOMP_OFFLOAD_openacc_close_device (void *hnd)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%p)\n", __FILE__, __FUNCTION__, hnd);
@@ -184,7 +184,7 @@ openacc_close_device (void *hnd)
 }

 STATIC int
-openacc_get_device_num (void)
+GOMP_OFFLOAD_openacc_get_device_num (void)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s\n", __FILE__, __FUNCTION__);
@@ -194,7 +194,7 @@ openacc_get_device_num (void)
 }

 STATIC void
-openacc_set_device_num (int n)
+GOMP_OFFLOAD_openacc_set_device_num (int n)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%u)\n", __FILE__, __FUNCTION__, n);
@@ -205,7 +205,7 @@ openacc_set_device_num (int n)
 }

 STATIC void *
-device_alloc (size_t s)
+GOMP_OFFLOAD_alloc (size_t s)
 {
   void *ptr = GOMP(malloc) (s);

@@ -217,7 +217,7 @@ device_alloc (size_t s)
 }

 STATIC void
-device_free (void *p)
+GOMP_OFFLOAD_free (void *p)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%p)\n", __FILE__, __FUNCTION__, p);
@@ -227,7 +227,7 @@ device_free (void *p)
 }

 STATIC void *
-device_host2dev (void *d, const void *h, size_t s)
+GOMP_OFFLOAD_host2dev (void *d, const void *h, size_t s)
 {
 #ifdef DEBUG
   fprintf (stderr, SELF "%s:%s (%p, %p, %zd)\n", __FILE__, __FUNCTION__, d, h,
@@ -242,7 +242,7 @@ device_host2dev (void *d, const void *h, size_t s)
 }

 STATIC void *
-device_dev2host (void *h, const void *d, size_t s
[gomp4] Move libgomp plugins into subdirectory
Hi,

This patch moves plugin-nvptx.c and plugin-host.c (from oacc-host.c)
into a new plugin subdirectory, as requested by Jakub, and to match more
closely the layout of the Intel MIC pieces. This also moves the
autotools bits to enable the NVPTX plugin and locate CUDA libraries into
the plugin directory's (new) configury bits.

So far this only changes the location of the source files: the plugins
themselves are still installed to the same place as before (alongside
libgomp itself).

Test results look reasonable with my (patched for PTX support) version
of the gomp4 branch. I'll apply it there shortly.

Thanks,

Julian

ChangeLog

    libgomp/
    * Makefile.am (SUBDIRS): Add plugin.
    (DIST_SUBDIRS): Define.
    (libgomp_plugin_nvptx_*): Remove nvptx support from here.
    (libgomp_plugin_host_nonshm_*): Likewise.
    * Makefile.in: Regenerate.
    * configure: Regenerate.
    * oacc-host.c: Replace with #include of plugin/plugin-host.c code,
    move implementation to the latter.
    * plugin/plugin-host.c: New file.
    * plugin-nvptx.c: Move to...
    * plugin/plugin-nvptx.c: New file.
    * plugin/Makefile.am: New.
    * plugin/Makefile.in: Regenerate.
    * plugin/aclocal.m4: Regenerate.
    * plugin/configure: Regenerate.

commit 8994fb8c1b9d52cb9c82a61227a450df29e61806
Author: Julian Brown jul...@codesourcery.com
Date:   Wed Nov 5 02:54:30 2014 -0800

    Move libgomp plugins into their own directory.

diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index e0ab763..f265c5d 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -1,7 +1,8 @@
 ## Process this file with automake to produce Makefile.in

 ACLOCAL_AMFLAGS = -I .. -I ../config
-SUBDIRS = testsuite
+SUBDIRS = testsuite plugin
+DIST_SUBDIRS = plugin

 ## May be used by toolexeclibdir.
 gcc_version := $(shell cat $(top_srcdir)/../gcc/BASE-VER)
@@ -21,27 +22,6 @@ AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 toolexeclib_LTLIBRARIES = libgomp.la
 nodist_toolexeclib_HEADERS = libgomp.spec

-if PLUGIN_NVPTX
-# Nvidia PTX OpenACC plugin.
-libgomp_plugin_nvptx_version_info = -version-info $(libtool_VERSION)
-toolexeclib_LTLIBRARIES += libgomp-plugin-nvptx.la
-libgomp_plugin_nvptx_la_SOURCES = plugin-nvptx.c
-libgomp_plugin_nvptx_la_CPPFLAGS = $(AM_CPPFLAGS) $(PLUGIN_NVPTX_CPPFLAGS)
-libgomp_plugin_nvptx_la_LDFLAGS = $(libgomp_plugin_nvptx_version_info) \
-	$(lt_host_flags)
-libgomp_plugin_nvptx_la_LDFLAGS += $(PLUGIN_NVPTX_LDFLAGS)
-libgomp_plugin_nvptx_la_LIBADD = $(PLUGIN_NVPTX_LIBS)
-libgomp_plugin_nvptx_la_LIBTOOLFLAGS = --tag=disable-static
-endif
-
-libgomp_plugin_host_nonshm_version_info = -version-info $(libtool_VERSION)
-toolexeclib_LTLIBRARIES += libgomp-plugin-host_nonshm.la
-libgomp_plugin_host_nonshm_la_SOURCES = oacc-host.c
-libgomp_plugin_host_nonshm_la_CPPFLAGS = $(AM_CPPFLAGS) -DHOST_NONSHM_PLUGIN
-libgomp_plugin_host_nonshm_la_LDFLAGS = \
-	$(libgomp_plugin_host_nonshm_version_info) $(lt_host_flags)
-libgomp_plugin_host_nonshm_la_LIBTOOLFLAGS = --tag=disable-static
-
 if LIBGOMP_BUILD_VERSIONED_SHLIB
 # -Wc is only a libtool option.
 comma = ,
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index d12376e..ea3e1ca 100644
diff --git a/libgomp/configure b/libgomp/configure
index 7daccd9..11a7ae0 100755
diff --git a/libgomp/configure.ac b/libgomp/configure.ac
index 89c6b31..e883945 100644
--- a/libgomp/configure.ac
+++ b/libgomp/configure.ac
@@ -30,42 +30,6 @@ LIBGOMP_ENABLE(generated-files-in-srcdir, no, ,
 AC_MSG_RESULT($enable_generated_files_in_srcdir)
 AM_CONDITIONAL(GENINSRC, test "$enable_generated_files_in_srcdir" = yes)

-# Look for the CUDA driver package.
-CUDA_DRIVER_INCLUDE=
-CUDA_DRIVER_LIB=
-AC_SUBST(CUDA_DRIVER_INCLUDE)
-AC_SUBST(CUDA_DRIVER_LIB)
-CUDA_DRIVER_CPPFLAGS=
-CUDA_DRIVER_LDFLAGS=
-AC_ARG_WITH(cuda-driver,
-	[AS_HELP_STRING([--with-cuda-driver=PATH],
-		[specify prefix directory for installed CUDA driver package.
-		 Equivalent to --with-cuda-driver-include=PATH/include
-		 plus --with-cuda-driver-lib=PATH/lib])])
-AC_ARG_WITH(cuda-driver-include,
-	[AS_HELP_STRING([--with-cuda-driver-include=PATH],
-		[specify directory for installed CUDA driver include files])])
-AC_ARG_WITH(cuda-driver-lib,
-	[AS_HELP_STRING([--with-cuda-driver-lib=PATH],
-		[specify directory for the installed CUDA driver library])])
-if test "x$with_cuda_driver" != x; then
-  CUDA_DRIVER_INCLUDE=$with_cuda_driver/include
-  CUDA_DRIVER_LIB=$with_cuda_driver/lib
-fi
-if test "x$with_cuda_driver_include" != x; then
-  CUDA_DRIVER_INCLUDE=$with_cuda_driver_include
-fi
-if test "x$with_cuda_driver_lib" != x; then
-  CUDA_DRIVER_LIB=$with_cuda_driver_lib
-fi
-if test "x$CUDA_DRIVER_INCLUDE" != x; then
-  CUDA_DRIVER_CPPFLAGS=-I$CUDA_DRIVER_INCLUDE
-fi
-if test "x$CUDA_DRIVER_LIB" != x; then
-  CUDA_DRIVER_LDFLAGS=-L$CUDA_DRIVER_LIB
-fi
-
-

 # ---
 # ---
@@ -241,52 +205,7 @@ elif test x$enable_accelerator != xno; then
   AC_MSG_ERROR([Can't have support for accelerators without support for plugins])
 fi
Re: [gomp4] Use GOMP_OFFLOAD_ prefix for (OpenACC) plugin hooks
On Wed, 5 Nov 2014 22:02:33 +0300 Ilya Verbin iver...@gmail.com wrote:

> Hi,
>
> On 05 Nov 17:56, Julian Brown wrote:
> > +GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
> > +GOMP_OFFLOAD_get_table (struct mapping_table **table)
>
> FYI, these interfaces may change in the near future.
>
> Currently GOMP_OFFLOAD_get_table returns a joint table for all images
> offloaded to a device. But this doesn't work properly with offloading
> from dlopened libs. Do you plan to support such cases for PTX?
>
> Perhaps it's worth replacing them with a function like
> GOMP_OFFLOAD_load_image, which will offload one image and return a
> target table for this image. In this case there is no need to pass
> host_table to the plugin and return a joint table, since libgomp will
> join host and target tables itself.

I made some changes to table initialisation on the gomp4 branch also --
probably not enough to genuinely support multiple devices, but hopefully
some of the way there. Have you seen those? I haven't considered
dlopened libs though.

> Another question is what to do with multiple devices of the same type.
> Can they have different images? There are 2 options:
>
> 1. GOMP_OFFLOAD_load_image will offload one image to one device and
>    receive a table from it.
>
> or
>
> 2. GOMP_OFFLOAD_register_image will register one image in the plugin
>    for all devices of the same type, and GOMP_OFFLOAD_get_table will
>    return a table for one image and for one device.

Similarly, I added (partial, in the case of OpenMP) support for multiple
devices of the same type on the gomp4 branch.

Thanks,

Julian
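
As an illustration of the GOMP_OFFLOAD_load_image idea discussed above, here is a minimal sketch of the step where libgomp, rather than the plugin, joins the host table with the target table returned for a single image. It is not taken from either patch: the struct and function names (sketch_range, sketch_mapping, sketch_join_tables) are hypothetical, and it assumes the two tables list their entries in the same order.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not libgomp types.  */
struct sketch_range
{
  const char *start;
  const char *end;
};

struct sketch_mapping
{
  struct sketch_range host;
  struct sketch_range target;
};

/* Pair entry N of the host table with entry N of the target table
   returned for a single image.  Returns the number of mappings written
   to OUT, or 0 if the two tables do not have the same length.  */
static size_t
sketch_join_tables (const struct sketch_range *host, size_t n_host,
                    const struct sketch_range *target, size_t n_target,
                    struct sketch_mapping *out)
{
  size_t i;

  if (n_host != n_target)
    return 0;

  for (i = 0; i < n_host; i++)
    {
      out[i].host = host[i];
      out[i].target = target[i];
    }

  return n_host;
}
```

With the join done on the libgomp side, a plugin hook of this shape would only need to hand back the per-image target table, which is what makes per-image (and so dlopen-friendly) loading possible in this scheme.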
Re: [gomp4] Rationalise thread-local variables in libgomp OpenACC support
On Tue, 28 Oct 2014 11:16:19 + Julian Brown jul...@codesourcery.com wrote:

> Hi,
>
> This patch rationalises TLS support by moving all thread-local
> variables into a single structure. Because this meant interfering with
> how per-thread/per-device initialisation was done, I took the
> opportunity to tidy up a couple of other bits along the way.
> Highlights are:

Here's a slightly-updated version of the patch, adjusted for Thomas's
removal of the queue.h list-handling functions. ChangeLog as before.

Thanks,

Julian

commit ab4e9ff7a52e43418d6d2fc5b5e76e0065e130d5
Author: Julian Brown jul...@codesourcery.com
Date:   Mon Oct 27 08:43:07 2014 -0700

    TLS rework

diff --git a/libgomp/env.c b/libgomp/env.c
index 32fb92c..8b22e6f 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -28,6 +28,7 @@
 #include "libgomp.h"
 #include "libgomp_f.h"
 #include "target.h"
+#include "oacc-int.h"
 #include <ctype.h>
 #include <stdlib.h>
 #include <stdio.h>
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index e31573c..1496437 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -50,8 +50,4 @@ extern void GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex);
 extern void GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex);
 extern void GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex);

-/* target.c */
-
-extern void GOMP_PLUGIN_async_unmap_vars (void *ptr);
-
 #endif
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 538aabb..c6a88a2 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -337,4 +337,5 @@ PLUGIN_1.0 {
 	GOMP_PLUGIN_mutex_lock;
 	GOMP_PLUGIN_mutex_unlock;
 	GOMP_PLUGIN_async_unmap_vars;
+	GOMP_PLUGIN_acc_thread;
 };
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 08b6b95..dddfe05 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -29,6 +29,7 @@
 #include "openacc.h"
 #include "libgomp.h"
 #include "target.h"
+#include "oacc-int.h"

 int
 acc_async_test (int async)
@@ -36,13 +37,13 @@ acc_async_test (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);

-  return ACC_dev->openacc.async_test_func (async);
+  return base_dev->openacc.async_test_func (async);
 }

 int
 acc_async_test_all (void)
 {
-  return ACC_dev->openacc.async_test_all_func ();
+  return base_dev->openacc.async_test_all_func ();
 }

 void
@@ -51,22 +52,19 @@ acc_wait (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);

-  ACC_dev->openacc.async_wait_func (async);
-  return;
+  base_dev->openacc.async_wait_func (async);
 }

 void
 acc_wait_async (int async1, int async2)
 {
-  ACC_dev->openacc.async_wait_async_func (async1, async2);
-  return;
+  base_dev->openacc.async_wait_async_func (async1, async2);
 }

 void
 acc_wait_all (void)
 {
-  ACC_dev->openacc.async_wait_all_func ();
-  return;
+  base_dev->openacc.async_wait_all_func ();
 }

 void
@@ -75,6 +73,5 @@ acc_wait_all_async (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);

-  ACC_dev->openacc.async_wait_all_async_func (async);
-  return;
+  base_dev->openacc.async_wait_all_async_func (async);
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index f587325..3daf5b1 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -29,14 +29,15 @@
 #include "config.h"
 #include "libgomp.h"
 #include "target.h"
+#include "oacc-int.h"

 void *
 acc_get_current_cuda_device (void)
 {
   void *p = NULL;

-  if (ACC_dev && ACC_dev->openacc.cuda.get_current_device_func)
-    p = ACC_dev->openacc.cuda.get_current_device_func ();
+  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
+    p = base_dev->openacc.cuda.get_current_device_func ();

   return p;
 }
@@ -46,8 +47,8 @@ acc_get_current_cuda_context (void)
 {
   void *p = NULL;

-  if (ACC_dev && ACC_dev->openacc.cuda.get_current_context_func)
-    p = ACC_dev->openacc.cuda.get_current_context_func ();
+  if (base_dev && base_dev->openacc.cuda.get_current_context_func)
+    p = base_dev->openacc.cuda.get_current_context_func ();

   return p;
 }
@@ -60,8 +61,8 @@ acc_get_cuda_stream (int async)
   if (async < 0)
     return p;

-  if (ACC_dev && ACC_dev->openacc.cuda.get_stream_func)
-    p = ACC_dev->openacc.cuda.get_stream_func (async);
+  if (base_dev && base_dev->openacc.cuda.get_stream_func)
+    p = base_dev->openacc.cuda.get_stream_func (async);

   return p;
 }
@@ -73,9 +74,11 @@ acc_set_cuda_stream (int async, void *stream)

   if (async < 0 || stream == NULL)
     return 0;
+
+  ACC_lazy_initialize ();

-  if (ACC_dev && ACC_dev->openacc.cuda.set_stream_func)
-    s = ACC_dev->openacc.cuda.set_stream_func (async, stream);
+  if (base_dev && base_dev->openacc.cuda.set_stream_func)
+    s = base_dev->openacc.cuda.set_stream_func (async, stream);

   return s;
 }
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index f44ca5e..6fe8f6c 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -35,6 +35,9 @@
 #include "target.h"
 #ifdef HOST_NONSHM_PLUGIN
 #include "libgomp
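
The general shape of "moving all thread-local variables into a single structure" can be sketched as below. This is only an illustration of the technique, not the code from the patch: struct sketch_thread, its fields, and the accessor names are hypothetical, and `__thread` is the GNU C spelling of thread-local storage (C11 uses `_Thread_local`).

```c
#include <stddef.h>

/* Hypothetical per-thread state, gathered in one place.  The field
   names are illustrative and do not come from the patch.  */
struct sketch_thread
{
  void *dev;          /* Device selected for this thread.  */
  void *mapped_data;  /* Innermost active data region, if any.  */
  void *target_tls;   /* Opaque, plugin-private per-thread data.  */
};

/* One thread-local object instead of several separate TLS variables.  */
static __thread struct sketch_thread sketch_tls;

/* Accessor used by the runtime proper.  */
static struct sketch_thread *
sketch_acc_thread (void)
{
  return &sketch_tls;
}

/* Accessor a plugin could use to reach its own slot without knowing
   the layout of the rest of the structure.  */
static void *
sketch_plugin_tls (void)
{
  return sketch_acc_thread ()->target_tls;
}
```

Keeping a single accessor means per-thread initialisation and teardown have one object to manage, which seems to be the motivation given for the rework.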
[gomp4] Rationalise thread-local variables in libgomp OpenACC support
    memmap_t argument to struct gomp_memory_mapping.
    (lookup_dev): Change memmap_t argument to struct target_mem_desc.
    Use list_count not refcount for iterating over mapped elements.
    (acc_malloc): Use base_dev not ACC_dev.
    (acc_free): Update call to lookup_dev.  Use base_dev not ACC_dev.
    (acc_memcpy_to_device, acc_memcpy_from_device): Use base_dev not
    ACC_dev.
    (acc_deviceptr, acc_is_present): Update call to lookup_host.
    (acc_hostptr): Update call to lookup_dev.
    (acc_map_data): Look up thread device instead of using ACC_dev,
    update calls to lookup_host, lookup_dev.  Use data environment in
    device descriptor.
    (acc_unmap_data): Update call to lookup_host.  Remove mapped data
    from data environment not ACC_memmap.
    (present_create_copy): Update call to lookup_host.  Use data
    environment instead of list in ACC_memmap.
    (delete_copyout): Update call to lookup_host.  Look up device in
    current thread info instead of using ACC_dev.
    (update_dev_host): Look up device in current thread info instead of
    using ACC_dev.
    * oacc-parallel.c (oacc-int.h): Include.
    (struct devgeom, devgeom, dump_devaddrs): Remove.
    (select_acc_device): Call ACC_lazy_initialize earlier.
    (GOACC_parallel): Use device for current thread instead of ACC_dev.
    Use memory map from current device.
    (GOACC_data_start): Likewise.  Use thread info block for mapped data.
    (GOACC_data_end): Use thread info block for mapped data.
    (goacc_wait): Use device for current thread instead of ACC_dev.
    (GOACC_update): Likewise.  Formatting fixes.
    * oacc-plugin.c (ACC_plugin_register): Remove.
    (oacc-int.h): Include.
    (GOMP_PLUGIN_acc_thread): New.
    * oacc-plugin.h (target.h): Don't include.
    (ACC_plugin_register): Remove.
    (GOMP_PLUGIN_async_unmap_vars, GOMP_PLUGIN_acc_thread): Add extern
    declarations.
    * plugin-nvptx.c (oacc-plugin.h): Include.
    (current_stream, PTX_dev, PTX_devices): Remove.
    (struct nvptx_thread): New.
    (nvptx_thread): New function.
    (select_stream_for_async): Locate ptx_dev in device-specific TLS
    data instead of using TLS PTX_dev variable.
    (PTX_init): Don't initialize PTX_devices.
    (PTX_open_device): Remove PTX_devices list handling.  Tweak context
    initialization.
    (PTX_close_device): Remove PTX_devices list handling.  Find PTX
    device info via function argument instead of global TLS variable.
    (PTX_get_num_devices): Make callable when backend has not been
    initialized.
    (event_gc): Find PTX device info, current stream via nvptx_thread.
    (event_add, PTX_exec, PTX_host2dev, PTX_dev2host)
    (PTX_async_test_all, PTX_wait_all, PTX_wait_all_async)
    (PTX_get_current_cuda_device, PTX_get_current_cuda_context)
    (PTX_get_cuda_stream, PTX_set_cuda_stream, openacc_close_device)
    (openacc_set_device_num, openacc_register_async_cleanup)
    (openacc_async_set_async): Likewise.
    (openacc_create_thread_data, openacc_destroy_thread_data): New.
    * target.c (oacc-int.h): Include.
    (gomp_fini_device): Split out memory-map freeing into...
    (gomp_free_memmap): ...this new function.
    (gomp_load_plugin_for_device): Initialize
    openacc.create_thread_data_func, openacc.destroy_thread_data_func
    hooks.
    (gomp_find_available_plugins): Initialize one target_device_descr
    per physical device.
    * target.h (oacc-int.h): Don't include.
    (ACC_dispatch_t): Declare here.  Add data_environ, ord fields.
    Update comment for mem_map field.
    (gomp_free_memmap): Add prototype.

commit 898dba8e56827d7dde964e63f53c804c59674e9b
Author: Julian Brown jul...@codesourcery.com
Date:   Mon Oct 27 08:43:07 2014 -0700

    TLS rework

diff --git a/libgomp/env.c b/libgomp/env.c
index 32fb92c..8b22e6f 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -28,6 +28,7 @@
 #include "libgomp.h"
 #include "libgomp_f.h"
 #include "target.h"
+#include "oacc-int.h"
 #include <ctype.h>
 #include <stdlib.h>
 #include <stdio.h>
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index e31573c..1496437 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -50,8 +50,4 @@ extern void GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex);
 extern void GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex);
 extern void GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex);

-/* target.c */
-
-extern void GOMP_PLUGIN_async_unmap_vars (void *ptr);
-
 #endif
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 538aabb..c6a88a2 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -337,4 +337,5 @@ PLUGIN_1.0 {
 	GOMP_PLUGIN_mutex_lock;
 	GOMP_PLUGIN_mutex_unlock;
 	GOMP_PLUGIN_async_unmap_vars;
+	GOMP_PLUGIN_acc_thread;
 };
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 08b6b95..dddfe05 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -29,6 +29,7 @@
 #include "openacc.h"
 #include "libgomp.h"
 #include "target.h"
+#include "oacc-int.h"

 int
 acc_async_test (int async)
@@ -36,13
[gomp4] Remove goacc_parse_device_num
Hi,

This patch removes the goacc_parse_device_num function in libgomp's
env.c since it is redundant with parse_int. I also added some bounds
checking for the device number in oacc-init.c (the behaviour is left as
implementation defined in the OpenACC 2.0 spec, so I chose to raise an
error for an out-of-range device number).

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    libgomp/
    * env.c (goacc_parse_device_num): Remove.
    (initialize_env): Use parse_int instead of goacc_parse_device_num.
    * oacc-init.c (lazy_open): Add bounds check for device number.

commit 1dacb833b33d179553723faecf4b32e89efc69a9
Author: Julian Brown jul...@codesourcery.com
Date:   Tue Oct 28 06:03:47 2014 -0700

    ACC_DEVICE_NUM tweaks

diff --git a/libgomp/env.c b/libgomp/env.c
index 8b22e6f..02bce0c 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -1016,27 +1016,6 @@ parse_affinity (bool ignore)
   return false;
 }

-
-static void
-goacc_parse_device_num (void)
-{
-  const char *env = getenv ("ACC_DEVICE_NUM");
-  int default_num = -1;
-
-  if (env && *env != '\0')
-    {
-      char *end;
-      default_num = strtol (env, &end, 0);
-
-      if (*end || default_num < 0)
-        default_num = 0;
-    }
-  else
-    default_num = 0;
-
-  goacc_device_num = default_num;
-}
-
 static void
 goacc_parse_device_type (void)
 {
@@ -1310,7 +1289,9 @@ initialize_env (void)
   handle_omp_display_env (stacksize, wait_policy);

   /* Look for OpenACC-specific environment variables.  */
-  goacc_parse_device_num ();
+  if (!parse_int ("ACC_DEVICE_NUM", &goacc_device_num, true))
+    goacc_device_num = 0;
+
   goacc_parse_device_type ();

   /* Initialize OpenACC-specific internal state.  */
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 489ac14..24e911b 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -249,6 +249,9 @@ lazy_open (int ord)

   if (ord < 0)
     ord = goacc_device_num;

+  if (ord >= base_dev->get_num_devices_func ())
+    gomp_fatal ("device %u does not exist", ord);
+
   if (!thr)
     thr = goacc_new_thread ();
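
For readers who want to see the two checks in isolation, here is a small self-contained sketch: strict parsing of a device number from a string, followed by a range check against the number of devices. It is not libgomp's parse_int (whose exact behaviour is not shown in this thread); the function names are hypothetical, and unlike the patch, which calls gomp_fatal, the sketch simply reports failure to the caller.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse TEXT as a non-negative decimal int.
   Returns false (leaving *VALUE untouched) on empty input, trailing
   junk, negative values or overflow.  */
static bool
sketch_parse_device_num (const char *text, int *value)
{
  char *end;
  long num;

  if (text == NULL || *text == '\0')
    return false;

  errno = 0;
  num = strtol (text, &end, 10);
  if (errno != 0 || end == text || *end != '\0' || num < 0 || num > INT_MAX)
    return false;

  *value = (int) num;
  return true;
}

/* Hypothetical helper: true if ORD names one of NUM_DEVICES devices,
   numbered from zero.  */
static bool
sketch_device_in_range (int ord, int num_devices)
{
  return ord >= 0 && ord < num_devices;
}
```

A caller would typically fall back to device 0 when parsing fails, as the patched initialize_env does, and treat an out-of-range ordinal as an error at the point where the device is opened.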
[gomp4] Don't put acc_notify_var in thread-local struct
Hi,

This patch moves acc_notify_var out of gomp_task_icv and makes it simply
a global variable instead.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    libgomp/
    * env.c (goacc_notify_var): New.
    (initialize_env): Use above instead of
    gomp_global_icv.acc_notify_var.
    * error.c (gomp_vnotify): Use goacc_notify_var.
    (gomp_notify): Fix formatting.
    * libgomp.h (gomp_task_icv): Remove acc_notify_var field.
    (goacc_notify_var): Add extern declaration.

commit 5b18c3e134779ee562af11702d2ba2c4baa66370
Author: Julian Brown jul...@codesourcery.com
Date:   Tue Oct 28 06:45:41 2014 -0700

    acc_notify_var tweaks

diff --git a/libgomp/env.c b/libgomp/env.c
index 02bce0c..03206dd 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -79,6 +79,7 @@ unsigned long gomp_bind_var_list_len;
 void **gomp_places_list;
 unsigned long gomp_places_list_len;

+int goacc_notify_var;
 int goacc_device_num;
 char* goacc_device_type;

@@ -1196,7 +1197,7 @@ initialize_env (void)
       gomp_global_icv.thread_limit_var
         = thread_limit_var > INT_MAX ? UINT_MAX : thread_limit_var;
     }
-  parse_int ("GCC_ACC_NOTIFY", &gomp_global_icv.acc_notify_var, true);
+  parse_int ("GCC_ACC_NOTIFY", &goacc_notify_var, true);
 #ifndef HAVE_SYNC_BUILTINS
   gomp_mutex_init (&gomp_managed_threads_lock);
 #endif
diff --git a/libgomp/error.c b/libgomp/error.c
index 5f400cc..320b4d2 100644
--- a/libgomp/error.c
+++ b/libgomp/error.c
@@ -76,13 +76,12 @@ gomp_fatal (const char *fmt, ...)
 void
 gomp_vnotify (const char *msg, va_list list)
 {
-  struct gomp_task_icv *icv = gomp_icv (false);
-  if (icv->acc_notify_var)
+  if (goacc_notify_var)
     vfprintf (stderr, msg, list);
 }

 void
-gomp_notify(const char *msg, ...)
+gomp_notify (const char *msg, ...)
 {
   va_list list;

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 8b7327d..206b293 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -238,7 +238,6 @@ struct gomp_task_icv
   bool dyn_var;
   bool nest_var;
   char bind_var;
-  int acc_notify_var;
   /* Internal ICV.  */
   struct target_mem_desc *target_data;
 };
@@ -257,6 +256,7 @@ extern unsigned long gomp_bind_var_list_len;
 extern void **gomp_places_list;
 extern unsigned long gomp_places_list_len;

+extern int goacc_notify_var;
 extern int goacc_device_num;
 extern char* goacc_device_type;

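
The pattern the patch ends up with, a printf-style notification routine gated by a single process-wide flag, can be sketched on its own as follows. The names are hypothetical stand-ins (sketch_notify_var for goacc_notify_var, and so on), and unlike gomp_vnotify the sketch takes the output stream as a parameter and returns the number of characters written, purely so that it is easy to exercise.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical global flag standing in for goacc_notify_var.  A plain
   global is enough when the setting is read once from the environment
   and is the same for every thread.  */
static int sketch_notify_var;

/* Print MSG only when notification is enabled.  Returns the number of
   characters written, or 0 when notification is disabled.  */
static int
sketch_vnotify (FILE *stream, const char *msg, va_list list)
{
  if (!sketch_notify_var)
    return 0;

  return vfprintf (stream, msg, list);
}

static int
sketch_notify (FILE *stream, const char *msg, ...)
{
  va_list list;
  int written;

  va_start (list, msg);
  written = sketch_vnotify (stream, msg, list);
  va_end (list);

  return written;
}
```

Compared with keeping the flag in a per-task structure, the global avoids a lookup of the current task's ICVs on every notification call.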
[gomp4] Remove redundant get_caps hook invocations
Hi,

This patch causes the get_caps hook to be called only once during device
initialisation, and caches the result in the device's capabilities
field.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    libgomp/
    * target.c (gomp_load_plugin_for_device): Only call get_caps once.
    (gomp_find_available_plugins): ...and don't call it again here.

commit 271ee70eec93866e312c7b9363cb0e736b6361d3
Author: Julian Brown jul...@codesourcery.com
Date:   Tue Oct 28 07:14:19 2014 -0700

    Remove redundant get_caps calls.

diff --git a/libgomp/target.c b/libgomp/target.c
index 73a186b..615ba6b 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1036,9 +1036,10 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (device_free);
   DLSYM (device_dev2host);
   DLSYM (device_host2dev);
-  if (device->get_caps_func () & TARGET_CAP_OPENMP_400)
+  device->capabilities = device->get_caps_func ();
+  if (device->capabilities & TARGET_CAP_OPENMP_400)
     DLSYM (device_run);
-  if (device->get_caps_func () & TARGET_CAP_OPENACC_200)
+  if (device->capabilities & TARGET_CAP_OPENACC_200)
     {
       optional_present = optional_total = 0;
       DLSYM_OPT (openacc.exec, openacc_parallel);
@@ -1167,7 +1168,6 @@ gomp_find_available_plugins (void)
       devicep->mem_map.is_initialized = false;
       devicep->type = devicep->get_type_func ();
       devicep->name = devicep->get_name_func ();
-      devicep->capabilities = devicep->get_caps_func ();
       gomp_mutex_init (&devicep->mem_map.lock);
       devicep->ord = i;
       devicep->target_data = NULL;
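
The caching described above amounts to calling the hook once, storing the result in the descriptor, and testing the stored bits from then on. A minimal sketch under assumed names follows: struct sketch_device, the SKETCH_CAP_* bits and the counting hook are all hypothetical, and the two boolean fields stand in for the DLSYM lookups that the real loader performs.

```c
#include <stddef.h>

/* Hypothetical capability bits, for illustration only.  */
#define SKETCH_CAP_OPENMP  (1u << 0)
#define SKETCH_CAP_OPENACC (1u << 1)

struct sketch_device
{
  unsigned int (*get_caps_func) (void);
  unsigned int capabilities;   /* Cached result of get_caps_func.  */
  int has_run_hook;            /* Stand-in for looking up device_run.  */
  int has_openacc_hooks;       /* Stand-in for the OpenACC hook lookups.  */
};

/* Example hook that counts how often it is invoked.  */
static int sketch_get_caps_calls;

static unsigned int
sketch_counting_get_caps (void)
{
  sketch_get_caps_calls++;
  return SKETCH_CAP_OPENACC;
}

/* Query the plugin once and make every later decision from the cached
   value in DEVICE->capabilities.  */
static void
sketch_load_plugin (struct sketch_device *device)
{
  device->capabilities = device->get_caps_func ();
  device->has_run_hook = (device->capabilities & SKETCH_CAP_OPENMP) != 0;
  device->has_openacc_hooks
    = (device->capabilities & SKETCH_CAP_OPENACC) != 0;
}
```

Besides saving calls, this keeps every consumer looking at the same value even if a plugin's hook were to return something different on a later call.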
[gomp4] Remove stray debugging code
Hi,

This patch removes some debugging code leftover from development. It's
probably not helpful to keep it around now.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    libgomp/
    * oacc-host.c (DEBUG): Remove undefine.
    * plugin-nvptx.c (DEBUG, DISABLE_ASYNC): Remove comment-out macro
    definitions.
    * target.c (dump_mappings): Remove debugging function.

commit 13794d26fc95225268e05abf9912ab6eba3c7b3f
Author: Julian Brown jul...@codesourcery.com
Date:   Tue Oct 28 06:49:19 2014 -0700

    Remove stray debugging code

diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 6fe8f6c..fc3e77c 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -45,8 +45,6 @@
 #include <string.h>
 #include <stdio.h>

-#undef DEBUG
-
 #ifdef HOST_NONSHM_PLUGIN
 #define STATIC
 #define GOMP(X) GOMP_PLUGIN_##X
diff --git a/libgomp/plugin-nvptx.c b/libgomp/plugin-nvptx.c
index c5bdf73..8d040fe 100644
--- a/libgomp/plugin-nvptx.c
+++ b/libgomp/plugin-nvptx.c
@@ -30,9 +30,6 @@
    is not clear as to what that state might be.  Or how one might
    propagate it from one thread to another.  */

-//#define DEBUG
-//#define DISABLE_ASYNC
-
 #include "openacc.h"
 #include "config.h"
 #include "libgomp.h"
diff --git a/libgomp/target.c b/libgomp/target.c
index bce8ca6..73a186b 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -110,34 +110,6 @@ resolve_device (int device_id)
   return &devices[device_id];
 }

-__attribute__((used)) static void
-dump_mappings (FILE *f, splay_tree_node node)
-{
-  int i;
-
-  splay_tree_key k = &node->key;
-
-  if (!k)
-    return;
-
-  fprintf (f, "key %p: host_start %p, host_end %p, tgt_offset %p, refcount %d, "
-           "copy_from %s\n", k, (void *) k->host_start,
-           (void *) k->host_end, (void *) k->tgt_offset, (int) k->refcount,
-           k->copy_from ? "true" : "false");
-  fprintf (f, "tgt->refcount %d, tgt->tgt_start %p, tgt->tgt_end %p, "
-           "tgt->to_free %p, tgt->prev %p, tgt->list_count %d, "
-           "tgt->device_descr %p\n", (int) k->tgt->refcount,
-           (void *) k->tgt->tgt_start, (void *) k->tgt->tgt_end,
-           k->tgt->to_free, k->tgt->prev, (int) k->tgt->list_count,
-           k->tgt->device_descr);
-
-  for (i = 0; i < k->tgt->list_count; i++)
-    fprintf (f, "item %d: %p\n", i, k->tgt->list[i]);
-
-  dump_mappings (f, node->left);
-  dump_mappings (f, node->right);
-}
-
 /* Handle the case where splay_tree_lookup found oldn for newn.
    Helper function of gomp_map_vars.  */
[gomp4] Remove gomp_map_vars mem_map argument
Hi,

This patch removes the now-redundant gomp_memory_mapping argument from
gomp_map_vars, introduced when OpenACC kept the structure in question in
a different place from OpenMP. Both now keep the memory map in the
gomp_device_descr structure, so there's no need to pass both that and
the memory map to the function explicitly.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    libgomp/
    * target.c (gomp_map_vars): Remove MM argument.
    (GOMP_target, GOMP_target_data): Update calls to gomp_map_vars.
    * oacc-mem.c (acc_map_data, present_create_copy): Update calls to
    gomp_map_vars.
    * oacc-parallel.c (GOACC_parallel, GOACC_data_start): Likewise.
    * target.h (gomp_map_vars): Update prototype.

commit 3afc4e592a6d8a796ec0c44bb8dc808b1392fd29
Author: Julian Brown jul...@codesourcery.com
Date:   Tue Oct 28 09:17:01 2014 -0700

    Remove gomp_map_vars mem_map argument

diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index d812f72..582a1e0 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -257,7 +257,7 @@ acc_map_data (void *h, void *d, size_t s)
       if (d != h)
         gomp_fatal ("cannot map data on shared-memory system");

-      tgt = gomp_map_vars (NULL, NULL, 0, NULL, NULL, NULL, NULL, true, false);
+      tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, false);
     }
   else
     {
@@ -275,9 +275,8 @@ acc_map_data (void *h, void *d, size_t s)
         gomp_fatal ("device address [%p, +%d] is already mapped", (void *)d,
                     (int)s);

-      tgt = gomp_map_vars ((struct gomp_device_descr *) acc_dev,
-                           &acc_dev->mem_map, mapnum, &hostaddrs,
-                           &devaddrs, &sizes, &kinds, true, false);
+      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, &devaddrs, &sizes,
+                           &kinds, true, false);
     }

   tgt->prev = acc_dev->openacc.data_environ;
@@ -383,9 +382,8 @@ present_create_copy (unsigned f, void *h, size_t s)
       else
         kinds = GOMP_MAP_ALLOC;

-      tgt = gomp_map_vars ((struct gomp_device_descr *) acc_dev,
-                           &acc_dev->mem_map, mapnum, &hostaddrs,
-                           NULL, &s, &kinds, true, false);
+      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
+                           false);

       gomp_mutex_lock (&acc_dev->mem_map.lock);

diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index b787df7..1639244 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -173,9 +173,8 @@ GOACC_parallel (int device, void (*fn) (void *), const void *openmp_target,
   else
     tgt_fn = (void (*)) fn;

-  tgt = gomp_map_vars ((struct gomp_device_descr *) acc_dev,
-                       &acc_dev->mem_map, mapnum, hostaddrs,
-                       NULL, sizes, kinds, true, false);
+  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
+                       false);

   devaddrs = alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
@@ -217,7 +216,7 @@ GOACC_data_start (int device, const void *openmp_target, size_t mapnum,
   if ((acc_dev->capabilities & TARGET_CAP_SHARED_MEM)
       || !if_clause_condition_value)
     {
-      tgt = gomp_map_vars (NULL, NULL, 0, NULL, NULL, NULL, NULL, true, false);
+      tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, false);
       tgt->prev = thr->mapped_data;
       thr->mapped_data = tgt;

@@ -225,9 +224,8 @@ GOACC_data_start (int device, const void *openmp_target, size_t mapnum,
     }

   gomp_notify ("  %s: prepare mappings\n", __FUNCTION__);
-  tgt = gomp_map_vars ((struct gomp_device_descr *) acc_dev,
-                       &acc_dev->mem_map, mapnum, hostaddrs,
-                       NULL, sizes, kinds, true, false);
+  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
+                       false);
   gomp_notify ("  %s: mappings prepared\n", __FUNCTION__);
   tgt->prev = thr->mapped_data;
   thr->mapped_data = tgt;
diff --git a/libgomp/target.c b/libgomp/target.c
index 615ba6b..507488e 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -134,14 +134,14 @@ get_kind (bool is_openacc, void *kinds, int idx)
 }

 attribute_hidden struct target_mem_desc *
-gomp_map_vars (struct gomp_device_descr *devicep,
-               struct gomp_memory_mapping *mm, size_t mapnum,
-               void **hostaddrs, void **devaddrs, size_t *sizes,
-               void *kinds, bool is_openacc, bool is_target)
+gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
+               void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
+               bool is_openacc, bool is_target)
 {
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   const int rshift = is_openacc ? 8 : 3;
   const int typemask = is_openacc ? 0xff : 0x7;
+  struct gomp_memory_mapping *mm = &devicep->mem_map;
   struct splay_tree_key_s cur_node;
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
@@ -861,8 +861,8 @@ GOMP_target (int device, void (*fn) (void *), const void *openmp_target,
   gomp_mutex_unlock (&mm->lock);

   struct target_mem_desc *tgt_vars
-    = gomp_map_vars (devicep, &devicep->mem_map, mapnum, hostaddrs, NULL
Re: [gomp4] Remove gomp_map_vars mem_map argument
On Tue, 28 Oct 2014 16:52:22 + Julian Brown jul...@codesourcery.com wrote:

> Hi,
>
> This patch removes the now-redundant gomp_memory_mapping argument from
> gomp_map_vars, introduced when OpenACC kept the structure in question in
> a different place from OpenMP. Both now keep the memory map in the
> gomp_device_descr structure, so there's no need to pass both that and
> the memory map to the function explicitly.
>
> OK for gomp4 branch?

Forgot to say: this patch and the previous three have been tested with no regressions (alongside a version of Bernd's PTX support patches) on the gomp4 branch (libgomp tests).

Julian
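The shape of the change discussed in this thread can be illustrated in isolation. What follows is only a minimal sketch, not the libgomp code: `struct memory_mapping`, `struct device_descr` and `map_vars` are simplified, hypothetical stand-ins, and the only idea taken from the patch is that the callee derives the memory map from the device descriptor instead of receiving both as separate arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the libgomp structures.  */
struct memory_mapping
{
  size_t n_mapped;		/* Number of entries mapped so far.  */
};

struct device_descr
{
  const char *name;
  struct memory_mapping mem_map;	/* The map lives in the descriptor.  */
};

/* Only the device is passed in; the memory map is looked up from it, so
   callers can no longer pass a map that belongs to a different device.
   Returns the total number of mapped entries.  */
static size_t
map_vars (struct device_descr *devicep, size_t mapnum)
{
  assert (devicep != NULL);
  struct memory_mapping *mm = &devicep->mem_map;

  mm->n_mapped += mapnum;
  return mm->n_mapped;
}
```

A caller would simply write `map_vars (&dev, mapnum)`. The design point is that a single argument leaves no room for the device and its map to disagree.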
[gomp4] Use GOMP_PLUGIN_ not gomp_plugin_ for libgomp plugin API
Hi, As the title says, this patch makes the libgomp plugin API use the GOMP_PLUGIN_ prefix rather than gomp_plugin_. This is purely a mechanical change. OK for the gomp4 branch? Thanks, Julian ChangeLog libgomp/ * libgomp-plugin.c (gomp_plugin_*): Rename to... (GOMP_PLUGIN_*): This. * libgomp-plugin.h: Likewise. * libgomp.map: Likewise. * oacc-host.c (GOMP): Use GOMP_PLUGIN_ in macro expansion. * oacc-plugin.c (gomp_plugin_*): Rename to... (GOMP_PLUGIN_*): This. * plugin-nvptx.c: Likewise.commit cce63ddb8895d3b51a176d68045b7920affc05e5 Author: Julian Brown jul...@codesourcery.com Date: Wed Oct 15 02:05:08 2014 -0700 Use GOMP_PLUGIN_ not gomp_plugin_ for libgomp plugin API. diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c index 46dd7b0..0f72bb9 100644 --- a/libgomp/libgomp-plugin.c +++ b/libgomp/libgomp-plugin.c @@ -31,25 +31,25 @@ #include target.h void * -gomp_plugin_malloc (size_t size) +GOMP_PLUGIN_malloc (size_t size) { return gomp_malloc (size); } void * -gomp_plugin_malloc_cleared (size_t size) +GOMP_PLUGIN_malloc_cleared (size_t size) { return gomp_malloc_cleared (size); } void * -gomp_plugin_realloc (void *ptr, size_t size) +GOMP_PLUGIN_realloc (void *ptr, size_t size) { return gomp_realloc (ptr, size); } void -gomp_plugin_error (const char *msg, ...) +GOMP_PLUGIN_error (const char *msg, ...) { va_list ap; @@ -59,7 +59,7 @@ gomp_plugin_error (const char *msg, ...) } void -gomp_plugin_notify (const char *msg, ...) +GOMP_PLUGIN_notify (const char *msg, ...) { va_list ap; @@ -69,7 +69,7 @@ gomp_plugin_notify (const char *msg, ...) } void -gomp_plugin_fatal (const char *msg, ...) +GOMP_PLUGIN_fatal (const char *msg, ...) { va_list ap; @@ -82,25 +82,25 @@ gomp_plugin_fatal (const char *msg, ...) 
} void -gomp_plugin_mutex_init (gomp_mutex_t *mutex) +GOMP_PLUGIN_mutex_init (gomp_mutex_t *mutex) { gomp_mutex_init (mutex); } void -gomp_plugin_mutex_destroy (gomp_mutex_t *mutex) +GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex) { gomp_mutex_destroy (mutex); } void -gomp_plugin_mutex_lock (gomp_mutex_t *mutex) +GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex) { gomp_mutex_lock (mutex); } void -gomp_plugin_mutex_unlock (gomp_mutex_t *mutex) +GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex) { gomp_mutex_unlock (mutex); } diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h index 0ecb407..e31573c 100644 --- a/libgomp/libgomp-plugin.h +++ b/libgomp/libgomp-plugin.h @@ -31,27 +31,27 @@ /* alloc.c */ -extern void *gomp_plugin_malloc (size_t) __attribute__((malloc)); -extern void *gomp_plugin_malloc_cleared (size_t) __attribute__((malloc)); -extern void *gomp_plugin_realloc (void *, size_t); +extern void *GOMP_PLUGIN_malloc (size_t) __attribute__((malloc)); +extern void *GOMP_PLUGIN_malloc_cleared (size_t) __attribute__((malloc)); +extern void *GOMP_PLUGIN_realloc (void *, size_t); /* error.c */ -extern void gomp_plugin_notify(const char *msg, ...); -extern void gomp_plugin_error (const char *, ...) +extern void GOMP_PLUGIN_notify(const char *msg, ...); +extern void GOMP_PLUGIN_error (const char *, ...) __attribute__((format (printf, 1, 2))); -extern void gomp_plugin_fatal (const char *, ...) +extern void GOMP_PLUGIN_fatal (const char *, ...) 
__attribute__((noreturn, format (printf, 1, 2))); /* mutex.c */ -extern void gomp_plugin_mutex_init (gomp_mutex_t *mutex); -extern void gomp_plugin_mutex_destroy (gomp_mutex_t *mutex); -extern void gomp_plugin_mutex_lock (gomp_mutex_t *mutex); -extern void gomp_plugin_mutex_unlock (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_init (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_destroy (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_lock (gomp_mutex_t *mutex); +extern void GOMP_PLUGIN_mutex_unlock (gomp_mutex_t *mutex); /* target.c */ -extern void gomp_plugin_async_unmap_vars (void *ptr); +extern void GOMP_PLUGIN_async_unmap_vars (void *ptr); #endif diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map index e1e87d9..538aabb 100644 --- a/libgomp/libgomp.map +++ b/libgomp/libgomp.map @@ -326,15 +326,15 @@ GOACC_2.0 { # FIXME: Hygiene/grouping/naming? PLUGIN_1.0 { global: - gomp_plugin_malloc; - gomp_plugin_malloc_cleared; - gomp_plugin_realloc; - gomp_plugin_error; - gomp_plugin_notify; - gomp_plugin_fatal; - gomp_plugin_mutex_init; - gomp_plugin_mutex_destroy; - gomp_plugin_mutex_lock; - gomp_plugin_mutex_unlock; - gomp_plugin_async_unmap_vars; + GOMP_PLUGIN_malloc; + GOMP_PLUGIN_malloc_cleared; + GOMP_PLUGIN_realloc; + GOMP_PLUGIN_error; + GOMP_PLUGIN_notify; + GOMP_PLUGIN_fatal; + GOMP_PLUGIN_mutex_init; + GOMP_PLUGIN_mutex_destroy; + GOMP_PLUGIN_mutex_lock; + GOMP_PLUGIN_mutex_unlock; + GOMP_PLUGIN_async_unmap_vars; }; diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c index 7a50d65..a47617a 100644 --- a/libgomp/oacc-host.c +++ b/libgomp/oacc-host.c
[gomp4] Fix include path configury for gomp-constants.h
Hi,

This patch tweaks the include path configury used by libgomp to find the gomp-constants.h header, as suggested by Jakub.

OK for the gomp4 branch?

Thanks,

Julian

    libgomp/
    * Makefile.am (AM_CPPFLAGS): Fix search path for locating
    gomp-constants.h.
    * Makefile.in: Regenerate.

commit a682a91d68d3ffb1516a1589ef093e00151a6078
Author: Julian Brown jul...@codesourcery.com
Date: Wed Oct 15 02:12:07 2014 -0700

    Fix include path configury for gomp-constants.h.

diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 7ddb0a4..77f71ee 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -14,8 +14,7 @@ libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include

 vpath % $(strip $(search_path))

-AM_CPPFLAGS = $(addprefix -I, $(search_path)) \
-	$(addprefix -I, $(search_path)/../include)
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) -I $(top_srcdir)/../include
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)

diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 4965442..fdd18ff 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -333,9 +333,7 @@ gcc_version := $(shell cat $(top_srcdir)/../gcc/BASE-VER)
 search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir)
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
-AM_CPPFLAGS = $(addprefix -I, $(search_path)) \
-	$(addprefix -I, $(search_path)/../include)
-
+AM_CPPFLAGS = $(addprefix -I, $(search_path)) -I $(top_srcdir)/../include
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
 toolexeclib_LTLIBRARIES = libgomp.la $(am__append_1) \
[gomp4] Asynchronous data unmapping wait fixes for OpenACC
Hi, This patch introduces a new plugin hook in libgomp to register a callback function to clean up host-side bookkeeping data after an asynchronous operation has completed (replacing the previous ad-hoc method used in the NVPTX backend), and adds code to ensure that same cleanup is done reliably in the NVPTX backend when the user program hits a wait directive, or equivalent. OK for the gomp4 branch? Thanks, Julian ChangeLog libgomp/ * oacc-host.c (openacc_register_async_cleanup): New. (host_dispatch): Initialise register_async_cleanup_func entry. * oacc-int.h (struct ACC_dispatch_t): Add register_async_cleanup_func hook. * oacc-parallel.c (GOACC_parallel): Call register_async_cleanup_func hook after queuing asynchronous copy-back. * plugin-nvptx.c (enum PTX_event_type): Add PTX_EVT_ASYNC_CLEANUP. (struct PTX_event): Remove tgt field. (event_gc): Don't do async cleanup in PTX_EVT_KNL, do it in PTX_EVT_ASYNC_CLEANUP instead. (event_add): Remove tgt argument. Support PTX_EVT_ASYNC_CLEANUP events. (PTX_exec, PTX_host2dev, PTX_dev2host, PTX_wait_async) (PTX_wait_all_async): Update calls to event_add. (openacc_register_async_cleanup): New. (PTX_async_test): Call event_gc on success path. (PTX_async_test_all): Likewise. * target.c (gomp_load_plugin_for_device): Initialise register_async_cleanup hook. commit 78d6b16bf258106282f791f2e7b3010bf75f2a86 Author: Julian Brown jul...@codesourcery.com Date: Wed Oct 15 02:10:00 2014 -0700 Async fixes/improvements. diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c index a47617a..f44ca5e 100644 --- a/libgomp/oacc-host.c +++ b/libgomp/oacc-host.c @@ -294,6 +294,16 @@ openacc_parallel (void (*fn) (void *), size_t mapnum __attribute__((unused)), } STATIC void +openacc_register_async_cleanup (void *targ_mem_desc) +{ +#ifdef HOST_NONSHM_PLUGIN + /* Asynchronous launches are executed synchronously on the (non-SHM) host, + so there's no point in delaying host-side cleanup -- just do it now. 
*/ + GOMP_PLUGIN_async_unmap_vars (targ_mem_desc); +#endif +} + +STATIC void openacc_async_set_async (int async __attribute__((unused))) { #ifdef DEBUG @@ -397,6 +407,8 @@ static struct gomp_device_descr host_dispatch = .exec_func = openacc_parallel, + .register_async_cleanup_func = openacc_register_async_cleanup, + .async_set_async_func = openacc_async_set_async, .async_test_func = openacc_async_test, .async_test_all_func = openacc_async_test_all, diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h index e1d2e32..03529cc 100644 --- a/libgomp/oacc-int.h +++ b/libgomp/oacc-int.h @@ -64,6 +64,9 @@ typedef struct ACC_dispatch_t void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *, unsigned short *, int, int, int, int, void *); + /* async cleanup callback registration */ + void (*register_async_cleanup_func) (void *); + /* asynchronous routines */ int (*async_test_func) (int); int (*async_test_all_func) (void); diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c index 57ac8de..e3f156c 100644 --- a/libgomp/oacc-parallel.c +++ b/libgomp/oacc-parallel.c @@ -213,7 +213,10 @@ GOACC_parallel (int device, void (*fn) (void *), const void *openmp_target, if (async acc_async_noval) gomp_unmap_vars (tgt, true); else -gomp_copy_from_async (tgt); +{ + gomp_copy_from_async (tgt); + ACC_dev-openacc.register_async_cleanup_func (tgt); +} ACC_dev-openacc.async_set_async_func (acc_async_sync); } diff --git a/libgomp/plugin-nvptx.c b/libgomp/plugin-nvptx.c index e163f3a..f193229 100644 --- a/libgomp/plugin-nvptx.c +++ b/libgomp/plugin-nvptx.c @@ -317,7 +317,8 @@ enum PTX_event_type { PTX_EVT_MEM, PTX_EVT_KNL, - PTX_EVT_SYNC + PTX_EVT_SYNC, + PTX_EVT_ASYNC_CLEANUP }; struct PTX_event @@ -325,7 +326,6 @@ struct PTX_event CUevent *evt; int type; void *addr; - void *tgt; int ord; SLIST_ENTRY(PTX_event) next; }; @@ -946,6 +946,10 @@ event_gc (bool memmap_lockable) break; case PTX_EVT_KNL: + map_pop (ptx_event-addr); + break; + + case PTX_EVT_ASYNC_CLEANUP: { /* 
The function GOMP_PLUGIN_async_unmap_vars needs to claim the memory-map splay tree lock for the current device, so we @@ -955,9 +959,7 @@ event_gc (bool memmap_lockable) if (!memmap_lockable) goto next_event; - map_pop (ptx_event-addr); - if (ptx_event-tgt) - GOMP_PLUGIN_async_unmap_vars (ptx_event-tgt); + GOMP_PLUGIN_async_unmap_vars (ptx_event-addr); } break; } @@ -978,17 +980,17 @@ event_gc (bool memmap_lockable) } static void -event_add (enum PTX_event_type type, CUevent *e, void *h, void *tgt) +event_add (enum PTX_event_type type, CUevent *e, void *h) { struct PTX_event *ptx_event; - assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type
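The idea behind the new hook (queue a dedicated cleanup event, and run the host-side unmapping only once the event has completed and the memory-map lock can safely be taken) can be sketched outside libgomp. The following is a hedged illustration rather than the NVPTX plugin code: the fixed-size event table, the `done` flag (standing in for a real completion query on a CUDA event), `complete_all` and the counter `unmapped_count` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Event kinds, loosely modelled on the plugin's enum (names hypothetical).  */
enum event_type { EVT_MEM, EVT_KNL, EVT_SYNC, EVT_ASYNC_CLEANUP };

struct event
{
  enum event_type type;
  void *addr;			/* Payload; the mapping descriptor for cleanups.  */
  bool live;			/* Slot is in use.  */
  bool done;			/* Stand-in for a device completion query.  */
};

#define MAX_EVENTS 16
static struct event events[MAX_EVENTS];
static int unmapped_count;	/* Counts cleanups performed, for illustration.  */

/* Stand-in for the host-side bookkeeping cleanup.  */
static void
async_unmap_vars (void *tgt)
{
  assert (tgt != NULL);
  unmapped_count++;
}

/* Queue an event; returns false if the table is full.  */
static bool
event_add (enum event_type type, void *addr)
{
  for (size_t i = 0; i < MAX_EVENTS; i++)
    if (!events[i].live)
      {
	events[i] = (struct event) { type, addr, true, false };
	return true;
      }
  return false;
}

/* The hook: remember that TGT needs cleaning up once the queue drains.  */
static bool
register_async_cleanup (void *tgt)
{
  return event_add (EVT_ASYNC_CLEANUP, tgt);
}

/* Pretend the device has finished all outstanding work.  */
static void
complete_all (void)
{
  for (size_t i = 0; i < MAX_EVENTS; i++)
    if (events[i].live)
      events[i].done = true;
}

/* Retire completed events.  Cleanup events are left queued when the caller
   cannot take the memory-map lock, and are retried on a later call.  */
static void
event_gc (bool memmap_lockable)
{
  for (size_t i = 0; i < MAX_EVENTS; i++)
    {
      struct event *e = &events[i];

      if (!e->live || !e->done)
	continue;
      if (e->type == EVT_ASYNC_CLEANUP)
	{
	  if (!memmap_lockable)
	    continue;
	  async_unmap_vars (e->addr);
	}
      e->live = false;
    }
}
```

In this sketch, a wait operation would correspond to `complete_all ()` followed by `event_gc (true)`, which is the point at which deferred cleanups are guaranteed to run.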
[gomp] [3/3] OpenACC 2.0 support for libgomp - documentation
This is a version of the patch: https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02024.html against gomp4 branch instead of mainline. OK to apply? Thanks, Julian -xx-xx Thomas Schwinge tho...@codesourcery.com James Norris jnor...@codesourcery.com libgomp/ * libgomp.texi: Outline documentation for OpenACC. From c58006a7ade2a9556bd73bac9ef45b3bbd62ca37 Mon Sep 17 00:00:00 2001 From: Julian Brown jul...@codesourcery.com Date: Wed, 17 Sep 2014 10:26:56 -0700 Subject: [PATCH 2/3] OpenACC documentation --- libgomp/libgomp.texi | 661 -- 1 file changed, 636 insertions(+), 25 deletions(-) diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index 254be57..9530a2b 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -31,10 +31,12 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) @ifinfo @dircategory GNU Libraries @direntry -* libgomp: (libgomp).GNU OpenMP runtime library +* libgomp: (libgomp).GNU OpenACC and OpenMP runtime library @end direntry -This manual documents the GNU implementation of the OpenMP API for +This manual documents the GNU implementation of the OpenACC API for +offloading of code to accelerator devices in C/C++ and Fortran and +the GNU implementation of the OpenMP API for multi-platform shared-memory parallel programming in C/C++ and Fortran. 
Published by the Free Software Foundation @@ -48,7 +50,7 @@ Boston, MA 02110-1301 USA @setchapternewpage odd @titlepage -@title The GNU OpenMP Implementation +@title The GNU OpenACC and OpenMP Implementation @page @vskip 0pt plus 1filll @comment For the @value{version-GCC} Version* @@ -69,7 +71,10 @@ Boston, MA 02110-1301, USA@* @top Introduction @cindex Introduction -This manual documents the usage of libgomp, the GNU implementation of the +This manual documents the usage of libgomp, the GNU implementation of the +@uref{http://www.openacc.org/, OpenACC} Application Programming Interface (API) +for offloading of code to accelerator devices in C/C++ and Fortran, and +the GNU implementation of the @uref{http://www.openmp.org, OpenMP} Application Programming Interface (API) for multi-platform shared-memory parallel programming in C/C++ and Fortran. @@ -81,23 +86,619 @@ for multi-platform shared-memory parallel programming in C/C++ and Fortran. @comment better formatting. @comment @menu -* Enabling OpenMP::How to enable OpenMP for your applications. -* Runtime Library Routines:: The OpenMP runtime application programming - interface. -* Environment Variables:: Influencing runtime behavior with environment - variables. -* The libgomp ABI::Notes on the external ABI presented by libgomp. -* Reporting Bugs:: How to report bugs in GNU OpenMP. -* Copying::GNU general public license says - how you can copy and share libgomp. -* GNU Free Documentation License:: - How you can copy and share this manual. -* Funding::How to help assure continued work for free - software. -* Library Index:: Index of this documentation. +* Enabling OpenACC:: How to enable OpenACC for your + applications. +* OpenACC Runtime Library Routines:: The OpenACC runtime application + programming interface. +* OpenACC Environment Variables::Influencing OpenACC runtime behavior with + environment variables. 
+* OpenACC Library Interoperability:: OpenACC library interoperability with the + NVIDIA CUBLAS library. +* Enabling OpenMP:: How to enable OpenMP for your + applications. +* OpenMP Runtime Library Routines: Runtime Library Routines. + The OpenMP runtime application programming + interface. +* OpenMP Environment Variables: Environment Variables. + Influencing OpenMP runtime behavior with + environment variables. +* The libgomp ABI:: Notes on the external libgomp ABI. +* Reporting Bugs:: How to report bugs. +* Copying:: GNU general public license says how you + can copy and share libgomp. +* GNU Free Documentation License:: How you can copy and share this manual. +* Funding:: How to help assure continued work for free + software. +* Library Index::Index of this documentation. @end menu + +@c
Re: [PATCH 0/10] OpenACC 2.0 support for libgomp
Hi,

On Wed, 24 Sep 2014 14:32:31 +0200 Jakub Jelinek ja...@redhat.com wrote:

> On Tue, Sep 23, 2014 at 07:17:25PM +0100, Julian Brown wrote:
> > The upcoming patch series constitutes our current (still in-progress)
> > implementation of run-time support for OpenACC 2.0 in libgomp. We've
> > tried to build on top of the (also currently WIP) support for OpenMP
> > 4.0's target construct, sharing code where possible: because of this,
> > I've also prepared versions of (a fairly minimal, hopefully correct set
> > of) prerequisite patches that apply to current mainline (and were
> > previously on the gomp 4.0 branch), although in many cases we weren't
> > the original authors of those.
> >
> > Other parts of the OpenACC support for GCC are being sent upstream
> > concurrently with this runtime support (and are co-dependent with it),
> > so unfortunately, though the main part of the implementation (part
> > 7/10) works on our internal branch, I haven't yet been able to
> > convincingly test the series I'm about to post upstream. However this
> > code will be useful to others who are posting their bits of OpenACC
> > support upstream, so perhaps it'd be useful to commit it anyway (we
> > have to start somewhere!).
>
> Just random comments about all the 10 patches:

Thanks for your comments -- I'm planning to address the things you've brought up, but will probably change tack a little and do that work on the gomp-4_0-branch (rather than working directly on mainline). That way I can (hopefully) send incremental patches rather than working entirely locally then sending another over-sized patch.

> Cache the return value? Also, I must say I'm not particularly excited
> about different plugins not supporting both OpenMP 4.0 and OpenACC 2.0
> offloading. Why is that needed?

For now, because OpenACC supports some stuff that (AFAIK!) OpenMP doesn't, such as asynchronous execution. The eventual plan is for the plugin interface to be generic, but we're not there yet.

> > + /* Make sure all the CUDA functions are there if any of them are. */
> > + if (optional_present && optional_present != optional_total)
> > + {
> > + err = "plugin missing OpenACC CUDA handler function";
> > + goto out;
> > + }
>
> So, any plugin that doesn't support CUDA will not support OpenACC? I
> hoped OpenACC would not be so tied to one particular HW...

The intention was for that section to allow zero CUDA handling functions, or all of them. For better or worse, OpenACC defines a few APIs which are target-specific (for NVidia, AMD, Intel so far, IIRC). An OpenACC application doesn't have to use any of those, of course.

> that is not how ChangeLog entries should look like, if a line is not
> starting with ( after the tab, it should not contain extra spaces after
> the tab, so move Use these. and hack. (and in other spots) two columns
> to the left.

That was merely a copy/paste error of some sort, apologies.

Thanks,

Julian
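The "zero or all" rule for the optional CUDA handlers can be checked in a few lines. This is a sketch under stated assumptions: `handler_fn`, `dummy_handler` and `check_optional_handlers` are hypothetical names invented for the illustration, and only the error string and the present/total comparison come from the code quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type; the real plugin hooks have varied signatures.  */
typedef void (*handler_fn) (void);

/* Placeholder used to stand for "this optional hook was found".  */
static void
dummy_handler (void)
{
}

/* Accept a plugin that provides none of the optional handlers, or all of
   them.  Returns NULL on success, or an error string if only some of the
   handlers are present.  */
static const char *
check_optional_handlers (const handler_fn *handlers, size_t total)
{
  size_t present = 0;

  for (size_t i = 0; i < total; i++)
    if (handlers[i] != NULL)
      present++;

  /* Partial support is the only rejected case.  */
  if (present && present != total)
    return "plugin missing OpenACC CUDA handler function";

  return NULL;
}
```

Under this rule a plugin for non-CUDA hardware simply leaves every optional slot empty and still loads, which matches the intention described in the reply.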
[PATCH 1/10] OpenACC 2.0 support for libgomp - offloading support
This patch is by Jakub Jelinek, and was originally posted here: https://gcc.gnu.org/ml/gcc-patches/2013-09/msg01098.html Parts of the patch subsequently landed on mainline as part of the following patch: https://gcc.gnu.org/ml/gcc-patches/2013-10/msg00505.html But not the OpenMP target parts. This patch therefore contains the delta between those two patches. Julian -xx-xx Jakub Jelinek ja...@redhat.com libgomp/ * splay-tree.h: New file. * target.c (splay_tree_node, splay_tree, splay_tree_key): New typedefs. (struct target_mem_desc, struct splay_tree_key_s): New structures. (splay_compare): New inline function. * libgomp.h (gomp_get_num_devices): Add prototype. (gomp_get_num_devices): Add FIXME comment. (resolve_device): Use default_device_var ICV. Add temporarily magic testing device number 257. (dev_splay_tree, dev_env_lock): New variables. (gomp_map_vars_existing, gomp_map_vars, gomp_unmap_tgt, gomp_unmap_vars, gomp_update): New functions. (GOMP_target, GOMP_target_data, GOMP_target_end_data, GOMP_target_update): Add support for magic testing device number 257. commit fc39aa98eba906466226c17fb455e57ebcfc1bc6 Author: Julian Brown jul...@codesourcery.com Date: Fri Sep 19 08:33:05 2014 -0700 Delta between upstream and gomp-4_0-branch version of r202620: 2013-09-16 Jakub Jelinek ja...@redhat.com * splay-tree.h: New file. * target.c: Include stdbool.h. (splay_tree_node, splay_tree, splay_tree_key): New typedefs. (struct target_mem_desc, struct splay_tree_key_s): New structures. (splay_compare): New inline function. (gomp_get_num_devices): New function. (resolve_device): Use default_device_var ICV. Add temporarily magic testing device number 257. (dev_splay_tree, dev_env_lock): New variables. (gomp_map_vars_existing, gomp_map_vars, gomp_unmap_tgt, gomp_unmap_vars, gomp_update): New functions. (GOMP_target, GOMP_target_data, GOMP_target_end_data, GOMP_target_update): Add support for magic testing device number 257. * libgomp.h (struct target_mem_desc): Forward declare. 
(struct gomp_task_icv): Add default_device_var and target_data. (gomp_get_num_devices): New prototype. * env.c (gomp_global_icv): Add default_device_var initializer. (parse_int): New function. (handle_omp_display_env): Print OMP_DEFAULT_DEVICE. (initialize_env): Initialize default_device_var. (omp_set_default_device): Set default_device_var ICV. (omp_get_default_device): Query default_device_var ICV. (omp_get_num_devices): Call gomp_get_num_devices. (omp_get_num_teams, omp_get_team_num, omp_is_initial_device): Add comments. diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a1482cc..d53a326 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -608,6 +608,10 @@ extern void gomp_free_thread (void *); extern int gomp_get_num_devices (void); +/* target.c */ + +extern int gomp_get_num_devices (void); + /* work.c */ extern void gomp_init_work_share (struct gomp_work_share *, bool, unsigned); diff --git a/libgomp/splay-tree.h b/libgomp/splay-tree.h new file mode 100644 index 000..04a71d1 --- /dev/null +++ b/libgomp/splay-tree.h @@ -0,0 +1,232 @@ +/* A splay-tree datatype. + Copyright 1998-2013 + Free Software Foundation, Inc. + Contributed by Mark Mitchell (m...@markmitchell.com). + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. 
+ + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + http://www.gnu.org/licenses/. */ + +/* The splay tree code copied from include/splay-tree.h and adjusted, + so that all the data lives directly in splay_tree_node_s structure + and no extra allocations are needed. + + Files including this header should before including it add: +typedef struct splay_tree_node_s *splay_tree_node; +typedef struct splay_tree_s
[PATCH 0/10] OpenACC 2.0 support for libgomp
Hi,

The upcoming patch series constitutes our current (still in-progress) implementation of run-time support for OpenACC 2.0 in libgomp. We've tried to build on top of the (also currently WIP) support for OpenMP 4.0's target construct, sharing code where possible: because of this, I've also prepared versions of (a fairly minimal, hopefully correct set of) prerequisite patches that apply to current mainline (and were previously on the gomp 4.0 branch), although in many cases we weren't the original authors of those.

Other parts of the OpenACC support for GCC are being sent upstream concurrently with this runtime support (and are co-dependent with it), so unfortunately, though the main part of the implementation (part 7/10) works on our internal branch, I haven't yet been able to convincingly test the series I'm about to post upstream. However this code will be useful to others who are posting their bits of OpenACC support upstream, so perhaps it'd be useful to commit it anyway (we have to start somewhere!).

I've tried to retain proper attribution for all the forthcoming patches, but I may have made mistakes. Please let me know if so!

Thanks,

Julian
[PATCH 2/10] OpenACC 2.0 support for libgomp - initial plugin support
This patch is by Michael Zolotukhin and was originally posted here: https://gcc.gnu.org/ml/gcc-patches/2013-09/msg01469.html It contains an initial implementation of plugin support for libgomp, for implementing different hardware devices for pieces of accelerated code to be offloaded to. I also merged a minor follow-up fix by Thomas Schwinge. Julian -xx-xx Michael Zolotukhin michael.v.zolotuk...@intel.com Thomas Schwinge tho...@codesourcery.com * configure.ac: Add checks for plugins support. * config.h.in: Regenerated. * configure: Regenerated. * target.c (struct target_mem_desc): Add device_descr field. (devices): New. (num_devices): New. (struct gomp_device_descr): New. (gomp_get_num_devices): Call gomp_target_init. (resolve_device): Return device_descr instead of int. (gomp_map_vars): Add devicep argument and update the function accordingly. (gomp_unmap_tgt): Likewise. (gomp_unmap_vars): Likewise. (gomp_update): Likewise. (GOMP_target): Use device_descr struct. (GOMP_target_data): Likewise. (GOMP_target_update): Likewise. (gomp_check_plugin_file_name): New. (gomp_load_plugin_for_device): New. (gomp_find_available_plugins): New. (gomp_target_init): New. commit 75ef137a74cbd6af36a75b30edf60350ec9eae0d Author: Julian Brown jul...@codesourcery.com Date: Fri Sep 19 08:51:44 2014 -0700 Merge of r202827. -xx-xx Michael Zolotukhin michael.v.zolotuk...@intel.com Thomas Schwinge tho...@codesourcery.com * configure.ac: Add checks for plugins support. * config.h.in: Regenerated. * configure: Regenerated. * target.c (struct target_mem_desc): Add device_descr field. (devices): New. (num_devices): New. (struct gomp_device_descr): New. (gomp_get_num_devices): Call gomp_target_init. (resolve_device): Return device_descr instead of int. (gomp_map_vars): Add devicep argument and update the function accordingly. (gomp_unmap_tgt): Likewise. (gomp_unmap_vars): Likewise. (gomp_update): Likewise. (GOMP_target): Use device_descr struct. (GOMP_target_data): Likewise. 
(GOMP_target_update): Likewise. (gomp_check_plugin_file_name): New. (gomp_load_plugin_for_device): New. (gomp_find_available_plugins): New. (gomp_target_init): New. diff --git a/libgomp/config.h.in b/libgomp/config.h.in index 14c7e2a..67f5420 100644 --- a/libgomp/config.h.in +++ b/libgomp/config.h.in @@ -30,6 +30,9 @@ /* Define to 1 if you have the inttypes.h header file. */ #undef HAVE_INTTYPES_H +/* Define to 1 if you have the `dl' library (-ldl). */ +#undef HAVE_LIBDL + /* Define to 1 if you have the memory.h header file. */ #undef HAVE_MEMORY_H @@ -107,6 +110,9 @@ /* Define to the version of this package. */ #undef PACKAGE_VERSION +/* Define if all infrastructure, needed for plugins, is supported. */ +#undef PLUGIN_SUPPORT + /* The size of `char', as computed by sizeof. */ #undef SIZEOF_CHAR diff --git a/libgomp/configure b/libgomp/configure index 766eb09..704f22a 100755 --- a/libgomp/configure +++ b/libgomp/configure @@ -15052,6 +15052,69 @@ fi rm -f core conftest.err conftest.$ac_objext \ conftest$ac_exeext conftest.$ac_ext +plugin_support=yes +{ $as_echo $as_me:${as_lineno-$LINENO}: checking for dlsym in -ldl 5 +$as_echo_n checking for dlsym in -ldl... 6; } +if test ${ac_cv_lib_dl_dlsym+set} = set; then : + $as_echo_n (cached) 6 +else + ac_check_lib_save_LIBS=$LIBS +LIBS=-ldl $LIBS +cat confdefs.h - _ACEOF conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. 
*/ +#ifdef __cplusplus +extern C +#endif +char dlsym (); +int +main () +{ +return dlsym (); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link $LINENO; then : + ac_cv_lib_dl_dlsym=yes +else + ac_cv_lib_dl_dlsym=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +LIBS=$ac_check_lib_save_LIBS +fi +{ $as_echo $as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_dl_dlsym 5 +$as_echo $ac_cv_lib_dl_dlsym 6; } +if test x$ac_cv_lib_dl_dlsym = xyes; then : + cat confdefs.h _ACEOF +#define HAVE_LIBDL 1 +_ACEOF + + LIBS=-ldl $LIBS + +else + plugin_support=no +fi + +ac_fn_c_check_header_mongrel $LINENO dirent.h ac_cv_header_dirent_h $ac_includes_default +if test x$ac_cv_header_dirent_h = xyes; then : + +else + plugin_support=no +fi + + + +if test x$plugin_support = xyes; then + +$as_echo #define PLUGIN_SUPPORT 1 confdefs.h + +fi + # Check for functions needed. for ac_func in getloadavg clock_gettime strtoull do : diff --git a/libgomp
[PATCH 4/10] OpenACC 2.0 support for libgomp - host plugin
This patch was originally by Thomas Schwinge and was posted here:

https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01172.html

It implements a plugin for host execution that can be used for testing
non-shared-memory semantics on a virtual target device. It's merged
with a minor follow-up patch, also by Thomas.

Julian

-xx-xx  Thomas Schwinge  tho...@codesourcery.com
        James Norris  jnor...@codesourcery.com

    * plugin-host.c: New file.
    * target.c (struct gomp_device_descr): Add device_alloc_func,
    device_free_func, device_dev2host_func, device_host2dev_func
    members.
    (gomp_load_plugin_for_device): Load these.
    (gomp_map_vars, gomp_unmap_tgt, gomp_unmap_vars, gomp_update): Use
    these.
    (resolve_device, gomp_find_available_plugins): Remove ID 257 hack.

commit 1adb683c08079789d013713751a15803b26f11c2
Author: Julian Brown jul...@codesourcery.com
Date:   Fri Sep 19 09:07:08 2014 -0700

    Merge r207938.

    2014-02-20  Thomas Schwinge  tho...@codesourcery.com
                James Norris  jnor...@codesourcery.com

    * plugin-host.c: New file.
    * target.c (struct gomp_device_descr): Add device_alloc_func,
    device_free_func, device_dev2host_func, device_host2dev_func
    members.
    (gomp_load_plugin_for_device): Load these.
    (gomp_map_vars, gomp_unmap_tgt, gomp_unmap_vars, gomp_update): Use
    these.
    (resolve_device, gomp_find_available_plugins): Remove ID 257 hack.

    Merge r207940.

    2014-02-20  Thomas Schwinge  tho...@codesourcery.com

    * target.c (gomp_load_plugin_for_device): Don't call dlclose if
    dlopen failed.

diff --git a/libgomp/plugin-host.c b/libgomp/plugin-host.c
new file mode 100644
index 000..5354ebe
--- /dev/null
+++ b/libgomp/plugin-host.c
@@ -0,0 +1,84 @@
+/* Plugin for non-shared memory host execution.
+
+   Copyright (C) 2014 Free Software Foundation, Inc.
+
+   Contributed by Thomas Schwinge <tho...@codesourcery.com>.
+
+   This file is part of the GNU OpenMP Library (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Simple implementation of a libgomp plugin for non-shared memory host
+   execution.  */
+
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+bool
+device_available (void)
+{
+#ifdef DEBUG
+  printf ("libgomp plugin: %s:%s\n", __FILE__, __FUNCTION__);
+#endif
+
+  return true;
+}
+
+void *
+device_alloc (size_t size)
+{
+  void *ptr = malloc (size);
+
+#ifdef DEBUG
+  printf ("libgomp plugin: %s:%s (%zd): %p\n", __FILE__, __FUNCTION__, size, ptr);
+#endif
+
+  return ptr;
+}
+
+void
+device_free (void *ptr)
+{
+#ifdef DEBUG
+  printf ("libgomp plugin: %s:%s (%p)\n", __FILE__, __FUNCTION__, ptr);
+#endif
+
+  free (ptr);
+}
+
+void *device_dev2host (void *dest, const void *src, size_t n)
+{
+#ifdef DEBUG
+  printf ("libgomp plugin: %s:%s (%p, %p, %zd)\n", __FILE__, __FUNCTION__, dest, src, n);
+#endif
+
+  return memcpy (dest, src, n);
+}
+
+void *device_host2dev (void *dest, const void *src, size_t n)
+{
+#ifdef DEBUG
+  printf ("libgomp plugin: %s:%s (%p, %p, %zd)\n", __FILE__, __FUNCTION__, dest, src, n);
+#endif
+
+  return memcpy (dest, src, n);
+}
diff --git a/libgomp/target.c b/libgomp/target.c
index f1e776b..d0db4c2 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -122,6 +122,10 @@ struct gomp_device_descr
 
   /* Function handlers.  */
   bool (*device_available_func) (void);
+  void *(*device_alloc_func) (size_t);
+  void (*device_free_func) (void *);
+  void *(*device_dev2host_func)(void *, const void *, size_t);
+  void *(*device_host2dev_func)(void *, const void *, size_t);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s dev_splay_tree;
@@ -145,14 +149,10 @@ resolve_device (int device_id)
       struct gomp_task_icv *icv = gomp_icv (false);
       device_id = icv->default_device_var;
     }
-  if (device_id >= gomp_get_num_devices ()
-      && device_id != 257)
+  if (device_id < 0
+      || device_id >= gomp_get_num_devices
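
For readers skimming the diff, the way the four new hooks fit together can be sketched roughly as below. This is only an illustration: struct device_hooks, host_hooks and round_trip are hypothetical names, not part of libgomp, and the real runtime looks the hooks up in the loaded plugin rather than initialising them statically.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the four hooks the patch adds to
   struct gomp_device_descr; this is not the libgomp definition.  */
struct device_hooks
{
  void *(*alloc) (size_t);
  void (*free) (void *);
  void *(*dev2host) (void *, const void *, size_t);
  void *(*host2dev) (void *, const void *, size_t);
};

/* Host "device": device memory is just ordinary heap memory, which is
   what the host plugin above does with malloc, free and memcpy.  */
static const struct device_hooks host_hooks = { malloc, free, memcpy, memcpy };

/* Copy N bytes from IN to a freshly allocated "device" buffer and back
   into OUT.  Returns 0 on success, -1 if the allocation failed.  */
static int
round_trip (const struct device_hooks *dev, const void *in, void *out,
            size_t n)
{
  void *devptr = dev->alloc (n);

  if (devptr == NULL)
    return -1;
  dev->host2dev (devptr, in, n);
  dev->dev2host (out, devptr, n);
  dev->free (devptr);
  return 0;
}
```

Because every transfer goes through an explicit copy, code that wrongly assumes shared memory between host and device should misbehave here just as it would on a real accelerator, which is the stated purpose of the plugin.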
[PATCH 3/10] OpenACC 2.0 support for libgomp - Don't update copy_from for existing mappings
This patch is by Ilya Verbin and was originally posted here:

https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01011.html

This is a fix for OpenMP semantics re: mapping of memory for a target
device.

Julian

-xx-xx  Ilya Verbin  ilya.ver...@intel.com

    * target.c (gomp_map_vars_existing): Don't update copy_from for the
    existing mappings.

commit 76da6cdeb61190c6b39f02656a91a24e26bc3006
Author: Julian Brown jul...@codesourcery.com
Date:   Fri Sep 19 09:03:49 2014 -0700

    Merge r207897.

    2014-02-17  Ilya Verbin  ilya.ver...@intel.com

    * target.c (gomp_map_vars_existing): Don't update copy_from for the
    existing mappings.

diff --git a/libgomp/target.c b/libgomp/target.c
index 55b3781..f1e776b 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -170,11 +170,6 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn,
 		"[%p..%p) is already mapped",
 		(void *) newn->host_start, (void *) newn->host_end,
 		(void *) oldn->host_start, (void *) oldn->host_end);
-  if (((kind & 7) == 2 || (kind & 7) == 3)
-      && !oldn->copy_from
-      && oldn->host_start == newn->host_start
-      && oldn->host_end == newn->host_end)
-    oldn->copy_from = true;
   oldn->refcount++;
 }
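
The behaviour after this change can be sketched as follows. This is a simplified illustration, not the libgomp code: struct mapping and reuse_existing are hypothetical names, and the real function calls gomp_fatal instead of returning false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down mapping record; not the libgomp splay_tree_key.  */
struct mapping
{
  uintptr_t host_start;
  uintptr_t host_end;
  bool copy_from;
  unsigned refcount;
};

/* Reuse OLDN for a request covering [START, END).  Returns false when
   the request is not contained in OLDN (libgomp reports a fatal error
   in that case).  */
static bool
reuse_existing (struct mapping *oldn, uintptr_t start, uintptr_t end)
{
  if (oldn->host_start > start || oldn->host_end < end)
    return false;

  /* After the patch, copy_from is left exactly as the first mapping
     set it; only the reference count changes.  */
  oldn->refcount++;
  return true;
}
```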
[PATCH 6/10] OpenACC 2.0 support for libgomp - Fortran bits
This patch is by Thomas Schwinge and Jakub Jelinek, and was originally posted here: https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00656.html It adds some mappings required by the OpenACC implementation for Fortran. Julian -xx-xx Thomas Schwinge tho...@codesourcery.com Jakub Jelinek ja...@redhat.com * target.c (gomp_map_vars, gomp_unmap_vars, gomp_update): Support NULL mappings as well as mapping kind OMP_CLAUSE_MAP_TO_PSET. Also, some code reformatting. commit b661af0d60506bf174b687dbd0a590bacd0a4ed4 Author: Julian Brown jul...@codesourcery.com Date: Fri Sep 19 09:18:03 2014 -0700 Merge r212405. 2014-07-09 Thomas Schwinge tho...@codesourcery.com Jakub Jelinek ja...@redhat.com * target.c (gomp_map_vars, gomp_unmap_vars, gomp_update): Support NULL mappings as well as mapping kind OMP_CLAUSE_MAP_TO_PSET. Also, some code reformatting. diff --git a/libgomp/target.c b/libgomp/target.c index ef62228..64b787e 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -238,6 +238,11 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, gomp_mutex_lock (devicep-dev_env_lock); for (i = 0; i mapnum; i++) { + if (hostaddrs[i] == NULL) + { + tgt-list[i] = NULL; + continue; + } cur_node.host_start = (uintptr_t) hostaddrs[i]; if ((kinds[i] 7) != 4) cur_node.host_end = cur_node.host_start + sizes[i]; @@ -259,6 +264,22 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, tgt_align = align; tgt_size = (tgt_size + align - 1) ~(align - 1); tgt_size += cur_node.host_end - cur_node.host_start; + if ((kinds[i] 7) == 5) + { + size_t j; + for (j = i + 1; j mapnum; j++) + if ((kinds[j] 7) != 4) + break; + else if ((uintptr_t) hostaddrs[j] cur_node.host_start + || ((uintptr_t) hostaddrs[j] + sizeof (void *) + cur_node.host_end)) + break; + else + { + tgt-list[j] = NULL; + i++; + } + } } } @@ -281,10 +302,13 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, { tgt-array = gomp_malloc (not_found_cnt * sizeof (*tgt-array)); splay_tree_node array = 
tgt-array; + size_t j; for (i = 0; i mapnum; i++) if (tgt-list[i] == NULL) { + if (hostaddrs[i] == NULL) + continue; splay_tree_key k = array-key; k-host_start = (uintptr_t) hostaddrs[i]; if ((kinds[i] 7) != 4) @@ -324,14 +348,25 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, /* FIXME: Perhaps add some smarts, like if copying several adjacent fields from host to target, use some host buffer to avoid sending each var individually. */ - devicep-device_host2dev_func((void *) (tgt-tgt_start - + k-tgt_offset), - (void *) k-host_start, - k-host_end - k-host_start); + devicep-device_host2dev_func + ((void *) (tgt-tgt_start + k-tgt_offset), + (void *) k-host_start, + k-host_end - k-host_start); break; case 4: /* POINTER */ cur_node.host_start = (uintptr_t) *(void **) k-host_start; + if (cur_node.host_start == (uintptr_t) NULL) + { + cur_node.tgt_offset = (uintptr_t) NULL; + /* Copy from host to device memory. */ + /* FIXME: see above FIXME comment. */ + devicep-device_host2dev_func + ((void *) (tgt-tgt_start + k-tgt_offset), + (void *) cur_node.tgt_offset, + sizeof (void *)); + break; + } /* Add bias to the pointer value. */ cur_node.host_start += sizes[i]; cur_node.host_end = cur_node.host_start + 1; @@ -363,11 +398,86 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, cur_node.tgt_offset -= sizes[i]; /* Copy from host to device memory. */ /* FIXME: see above FIXME comment. */ - devicep-device_host2dev_func ((void *) (tgt-tgt_start - + k-tgt_offset), - (void *) cur_node.tgt_offset, - sizeof (void *)); + devicep-device_host2dev_func + ((void *) (tgt-tgt_start + k-tgt_offset), + (void *) cur_node.tgt_offset, + sizeof (void *)); break; + case 5: /* TO_PSET */ + /* Copy from host to device memory. */ + /* FIXME: see above FIXME comment. 
*/ + devicep-device_host2dev_func + ((void *) (tgt-tgt_start + k-tgt_offset), + (void *) k-host_start, + (k-host_end - k-host_start)); + for (j = i + 1; j mapnum; j++) + if ((kinds[j] 7) != 4) + break; + else if ((uintptr_t) hostaddrs[j] k-host_start + || ((uintptr_t) hostaddrs[j] + sizeof (void *) +k-host_end)) + break; + else + { + tgt-list[j] = k; + k-refcount++; + cur_node.host_start + = (uintptr_t) *(void **) hostaddrs[j]; + if (cur_node.host_start == (uintptr_t) NULL) + { + cur_node.tgt_offset = (uintptr_t
[PATCH 5/10] OpenACC 2.0 support for libgomp - offload image registration
This patch is by Ilya Verbin and was originally posted here: https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00591.html It implements a scheme for offloaded target-device code to register itself with the libgomp runtime. Julian -xx-xx Ilya Verbin ilya.ver...@intel.com * libgomp.map (GOMP_4.0.1): New symbol version. Add GOMP_offload_register. * plugin-host.c (device_available): Replace with: (get_num_devices): This. (get_type): New. (offload_register): Ditto. (device_init): Ditto. (device_get_table): Ditto. (device_run): Ditto. * target.c (target_type): New enum. (offload_image_descr): New struct. (offload_images, num_offload_images): New globals. (struct gomp_device_descr): Remove device_available_func. Add type, is_initialized, get_type_func, get_num_devices_func, offload_register_func, device_init_func, device_get_table_func, device_run_func. (mapping_table): New struct. (GOMP_offload_register): New function. (gomp_init_device): Ditto. (GOMP_target): Add device initialization and lookup for target fn. (GOMP_target_data): Add device initialization. (GOMP_target_update): Ditto. (gomp_load_plugin_for_device): Take handles for get_type, get_num_devices, offload_register, device_init, device_get_table, device_run functions. (gomp_register_images_for_device): New function. (gomp_find_available_plugins): Add registration of offload images. commit a8ad9504670363d8fd68e8e29f4a7455aae14446 Author: Julian Brown jul...@codesourcery.com Date: Fri Sep 19 09:16:11 2014 -0700 Merge r208657. 2014-03-18 Ilya Verbin ilya.ver...@intel.com * libgomp.map (GOMP_4.0.1): New symbol version. Add GOMP_offload_register. * plugin-host.c (device_available): Replace with: (get_num_devices): This. (get_type): New. (offload_register): Ditto. (device_init): Ditto. (device_get_table): Ditto. (device_run): Ditto. * target.c (target_type): New enum. (offload_image_descr): New struct. (offload_images, num_offload_images): New globals. (struct gomp_device_descr): Remove device_available_func. 
Add type, is_initialized, get_type_func, get_num_devices_func, offload_register_func, device_init_func, device_get_table_func, device_run_func. (mapping_table): New struct. (GOMP_offload_register): New function. (gomp_init_device): Ditto. (GOMP_target): Add device initialization and lookup for target fn. (GOMP_target_data): Add device initialization. (GOMP_target_update): Ditto. (gomp_load_plugin_for_device): Take handles for get_type, get_num_devices, offload_register, device_init, device_get_table, device_run functions. (gomp_register_images_for_device): New function. (gomp_find_available_plugins): Add registration of offload images. diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map index b102fd8..f36df23 100644 --- a/libgomp/libgomp.map +++ b/libgomp/libgomp.map @@ -227,3 +227,8 @@ GOMP_4.0 { GOMP_target_update; GOMP_teams; } GOMP_3.0; + +GOMP_4.0.1 { + global: + GOMP_offload_register; +} GOMP_4.0; diff --git a/libgomp/plugin-host.c b/libgomp/plugin-host.c index 5354ebe..ec0c78c 100644 --- a/libgomp/plugin-host.c +++ b/libgomp/plugin-host.c @@ -33,14 +33,53 @@ #include stdlib.h #include string.h -bool -device_available (void) +const int TARGET_TYPE_HOST = 0; + +int +get_type (void) { #ifdef DEBUG printf (libgomp plugin: %s:%s\n, __FILE__, __FUNCTION__); #endif - return true; + return TARGET_TYPE_HOST; +} + +int +get_num_devices (void) +{ +#ifdef DEBUG + printf (libgomp plugin: %s:%s\n, __FILE__, __FUNCTION__); +#endif + + return 1; +} + +void +offload_register (void *host_table, void *target_data) +{ +#ifdef DEBUG + printf (libgomp plugin: %s:%s (%p, %p)\n, __FILE__, __FUNCTION__, + host_table, target_data); +#endif +} + +void +device_init (void) +{ +#ifdef DEBUG + printf (libgomp plugin: %s:%s\n, __FILE__, __FUNCTION__); +#endif +} + +int +device_get_table (void *table) +{ +#ifdef DEBUG + printf (libgomp plugin: %s:%s (%p)\n, __FILE__, __FUNCTION__, table); +#endif + + return 0; } void * @@ -82,3 +121,16 @@ void *device_host2dev (void *dest, const void 
*src, size_t n) return memcpy (dest, src, n); } + +void +device_run (void *fn_ptr, void *vars) +{ +#ifdef DEBUG + printf (libgomp plugin: %s:%s (%p, %p)\n, __FILE__, __FUNCTION__, fn_ptr, + vars); +#endif + + void (*fn)(void *) = (void (*)(void *)) fn_ptr; + + fn (vars); +} diff --git a/libgomp/target.c b/libgomp/target.c index d0db4c2..ef62228 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -84,6 +84,26 @@ struct
[PATCH 8/10] OpenACC 2.0 support for libgomp - temporarily work around missing __builtin_acc_on_device
The patches implementing __builtin_acc_on_device are still in
processing. For the time being this patch removes the dependency on
that builtin in the OpenACC runtime.

Julian

-xx-xx  Julian Brown  jul...@codesourcery.com

    libgomp/
    * oacc-init.c (acc_on_device): Temporarily hard-code for host
    instead of using __builtin_acc_on_device.

commit b74fb2fcb435b646499e9558a64b3989b64ad943
Author: Julian Brown jul...@codesourcery.com
Date:   Fri Sep 19 11:28:11 2014 -0700

    Work around lack of __builtin_acc_on_device for now

diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index af2d2aa..35fe643 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -451,8 +451,20 @@ ialias (acc_set_device_num)
 int
 acc_on_device (acc_device_t dev)
 {
+#if 1
+  /* Support for __builtin_acc_on_device comes in later patches.  */
+  switch (dev)
+    {
+    case acc_device_none:
+    case acc_device_host:
+      return 1;
+    default:
+      return 0;
+    }
+#else
   /* Just rely on the compiler builtin.  */
   return __builtin_acc_on_device (dev);
+#endif
 }
 
 ialias (acc_on_device)
[PATCH 9/10] OpenACC 2.0 support for libgomp - outline documentation
This patch provides some documentation for the new OpenACC bits in libgomp. Julian -xx-xx Thomas Schwinge tho...@codesourcery.com James Norris jnor...@codesourcery.com libgomp/ * libgomp.texi: Outline documentation for OpenACC. commit c1b3a366e95ff50d8f30fb0e942c0c25a51108c7 Author: Julian Brown jul...@codesourcery.com Date: Mon Sep 22 02:45:29 2014 -0700 OpenACC documentation. diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index 254be57..9530a2b 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -31,10 +31,12 @@ texts being (a) (see below), and with the Back-Cover Texts being (b) @ifinfo @dircategory GNU Libraries @direntry -* libgomp: (libgomp).GNU OpenMP runtime library +* libgomp: (libgomp).GNU OpenACC and OpenMP runtime library @end direntry -This manual documents the GNU implementation of the OpenMP API for +This manual documents the GNU implementation of the OpenACC API for +offloading of code to accelerator devices in C/C++ and Fortran and +the GNU implementation of the OpenMP API for multi-platform shared-memory parallel programming in C/C++ and Fortran. Published by the Free Software Foundation @@ -48,7 +50,7 @@ Boston, MA 02110-1301 USA @setchapternewpage odd @titlepage -@title The GNU OpenMP Implementation +@title The GNU OpenACC and OpenMP Implementation @page @vskip 0pt plus 1filll @comment For the @value{version-GCC} Version* @@ -69,7 +71,10 @@ Boston, MA 02110-1301, USA@* @top Introduction @cindex Introduction -This manual documents the usage of libgomp, the GNU implementation of the +This manual documents the usage of libgomp, the GNU implementation of the +@uref{http://www.openacc.org/, OpenACC} Application Programming Interface (API) +for offloading of code to accelerator devices in C/C++ and Fortran, and +the GNU implementation of the @uref{http://www.openmp.org, OpenMP} Application Programming Interface (API) for multi-platform shared-memory parallel programming in C/C++ and Fortran. 
@@ -81,23 +86,619 @@ for multi-platform shared-memory parallel programming in C/C++ and Fortran. @comment better formatting. @comment @menu -* Enabling OpenMP::How to enable OpenMP for your applications. -* Runtime Library Routines:: The OpenMP runtime application programming - interface. -* Environment Variables:: Influencing runtime behavior with environment - variables. -* The libgomp ABI::Notes on the external ABI presented by libgomp. -* Reporting Bugs:: How to report bugs in GNU OpenMP. -* Copying::GNU general public license says - how you can copy and share libgomp. -* GNU Free Documentation License:: - How you can copy and share this manual. -* Funding::How to help assure continued work for free - software. -* Library Index:: Index of this documentation. +* Enabling OpenACC:: How to enable OpenACC for your + applications. +* OpenACC Runtime Library Routines:: The OpenACC runtime application + programming interface. +* OpenACC Environment Variables::Influencing OpenACC runtime behavior with + environment variables. +* OpenACC Library Interoperability:: OpenACC library interoperability with the + NVIDIA CUBLAS library. +* Enabling OpenMP:: How to enable OpenMP for your + applications. +* OpenMP Runtime Library Routines: Runtime Library Routines. + The OpenMP runtime application programming + interface. +* OpenMP Environment Variables: Environment Variables. + Influencing OpenMP runtime behavior with + environment variables. +* The libgomp ABI:: Notes on the external libgomp ABI. +* Reporting Bugs:: How to report bugs. +* Copying:: GNU general public license says how you + can copy and share libgomp. +* GNU Free Documentation License:: How you can copy and share this manual. +* Funding:: How to help assure continued work for free + software. +* Library Index::Index of this documentation. @end menu + +@c - +@c Enabling OpenACC +@c - + +@node Enabling OpenACC +@chapter Enabling OpenACC + +To activate the OpenACC extensions for C/C++ and Fortran
Re: GCC ARM: aligned access
On Mon, 1 Sep 2014 09:14:31 +0800 Peng Fan van.free...@gmail.com wrote:

On 09/01/2014 08:09 AM, Matt Thomas wrote:

On Aug 31, 2014, at 11:32 AM, Joel Sherrill joel.sherr...@oarcorp.com wrote:

I think this is totally expected. You were passed a u8 pointer which
is aligned for that type (no restrictions likely). You cast it to a
type with stricter alignment requirements. The code is just flawed.
Some CPUs handle unaligned accesses but not your ARM.

armv7 and armv6 arch except armv6-m support unaligned access.

a u8 pointer is casted to u32 pointer, should gcc take the align
problem into consideration to avoid possible errors? because
-mno-unaligned-access.

Using -munaligned-access (or its inverse) isn't enough to make GCC
generate code that can perform arbitrary unaligned accesses, because
several instructions (e.g. VFP loads/stores or load/store multiple
instructions IIRC) must still act on naturally-aligned data even when
the hardware flag to enable unaligned accesses is on, and those
instructions will still be generated by GCC when they are considered
safe, i.e. when not doing explicitly-unaligned accesses in packed
structures or similar.

It would be *possible* to add an option to the backend to allow
arbitrary alignment for any access, I think, but it's not at all clear
that it's a good idea, and would certainly negatively affect
performance. (If you need unaligned accesses, you can use e.g. memcpy,
and that will probably generate good inline code.)

Julian
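
For reference, the memcpy approach mentioned above looks something like this. It is a minimal sketch: load_u32 is just an illustrative name, and the value is read in host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value, in host byte order, from an address that may not
   be 4-byte aligned.  Unlike *(const uint32_t *) p, this makes no
   alignment promise to the compiler, so it is free to pick whatever
   load sequence is safe for the target.  */
static uint32_t
load_u32 (const uint8_t *p)
{
  uint32_t v;

  memcpy (&v, p, sizeof v);
  return v;
}
```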
Re: [PATCH] Fix GDB PR15559 (inferior calls using thiscall calling convention)
On Fri, 9 May 2014 17:33:41 +0100 Julian Brown jul...@codesourcery.com wrote: On Wed, 7 May 2014 09:41:27 -0600 Tom Tromey tro...@redhat.com wrote: Tom The usual approach is some appropriate text somewhere on the Tom GCC wiki (though I suppose a note in the mail archives would Tom do in a pinch) along with a URL in a comment in the Tom appropriate file (dwarf2.h or dwarf2.def). Tom Could you please do that? Julian How's this, as a first attempt? Julian http://gcc.gnu.org/wiki/GNUDwarfExtensions Sorry I didn't reply to this sooner. That page looks great. Thanks for doing this. Thanks! Now, does anyone want to review the patch itself? :-) Ping? Julian
Re: RTABI half-precision conversion functions (ping)
On Thu, 29 May 2014 11:16:52 +0100 Julian Brown jul...@codesourcery.com wrote:

On Thu, 19 Jul 2012 14:47:54 +0100 Julian Brown jul...@codesourcery.com wrote:

On Thu, 19 Jul 2012 13:54:57 +0100 Paul Brook p...@codesourcery.com wrote:

But, that means EABI-conformant callers are also perfectly entitled to
sign-extend half-float values before calling our helper functions
(although GCC itself won't do that). Using unsigned int and taking
care to only examine the low-order bits of the value in the helper
function itself serves to fix the latent bug, is compatible with
existing code, allows us to be conformant with the eabi, and allows
use of aliases to make the __gnu and __aeabi functions the same.

As long as LTO never sees this mismatch we should be fine :-) AFAIK we
don't currently have any way of expressing the actual ABI.

Let's not worry about that for now :-).

The patch no longer applied as-is, so I've updated it (attached,
re-tested). Note that there are no longer any target-independent
changes (though I'm not certain that the symbol versions are still
correct). OK to apply?

I think this deserves a comment in the source. Otherwise it's liable
to get fixed in the future :-) Something along the lines of

"While the EABI describes the arguments to the half-float helper
routines as 'short', it does not require that they be extended to full
register width. The normal ABI requires that the caller sign/zero
extend short values to 32 bit. We use unsigned int arguments to
prevent the gcc making assumptions about the high half of the
register."

Here's a version with an explanatory comment. I also fixed a couple of
minor formatting nits I noticed (they don't upset the diff too much, I
don't think).

It looks like this one got forgotten about. Ping?

Context:

https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00902.html
https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00912.html

This is an EABI-conformance fix. Ping?

Julian
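
The point argued in the quoted discussion (take the argument as unsigned int and look only at the low-order 16 bits) can be sketched as below. This is not the libgcc helper: half_to_float is a hypothetical name, and the sketch assumes the IEEE 754 binary16 encoding (not the ARM alternative half-precision format) and an IEEE binary32 float.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the libgcc implementation: widen an IEEE 754
   binary16 value passed in the low 16 bits of A to a float.  */
static float
half_to_float (unsigned int a)
{
  uint32_t h = a & 0xffffu;	/* Ignore whatever is in the high half.  */
  uint32_t sign = (h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1fu;
  uint32_t mant = h & 0x3ffu;
  uint32_t bits;
  float f;

  if (exp == 0x1f)		/* Infinity or NaN.  */
    bits = sign | 0x7f800000u | (mant << 13);
  else if (exp != 0)		/* Normal number: rebias from 15 to 127.  */
    bits = sign | ((exp + 112) << 23) | (mant << 13);
  else if (mant == 0)		/* Signed zero.  */
    bits = sign;
  else				/* Subnormal: normalise the mantissa.  */
    {
      int shift = 0;

      while ((mant & 0x400u) == 0)
	{
	  mant <<= 1;
	  shift++;
	}
      bits = sign | ((uint32_t) (113 - shift) << 23) | ((mant & 0x3ffu) << 13);
    }

  memcpy (&f, &bits, sizeof f);
  return f;
}
```

With the mask in place, a caller that sign-extends the half value (so that 0xc000 arrives as 0xffffc000) gets the same result as one that zero-extends it.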
Re: Handle MULTILIB_REUSE in auto-generated SYSROOT_SUFFIX_SPEC macro
On Thu, 5 Jun 2014 20:23:27 +0100 Julian Brown jul...@codesourcery.com wrote: Hi, The print-sysroot-suffix.sh script that can be used (via the t-sysroot-suffix makefile fragment) to auto-generate the SYSROOT_SUFFIX_SPEC macro for non-trivial multilib setups does not take into account the MULTILIB_REUSE target fragment variable. I'm not sure of a way to demonstrate how this causes problems with a vanilla tree, but consider the attached patch (arm-sysroot-mlib-arrangement-1.diff) intended to create a compiler with three multilibs: Ping? (Note that no in-tree targets use both print-sysroot-suffix.sh and MULTILIB_REUSE, AFAICT, so this patch is mostly useful to 3rd-party integrators.) Julian
Re: [PATCH, ARM] Don't use NEON for autovectorization in big-endian mode
On Mon, 16 Jun 2014 12:42:36 +0100 Julian Brown jul...@codesourcery.com wrote: Hi, As discussed several times previously, support for NEON in ARM big-endian mode is quite broken because of differing assumptions about lane ordering made by the ARM EABI and the set of NEON intrinsics on the one hand, and the vectorizer on the other. Fixing this properly would involve quite a large overhaul of the NEON backend implementation, and such an overhaul does not appear to be forthcoming. Unfortunately this leaves big-endian mode with a problem: even if the user is not explicitly using NEON intrinsics, compiling with NEON and the vectorizer enabled (i.e. -O3) can quite easily lead to incorrect code being generated. This is the patch we've been using internally for a while to work around the problem. When applied: Ping? Julian
[PATCH, ARM] Don't use NEON for autovectorization in big-endian mode
Hi,

As discussed several times previously, support for NEON in ARM
big-endian mode is quite broken because of differing assumptions about
lane ordering made by the ARM EABI and the set of NEON intrinsics on
the one hand, and the vectorizer on the other. Fixing this properly
would involve quite a large overhaul of the NEON backend
implementation, and such an overhaul does not appear to be
forthcoming.

Unfortunately this leaves big-endian mode with a problem: even if the
user is not explicitly using NEON intrinsics, compiling with NEON and
the vectorizer enabled (i.e. -O3) can quite easily lead to incorrect
code being generated.

This is the patch we've been using internally for a while to work
around the problem. When applied:

* We do not allow Neon vectors to be used for autovectorization.
  Vectorization is not disabled completely: ARM core registers (e.g.
  four chars packed into a core register) can still be used to
  vectorize loops in limited circumstances. I think this is mildly
  preferable to forcing -ftree-vectorize to be off entirely for
  big-endian NEON.

* Intrinsics are not touched. Those which attempt to mix generic
  vector operations with the ABI-defined vector types (i.e. those
  which are implemented with __builtin_shuffle) are, I think,
  technically incorrect -- but in the sense of two wrongs making a
  right, so the end result appears to work.

* Generic vectors (i.e. direct use of
  __attribute__((vector_size(foo))) types) will continue to behave
  strangely in big-endian mode.

This of course continues to be suboptimal, but at least in *the common
case* we stop generating bad code.

Testing in big-endian mode on user-space QEMU (ARMv7-A, NEON, softfp)
shows (apart from some noise) test diffs as attached. Notice the large
number of removed execution failures, in particular.

OK to apply?

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/arm.c (arm_array_mode_supported_p): No array modes for
    big-endian NEON.
    (arm_preferred_simd_mode): Don't use NEON vectors for
    autovectorization in big-endian mode.
    (arm_autovectorize_vector_sizes): Don't iterate over other vector
    sizes for big-endian NEON.

    gcc/testsuite/
    * lib/target-supports.exp (check_vect_support_and_set_flags): Don't
    run vect tests for big-endian ARM NEON.
    * gcc.target/arm/neon/vect-vcvt.c: XFAIL for !arm_little_endian.
    * gcc.target/arm/neon/vect-vcvtq.c: Likewise.
    * gcc.target/arm/neon-vshl-imm-1.c: Likewise.
    * gcc.target/arm/neon-vshr-imm-1.c: Likewise.
    * gcc.target/arm/neon-vmls-1.c: Likewise.
    * gcc.target/arm/neon-vmla-1.c: Likewise.
    * gcc.target/arm/neon-vfma-1.c: Likewise.
    * gcc.target/arm/neon-vfms-1.c: Likewise.
    * gcc.target/arm/neon-vorn-vbic.c: Likewise.
    * gcc.target/arm/neon-vlshr-imm-1.c: Likewise.
    * gcc.target/arm/neon-vcond-ltgt.c: Likewise.
    * gcc.target/arm/neon-vcond-gt.c: Likewise.
    * gcc.target/arm/neon-vcond-unordered.c: Likewise.

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 210209)
+++ gcc/config/arm/arm.c	(working copy)
@@ -28813,7 +28813,7 @@ static bool
 arm_array_mode_supported_p (enum machine_mode mode,
 			    unsigned HOST_WIDE_INT nelems)
 {
-  if (TARGET_NEON
+  if (TARGET_NEON && !BYTES_BIG_ENDIAN
       && (VALID_NEON_DREG_MODE (mode) || VALID_NEON_QREG_MODE (mode))
       && (nelems >= 2 && nelems <= 4))
     return true;
@@ -28828,7 +28828,7 @@ arm_array_mode_supported_p (enum machine
 static enum machine_mode
 arm_preferred_simd_mode (enum machine_mode mode)
 {
-  if (TARGET_NEON)
+  if (TARGET_NEON && !BYTES_BIG_ENDIAN)
     switch (mode)
       {
       case SFmode:
@@ -29845,7 +29845,8 @@ arm_vector_alignment (const_tree type)
 static unsigned int
 arm_autovectorize_vector_sizes (void)
 {
-  return TARGET_NEON_VECTORIZE_DOUBLE ? 0 : (16 | 8);
+  return (TARGET_NEON_VECTORIZE_DOUBLE || (TARGET_NEON && BYTES_BIG_ENDIAN))
+    ? 0 : (16 | 8);
 }
 
 static bool
Index: gcc/testsuite/gcc.target/arm/neon/vect-vcvtq.c
===================================================================
--- gcc/testsuite/gcc.target/arm/neon/vect-vcvtq.c	(revision 210209)
+++ gcc/testsuite/gcc.target/arm/neon/vect-vcvtq.c	(working copy)
@@ -24,5 +24,5 @@ int convert()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { ! arm_little_endian } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.target/arm/neon/vect-vcvt.c
===================================================================
--- gcc/testsuite/gcc.target/arm/neon/vect-vcvt.c	(revision 210209)
+++ gcc/testsuite/gcc.target/arm/neon/vect-vcvt.c	(working copy)
@@ -24,5 +24,5 @@ int convert()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { ! arm_little_endian }
Handle MULTILIB_REUSE in auto-generated SYSROOT_SUFFIX_SPEC macro
Hi,

The print-sysroot-suffix.sh script that can be used (via the
t-sysroot-suffix makefile fragment) to auto-generate the
SYSROOT_SUFFIX_SPEC macro for non-trivial multilib setups does not
take into account the MULTILIB_REUSE target fragment variable.

I'm not sure of a way to demonstrate how this causes problems with a
vanilla tree, but consider the attached patch
(arm-sysroot-mlib-arrangement-1.diff) intended to create a compiler
with three multilibs:

  .;                      (little-endian, soft float)
  be;@mbig-endian         (big-endian, soft float)
  vfp;@mfloat-abi=softfp  (little-endian, hardware FP)

Notice that we are not building a multilib for the be+vfp combination.
Instead we use the MULTILIB_REUSE macro to make that combination fall
back to using just the soft-float big-endian multilib:

  MULTILIB_REUSE = mbig-endian=mbig-endian/mfloat-abi.softfp

But now, compiling code will fail with errors such as:

  $ arm-none-linux-gnueabi-gcc hello.c -mbig-endian -mfloat-abi=softfp \
    -o hello
  ../arm-none-linux-gnueabi/bin/ld: /path/to/install/arm-none-linux-gnueabi/libc/usr/lib/libc.a(s_signbit.o): compiled for a little endian system and target is big endian

Invoking the compiler with -print-sysroot vs. -print-multi-directory
illustrates the problem:

  $ arm-none-linux-gnueabi-gcc hello.c -mbig-endian -mfloat-abi=softfp \
    -print-sysroot
  /path/to/install/arm-none-linux-gnueabi/libc
  $ arm-none-linux-gnueabi-gcc hello.c -mbig-endian -mfloat-abi=softfp \
    -print-multi-directory
  be

What we wanted was for the first command to give the same result that
invoking without -mfloat-abi=softfp does (which was the purpose of the
MULTILIB_REUSE setting):

  $ arm-none-linux-gnueabi-gcc hello.c -mbig-endian -print-sysroot
  /path/to/install/arm-none-linux-gnueabi/libc/be

but, that doesn't work at present.

The attached patch fixes that: it's based on a part of CodeSourcery's
earlier MULTILIB_ALIASES support (by Paul Brook originally, I think --
I don't think it ever made it upstream, but it worked quite similarly
to MULTILIB_REUSE, that did), and allows the above multilib
arrangement to work correctly.

OK for mainline? (The ARM bits are for reference only and are not
meant to be committed, of course.)

Thanks,

Julian

ChangeLog

    gcc/
    * config/print-sysroot-suffix.sh: Handle MULTILIB_REUSE settings.
    * config/t-sysroot-suffix (sysroot-suffix.h): Pass MULTILIB_REUSE to
    print-sysroot-suffix.sh script.

Index: gcc/config.gcc
===================================================================
--- gcc/config.gcc	(revision 210209)
+++ gcc/config.gcc	(working copy)
@@ -1014,7 +1014,9 @@ arm*-*-linux-*)	# ARM GNU/Linux with E
 		;;
 	esac
 	tmake_file="${tmake_file} arm/t-arm arm/t-arm-elf arm/t-bpabi arm/t-linux-eabi"
+	tmake_file="$tmake_file t-sysroot-suffix"
 	tm_file="$tm_file arm/bpabi.h arm/linux-eabi.h arm/aout.h arm/arm.h"
+	tm_file="$tm_file ./sysroot-suffix.h"
 	# Define multilib configuration for arm-linux-androideabi.
 	case ${target} in
 	*-androideabi)
Index: gcc/config/arm/t-linux-eabi
===================================================================
--- gcc/config/arm/t-linux-eabi	(revision 210209)
+++ gcc/config/arm/t-linux-eabi	(working copy)
@@ -20,8 +20,15 @@
 # CLEAR_INSN_CACHE in linux-gas.h does not work in Thumb mode.
 # If you set MULTILIB_OPTIONS to a non-empty value you should also set
 # MULTILIB_DEFAULTS in linux-elf.h.
-MULTILIB_OPTIONS = -MULTILIB_DIRNAMES = +MULTILIB_OPTIONS = mbig-endian mfloat-abi=softfp +MULTILIB_DIRNAMES = be vfp +MULTILIB_OSDIRNAMES = mbig-endian=!be mfloat-abi.softfp=!vfp +MULTILIB_MATCHES = +MULTILIB_EXCEPTIONS = + +MULTILIB_REUSE = mbig-endian=mbig-endian/mfloat-abi.softfp + +MULTILIB_REQUIRED = mbig-endian mfloat-abi=softfp #MULTILIB_OPTIONS += mcpu=fa606te/mcpu=fa626te/mcpu=fmp626/mcpu=fa726te #MULTILIB_DIRNAMES+= fa606te fa626te fmp626 fa726te Index: gcc/config/print-sysroot-suffix.sh === Index: gcc/config/t-sysroot-suffix === --- gcc/config/print-sysroot-suffix.sh (revision 210209) +++ gcc/config/print-sysroot-suffix.sh (working copy) @@ -29,6 +29,7 @@ # MULTILIB_OSDIRNAMES \ # MULTILIB_OPTIONS \ # MULTILIB_MATCHES \ +# MULTILIB_REUSE # t-sysroot-suffix.h # The three options exactly correspond to the variables of the same @@ -54,6 +55,7 @@ set -e dirnames=$1 options=$2 matches=$3 +reuse=$4 cat print-sysroot-suffix3.sh \EOF #! /bin/sh @@ -80,7 +82,14 @@ shift 2 n=\ \\ $padding\ if [ $# = 0 ]; then + case $optstring in EOF +for x in $reuse; do + l=`echo $x | sed -e 's/=.*$//' -e 's/\./=/g'` + r=`echo $x | sed -e 's/^.*=//' -e 's/\./=/g'` + echo /$r/) optstring=\/$l/\ ;; print-sysroot-suffix2.sh +done +echo esac print-sysroot-suffix2.sh pat= for x in $dirnames; do --- gcc/config/t-sysroot-suffix (revision 210209) +++