On 2019/11/8 8:55 PM, Chung-Lin Tang wrote:
On 2019/10/8 10:05 PM, Thomas Schwinge wrote:
Hi Chung-Lin!

While we're all waiting for Tom to comment on this;-)  -- here's another
item I realized:

On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_t...@mentor.com> wrote:
The libgomp nvptx plugin changes are also quite contained, with lots of
now unneeded [...] code deleted (since we no longer first cuAlloc a
buffer for the argument record before cuLaunchKernel)
It would be nice;-)  -- but unless I'm confused, it's not that simple: we
either have to reject (force host-fallback execution) or keep supporting
"old-style" nvptx offloading code: new-libgomp has to continue to work
with nvptx offloading code once generated by old-GCC.  Possibly even a
mixture of old and new nvptx offloading code, if libraries are involved,
huh!

I have not completely thought that through, but I suppose this could be
addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
synthesized by nvptx 'mkoffload'?

Hi Thomas, Tom,
I've looked at the problem; it is unfortunate that we overlooked the
need for versioning of NVPTX images and did not reserve a field in
'struct nvptx_tdata' for something like this.

But how about something like:

typedef struct nvptx_tdata
{
   const struct targ_ptx_obj *ptx_objs;
   unsigned ptx_num;

   unsigned ptx_version;         /* <==== Add version field here.  */

   const char *const *var_names;
   unsigned var_num;

   const struct targ_fn_launch *fn_descs;
   unsigned fn_num;
} nvptx_tdata_t;

We currently only support x86_64 and powerpc64le hosts, which are both
LP64 targets.

Assuming that, the position above where I put the new 'ptx_version' field
is already a 32-bit alignment hole, so adding it does not change the layout
of the other fields, and in the static 'target_data' variable generated by
mkoffload it should be zeroed in currently circulating binaries (unless
binutils is not doing the intuitive thing...).

If these assumptions are safe, then we can treat existing images as having
ptx_version == 0, and from now on bump it to 1 for these new nvptx
convention changes.

(We can do a similar thing in 'struct targ_fn_launch' if we want to
differentiate at a per-function level.)
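
To make the layout argument concrete, here is a small stand-alone C sketch
(not part of the patch; 'old_tdata'/'new_tdata' are hypothetical names and
the pointer members are simplified to generic pointers) that checks, under
the LP64 assumption above, that the new field only occupies the padding
hole after 'ptx_num':

#include <stdio.h>
#include <stddef.h>

struct old_tdata            /* Layout emitted by current mkoffload.  */
{
  const void *ptx_objs;
  unsigned ptx_num;         /* A 4-byte padding hole follows on LP64.  */
  const char *const *var_names;
  unsigned var_num;
  const void *fn_descs;
  unsigned fn_num;
};

struct new_tdata            /* Layout with the proposed version field.  */
{
  const void *ptx_objs;
  unsigned ptx_num;
  unsigned ptx_version;     /* Occupies the former padding hole.  */
  const char *const *var_names;
  unsigned var_num;
  const void *fn_descs;
  unsigned fn_num;
};

int
main (void)
{
  /* On x86_64/powerpc64le these print equal pairs (16/16, 40/40, 48/48),
     i.e. all other offsets and the total size are unchanged, so old images
     read through the new struct simply see ptx_version == 0.  */
  printf ("%zu %zu\n", offsetof (struct old_tdata, var_names),
          offsetof (struct new_tdata, var_names));
  printf ("%zu %zu\n", offsetof (struct old_tdata, fn_num),
          offsetof (struct new_tdata, fn_num));
  printf ("%zu %zu\n", sizeof (struct old_tdata), sizeof (struct new_tdata));
  return 0;
}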

Any considerations?

Hi Tom, Thomas,
as a concept, here is a version of what I mentioned above.  The
_exec/_async_exec plugin hooks now switch between code versions based on
the image version.
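
As a rough illustration of what the two conventions amount to at the CUDA
driver API level (this is a hand-written sketch, not the plugin code:
'launch_v0'/'launch_v1' are hypothetical names, error checking and real
launch dimensions are omitted, and 'fun', 'devaddrs', 'mapnum' stand in
for values the plugin already has at launch time):

#include <stddef.h>
#include <alloca.h>
#include <cuda.h>

static void
launch_v0 (CUfunction fun, void **devaddrs, size_t mapnum, CUstream stream)
{
  /* Old convention: the offloaded function takes a single pointer to a
     device buffer holding the argument record, so that buffer has to be
     allocated and filled before the launch, and freed afterwards.  */
  CUdeviceptr dp;
  cuMemAlloc (&dp, mapnum * sizeof (void *));
  cuMemcpyHtoD (dp, devaddrs, mapnum * sizeof (void *));

  void *kargs[1] = { &dp };
  cuLaunchKernel (fun, 1, 1, 1, 32, 1, 1, 0, stream, kargs, NULL);

  cuStreamSynchronize (stream);
  cuMemFree (dp);
}

static void
launch_v1 (CUfunction fun, void **devaddrs, size_t mapnum, CUstream stream)
{
  /* New convention: the OMP data record fields are "exploded" into
     individual .param kernel arguments, so each mapping is passed directly
     to cuLaunchKernel and no intermediate device buffer is needed.  */
  void **kargs = alloca (mapnum * sizeof (void *));
  for (size_t i = 0; i < mapnum; i++)
    kargs[i] = &devaddrs[i];

  cuLaunchKernel (fun, 1, 1, 1, 32, 1, 1, 0, stream, kargs, NULL);
}

The plugin hooks in the patch below pick one of these paths by reading
'ptx_version' from the image's target data.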

Thanks,
Chung-Lin

        gcc/
        * config/nvptx/mkoffload.c (process): Add 'ptx_version' field to
        generated struct nvptx_tdata, and initialize it to '1'.
        * config/nvptx/nvptx.c (nvptx_expand_to_rtl_hook): New function
        implementing the CUDA .param space transformation.
        (TARGET_EXPAND_TO_RTL_HOOK): Implement hook with
        nvptx_expand_to_rtl_hook.

        libgomp/
        * plugin/plugin-nvptx.c (struct nvptx_tdata): Add 'ptx_version' field.
        (struct targ_fn_descriptor): Add 'image' field.
        (struct ptx_image_data): Adjust 'target_data' to be proper pointer
        type of 'const nvptx_tdata_t *'.
        (nvptx_exec): Adjust arguments, add kernel argument setup code,
        adjust cuLaunchKernel calling code.
        (GOMP_OFFLOAD_load_image): Remove now unneeded pointer cast for
        target_data, initialize 'image' link for each function descriptor,
        move adding of new_image to dev->images until after everything
        is set up.
        (openacc_exec_v0): Rename from old GOMP_OFFLOAD_openacc_exec.
        (openacc_async_exec_v0): Rename from old
        GOMP_OFFLOAD_openacc_async_exec.
        (GOMP_OFFLOAD_openacc_exec): Switch between v0/v1 versions of code.
        (GOMP_OFFLOAD_openacc_async_exec): Likewise.
        (openacc_exec_v1): New function.
        (openacc_async_exec_v1): Likewise.





Index: gcc/config/nvptx/mkoffload.c
===================================================================
--- gcc/config/nvptx/mkoffload.c        (revision 278656)
+++ gcc/config/nvptx/mkoffload.c        (working copy)
@@ -310,12 +310,13 @@ process (FILE *in, FILE *out)
           "static const struct nvptx_tdata {\n"
           "  const struct ptx_obj *ptx_objs;\n"
           "  unsigned ptx_num;\n"
+          "  unsigned char ptx_version;\n"
           "  const char *const *var_names;\n"
           "  unsigned var_num;\n"
           "  const struct nvptx_fn *fn_names;\n"
           "  unsigned fn_num;\n"
           "} target_data = {\n"
-          "  ptx_objs, sizeof (ptx_objs) / sizeof (ptx_objs[0]),\n"
+          "  ptx_objs, sizeof (ptx_objs) / sizeof (ptx_objs[0]), 1,\n"
           "  var_mappings,"
           "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
           "  func_mappings,"
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c    (revision 278656)
+++ gcc/config/nvptx/nvptx.c    (working copy)
@@ -68,6 +68,10 @@
 #include "attribs.h"
 #include "tree-vrp.h"
 #include "tree-ssa-operands.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "tree-cfg.h"
+#include "gimple-ssa.h"
 #include "tree-ssanames.h"
 #include "gimplify.h"
 #include "tree-phinodes.h"
@@ -6463,6 +6467,226 @@ nvptx_set_current_function (tree fndecl)
   oacc_bcast_partition = 0;
 }
 
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+   
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+                       DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+        launched makes this transformation only valid for OpenACC currently.
+        TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+        needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+                          DECL_ATTRIBUTES (current_function_decl)))
+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+        to the OMP data record, or a single void* argument (when no values
+        passed)  */
+      gcc_assert (VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl)))
+                 && (DECL_CHAIN (omp_data_arg) == NULL_TREE
+                     && ((TREE_CODE (argtype) == REFERENCE_TYPE
+                          && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+                         || (TREE_CODE (argtype) == POINTER_TYPE
+                             && TREE_TYPE (argtype) == void_type_node))));
+      if (dump_file)
+       {
+         fprintf (dump_file, "Detected offloaded child function %s, "
+                  "starting parameter conversion\n",
+                  print_generic_expr_to_str (current_function_decl));
+         fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n",
+                  print_generic_expr_to_str (omp_data_arg),
+                  print_generic_expr_to_str (argtype));
+         fprintf (dump_file, "Data record fields:\n");
+       }
+      
+      hash_map<tree,tree> fld_to_args;
+      tree fld, rectype = TREE_TYPE (argtype);
+      tree arglist = NULL_TREE, argtypelist = NULL_TREE;
+
+      if (TREE_CODE (rectype) == RECORD_TYPE)
+       {
+         /* For each field in the OMP data record type, create a corresponding
+            PARM_DECL, and map field -> parm using the fld_to_args hash_map.
+            Also create the tree chains for creating function type and
+            DECL_ARGUMENTS below.  */
+         for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld))
+           {
+             tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL,
+                                     DECL_NAME (fld), TREE_TYPE (fld));
+             DECL_ARTIFICIAL (narg) = 1;
+             DECL_ARG_TYPE (narg) = TREE_TYPE (fld);
+             DECL_CONTEXT (narg) = current_function_decl;
+             TREE_USED (narg) = 1;
+             TREE_READONLY (narg) = 1;
+
+             if (dump_file)
+               fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n",
+                        print_generic_expr_to_str (fld),
+                        print_generic_expr_to_str (TREE_TYPE (fld)),
+                        print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)),
+                        print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld)));
+             fld_to_args.put (fld, narg);
+
+             TREE_CHAIN (narg) = arglist;
+             arglist = narg;
+             argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg),
+                                      argtypelist);
+           }
+         arglist = nreverse (arglist);
+         argtypelist = nreverse (argtypelist);
+       }
+      /* This is needed to not be mistaken for a stdarg function.  */
+      argtypelist = chainon (argtypelist, void_list_node);
+
+      if (dump_file)
+       {
+         fprintf (dump_file, "Function before OMP data arg replaced:\n");
+         dump_function_to_file (current_function_decl, dump_file, dump_flags);
+       }
+
+      /* Actually modify the tree type and DECL_ARGUMENTS here.  */
+      TREE_TYPE (current_function_decl) = build_function_type (void_type_node,
+                                                              argtypelist);
+      DECL_ARGUMENTS (current_function_decl) = arglist;
+
+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+        scanning and skipping those entries, creating a new local_decls list.
+        We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+       {
+         if (DECL_HAS_VALUE_EXPR_P (decl))
+           {
+             tree t = DECL_VALUE_EXPR (decl);
+             if (TREE_CODE (t) == MEM_REF
+                 && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+                 && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+                 && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+                     == omp_data_arg))
+               continue;
+           }
+         vec_safe_push (new_local_decls, decl);
+       }
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+      
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+        the above created fld_to_args hash map, convert them to reads of
+        function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+         {
+           tree val, *val_ptr = NULL;
+           gimple *stmt = gsi_stmt (gsi);
+           if (is_gimple_assign (stmt)
+               && gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+             val_ptr = gimple_assign_rhs1_ptr (stmt);
+           else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+             val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+           if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+             continue;
+
+           tree new_val = NULL_TREE, fld = NULL_TREE;
+
+           if (TREE_CODE (val) == COMPONENT_REF
+               && TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+               && (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+                   == SSA_NAME)
+               && (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+                   == omp_data_arg))
+             {
+               /* .omp_data->FIELD case.  */
+               fld = TREE_OPERAND (val, 1);
+               new_val = *fld_to_args.get (fld);
+             }
+           else if (TREE_CODE (val) == MEM_REF
+                    && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+                    && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+             {
+               /* This case may happen in the final tree level optimization
+                  output, due to SLP:
+                  vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+                  Therefore here we need a more elaborate search of the field
+                  list to reverse map to which field the offset is referring
+                  to.  */
+               unsigned HOST_WIDE_INT offset
+                 = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+               for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+                    i != fld_to_args.end (); ++i)
+                 {
+                   tree cur_fld = (*i).first;
+                   tree cur_arg = (*i).second;
+                   gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+                   unsigned HOST_WIDE_INT cur_offset =
+                     (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+                      + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+                         / BITS_PER_UNIT));
+
+                   if (offset == cur_offset)
+                     {
+                       new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+                                         cur_arg);
+                       break;
+                     }
+                 }
+             }
+
+           /* If we found the corresponding OMP data record field, replace the
+              RHS with the new created PARM_DECL.  */
+           if (new_val != NULL_TREE)
+             {
+               if (dump_file)
+                 {
+                   fprintf (dump_file, "For gimple stmt: ");
+                   print_gimple_stmt (dump_file, stmt, 0);
+                   fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+                            print_generic_expr_to_str (val),
+                            print_generic_expr_to_str (new_val));
+                 }
+               /* Write in looked up ARG as new RHS value.  */
+               *val_ptr = new_val;
+             }
+         }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+       if (SSA_NAME_VAR (name) == omp_data_arg)
+         (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+       {
+         fprintf (dump_file, "Function after OMP data arg replaced: ");
+         dump_function_to_file (current_function_decl, dump_file, dump_flags);
+       }
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6605,6 +6829,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c       (revision 278656)
+++ libgomp/plugin/plugin-nvptx.c       (working copy)
@@ -239,6 +239,8 @@ typedef struct nvptx_tdata
   const struct targ_ptx_obj *ptx_objs;
   unsigned ptx_num;
 
+  unsigned char ptx_version;
+
   const char *const *var_names;
   unsigned var_num;
 
@@ -254,12 +256,13 @@ struct targ_fn_descriptor
   const struct targ_fn_launch *launch;
   int regs_per_thread;
   int max_threads_per_block;
+  struct ptx_image_data *image;
 };
 
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
-  const void *target_data;
+  const nvptx_tdata_t *target_data;
   CUmodule module;
 
   struct targ_fn_descriptor *fns;  /* Array of functions.  */
@@ -695,16 +698,30 @@ link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-           unsigned *dims, void *targ_mem_desc,
-           CUdeviceptr dp, CUstream stream)
+           unsigned *dims, CUdeviceptr dp, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
+  void *kargs[1] = { &dp };
+  void **kernel_args = kargs;
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
 
+  if (__builtin_expect (dp == 0, true))
+    {
+      /* This is the newer "exploded" CUDA parameter case.  */
+      GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
+      if (mapnum > 0)
+       {
+         kernel_args = alloca (mapnum * sizeof (void *));
+         for (int i = 0; i < mapnum; i++)
+           kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+       }
+    }
+  else
+    kernel_args = kargs;
+
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -936,11 +953,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
                                            api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
                    dims[GOMP_DIM_GANG], 1, 1,
                    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-                   0, stream, kargs, 0);
+                   0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1232,15 +1248,10 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
   *target_table = targ_tbl;
 
   new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
-  new_image->target_data = target_data;
+  new_image->target_data = (const nvptx_tdata_t *) target_data;
   new_image->module = module;
   new_image->fns = targ_fns;
 
-  pthread_mutex_lock (&dev->image_lock);
-  new_image->next = dev->images;
-  dev->images = new_image;
-  pthread_mutex_unlock (&dev->image_lock);
-
   for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
     {
       CUfunction function;
@@ -1257,11 +1268,17 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
       targ_fns->launch = &fn_descs[i];
       targ_fns->regs_per_thread = nregs;
       targ_fns->max_threads_per_block = mthrs;
+      targ_fns->image = new_image;
 
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
     }
 
+  pthread_mutex_lock (&dev->image_lock);
+  new_image->next = dev->images;
+  dev->images = new_image;
+  pthread_mutex_unlock (&dev->image_lock);
+
   for (j = 0; j < var_entries; j++, targ_tbl++)
     {
       CUdeviceptr var;
@@ -1344,10 +1361,9 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
          && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
-                          void **hostaddrs, void **devaddrs,
-                          unsigned *dims, void *targ_mem_desc)
+static void
+openacc_exec_v0 (void (*fn) (void *), size_t mapnum,
+                void **hostaddrs, void **devaddrs, unsigned *dims)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1407,8 +1423,7 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
        }
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-             dp, NULL);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, dp, NULL);
 
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1424,6 +1439,43 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 }
 
 static void
+openacc_exec_v1 (void (*fn) (void *), size_t mapnum,
+                void **hostaddrs, void **devaddrs, unsigned *dims)
+{
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, 0, NULL);
+
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
+  const char *maybe_abort_msg = "(perhaps abort was called)";
+  if (r == CUDA_ERROR_LAUNCH_FAILED)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+                      maybe_abort_msg);
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+}
+
+void
+GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
+                          void **hostaddrs, void **devaddrs,
+                          unsigned *dims, void *targ_mem_desc)
+{
+  struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
+  unsigned ptx_version = targ_fn->image->target_data->ptx_version;
+
+  if (__builtin_expect (ptx_version == 1, true))
+    openacc_exec_v1 (fn, mapnum, hostaddrs, devaddrs, dims);
+  else
+    switch (ptx_version)
+      {
+      case 0:
+       openacc_exec_v0 (fn, mapnum, hostaddrs, devaddrs, dims);
+       break;
+      default:
+       GOMP_PLUGIN_fatal ("Unsupported PTX image code version '%u'\n",
+                          ptx_version);
+      };
+}
+
+static void
 cuda_free_argmem (void *ptr)
 {
   void **block = (void **) ptr;
@@ -1431,11 +1483,10 @@ cuda_free_argmem (void *ptr)
   free (block);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
-                                void **hostaddrs, void **devaddrs,
-                                unsigned *dims, void *targ_mem_desc,
-                                struct goacc_asyncqueue *aq)
+static void
+openacc_async_exec_v0 (void (*fn) (void *), size_t mapnum,
+                      void **hostaddrs, void **devaddrs,
+                      unsigned *dims, struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1504,13 +1555,43 @@ cuda_free_argmem (void *ptr)
        }
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-             dp, aq->cuda_stream);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, dp, aq->cuda_stream);
 
   if (mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
+static void
+openacc_async_exec_v1 (void (*fn) (void *), size_t mapnum,
+                      void **hostaddrs, void **devaddrs,
+                      unsigned *dims, struct goacc_asyncqueue *aq)
+{
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, 0, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+                                void **hostaddrs, void **devaddrs,
+                                unsigned *dims, void *targ_mem_desc,
+                                struct goacc_asyncqueue *aq)
+{
+  struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
+  unsigned ptx_version = targ_fn->image->target_data->ptx_version;
+
+  if (__builtin_expect (ptx_version == 1, true))
+    openacc_async_exec_v1 (fn, mapnum, hostaddrs, devaddrs, dims, aq);
+  else
+    switch (ptx_version)
+      {
+      case 0:
+       openacc_async_exec_v0 (fn, mapnum, hostaddrs, devaddrs, dims, aq);
+       break;
+      default:
+       GOMP_PLUGIN_fatal ("Unsupported PTX image code version '%u'\n",
+                          ptx_version);
+      };
+}
+
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
