In my examination of BabelStream results on AMD GCN, I've found that,
for each BabelStream kernel execution, we spend significant time in
allocating and initializing memory in gomp_map_vars (~55µs, whereas the
actual BabelStream code executes in ~746µs, meaning we increase the time
BabelStream measures by 7% just on that).
Upon further examination, I've found that the only reason gomp_map_vars
decides to allocate and map any memory in the first place is because it
is constructing the table of pointers to variables on the target, which
I've taken to calling the "target variable table". Given that the GCN
plugin already must perform some memory allocation before starting up a
kernel, namely to allocate kernel arguments, it would be beneficial if
we could merge this allocation with the kernel arguments allocation.
This patch enables that: a plugin can provide the
GOMP_OFFLOAD_get_max_host_target_var_table_size function, which returns
the number of extra bytes that the plugin can merge into its allocations
if need be.
libgomp will, if it determines that the number of bytes in the target
variable table is less than this value, skip allocating target memory
for the target variable table, and will instead place a copy of it in
host memory. The plugin will be given a pointer to this host memory,
rather than the otherwise-allocated target memory, from which it will do
whatever it deems necessary.
In the case of AMD GCN, the plugin declares that any size table can be
merged with kernel arguments (as allocating kernel arguments is always
necessary[1]), and will copy this table into the kernel arguments, and
substitute the pointer to it with a pointer to kernel arguments on the
target side. This eliminates the runtime of gomp_map_vars
near-entirely.
[1] It may be beneficial to reduce this anyway, to some fixed amount, to
make it so that the future allocation cache has a higher cache hit
rate. It may also depend on whether hsa_memory_allocate for kernel
arguments takes runtime proportional to the number of bytes it needs
to allocate.
include/ChangeLog:
* gomp-constants.h (GOMP_VERSION): Bump, due to change in
interface of
GOMP_OFFLOAD_{{async_,}run,openacc_exec,openacc_async_exec}.
libgomp/ChangeLog:
* libgomp-plugin.h (GOMP_OFFLOAD_run): Add new size_t parameter
HOST_VARS_SIZE.
(GOMP_OFFLOAD_async_run): Ditto.
(GOMP_OFFLOAD_openacc_exec): Ditto.
(GOMP_OFFLOAD_openacc_async_exec): Ditto.
(GOMP_OFFLOAD_get_max_host_target_var_table_size): New function.
Returns maximum size (in bytes) of target variable table that
should be passed to plugin host-side.
* libgomp.h (struct gomp_target_task): Add host_tvt field.
Holds pointer to host-side target variable table, for freeing
when task is finished.
(struct gomp_device_descr): Add field for
GOMP_OFFLOAD_get_max_host_target_var_table_size function
pointer, and a variable in which to cache its result.
(gomp_get_host_tvt_size_for_dev): New. Defined in target.c.
* oacc-host.c (host_run): Update signature due to changes in
libgomp-plugin.h.
(host_openacc_exec): Ditto.
(host_openacc_async_exec): Ditto.
* target.c (gomp_get_tvt_size): New function. Returns size of
target variable table for MAPNUM mappings.
(gomp_map_vars_internal): Add new parameter
TARGET_VARIABLE_TABLE. If non-NULL, points to host-side
allocation of the target variable table. Update logic for
allocating target-side target variable table in the function to
match.
<target variable table population>: If target_variable_table
is non-NULL, populate it with our pointers using memcpy.
(gomp_map_vars): Update to pass through TARGET_VARIABLE_TABLE.
(goacc_map_vars): Ditto.
(gomp_get_host_tvt_size_for_dev): New function. Returns number
of bytes that should be allocated for the host-side target
variable table, if it should be allocated on the host, or zero
otherwise.
(GOMP_target): Add logic to allocate and clean up host-side
target variable table when needed.
(GOMP_target_ext): Ditto.
(gomp_target_task_fn): Ditto.
(gomp_target_data_fallback): Update call to gomp_map_vars due to
new parameter.
(GOMP_target_data): Ditto.
(GOMP_target_data_ext): Ditto.
(GOMP_target_enter_exit_data): Ditto.
(gomp_load_plugin_for_device): Load
GOMP_OFFLOAD_get_max_host_target_var_table_size, and populate
max_host_tvt_size accordingly.
* oacc-mem.c (acc_map_data): Update call to gomp_map_vars due to
new parameter.
(goacc_enter_datum): Ditto.
(goacc_enter_data_internal): Ditto.
* oacc-parallel.c (GOACC_parallel_keyed): Add logic to allocate
and clean up host-side target variable table when needed.
(GOACC_data_start): Update call to goacc_map_vars due to
new parameter.
* plugin/plugin-gcn.c (struct kernargs): Add
target_variable_table FAM to populate with the host-side target
variable table.
(create_kernel_dispatch): Take a HOST_VARS_SIZE parameter, and
allocate extra HOST_VARS_SIZE for target variable table on end
of kernargs.
(run_kernel): Take extra HOST_VARS_SIZE. Pass it to
create_kernel_dispatch. Populate
kernargs->target_variable_table if HOST_VARS_SIZE != 0, and
populate kernargs pointer to VARS accordingly.
(destroy_module): Update call to run_kernel to pass 0 as
HOST_VARS_SIZE (as there's no VARS, there's no host-side target
variable table).
(GOMP_OFFLOAD_load_image): Ditto.
(struct kernel_launch): Add HOST_VARS_SIZE field, to carry that
information into the async queue.
(execute_queue_entry): Pass that field to run_kernel.
(queue_push_launch): Take an extra argument to populate that
field.
(gcn_exec): Take HOST_VARS_SIZE parameter in order to pass it
along to queue_push_launch and run_kernel.
(GOMP_OFFLOAD_run): Take and pass along HOST_VARS_SIZE.
(GOMP_OFFLOAD_async_run): Ditto.
(GOMP_OFFLOAD_openacc_exec): Ditto.
(GOMP_OFFLOAD_openacc_async_exec): Ditto.
(GOMP_OFFLOAD_get_max_host_target_var_table_size): Implement.
Returns max size_t unless GCN_INHIBIT_KERNARGS_TVT_MERGE env var
is set (for testing).
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_run): Assert that target
variable table is not host-side.
* testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c:
New test.
* testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c:
New test.
---
include/gomp-constants.h | 2 +-
libgomp/libgomp-plugin.h | 63 ++++++-
libgomp/libgomp.h | 15 +-
libgomp/oacc-host.c | 9 +-
libgomp/oacc-mem.c | 8 +-
libgomp/oacc-parallel.c | 39 ++++-
libgomp/plugin/plugin-gcn.c | 105 +++++++++---
libgomp/plugin/plugin-nvptx.c | 6 +-
libgomp/target.c | 159 ++++++++++++++----
libgomp/task.c | 1 +
.../gcn-kernel-launch-no-tvt-alloc.c | 51 ++++++
.../gcn-kernel-launch-tvt-alloc.c | 16 ++
12 files changed, 398 insertions(+), 76 deletions(-)
create mode 100644
libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
create mode 100644
libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index 0a0761043f96..8304ae839fd1 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -332,7 +332,7 @@ enum gomp_map_kind
/* Versions of libgomp and device-specific plugins. GOMP_VERSION
should be incremented whenever an ABI-incompatible change is introduced
to the plugin interface defined in libgomp/libgomp.h. */
-#define GOMP_VERSION 3
+#define GOMP_VERSION 4
#define GOMP_VERSION_NVIDIA_PTX 1
#define GOMP_VERSION_GCN 3
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index bb4d577b66d7..2f8488afa142 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -188,11 +188,65 @@ extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t,
size_t, size_t, void *,
size_t);
extern bool GOMP_OFFLOAD_memset (int, void *, int, size_t);
extern bool GOMP_OFFLOAD_can_run (void *);
-extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
-extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *);
+extern void GOMP_OFFLOAD_run (int, void *, void *, void **, size_t);
+extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *,
+ size_t);
+
+/* Note [Host-side target variable table]
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The target variable table is a table generated by gomp_map_vars and related
+ functions which contains pointers and values, passed as the third argument
+ to GOMP_OFFLOAD_run (often named VARS), and to other similar functions.
+ Normally, this table is located in target memory, and populated by
+ gomp_map_vars_internal. It is placed at the very start of the memory of a
+ target memory descriptor.
+
+ However, in cases where there is no target memory to map, since all
+ non-OFFSET_INLINE variables are already on the device, this table will be
+ the only memory allocated for a given target memory descriptor by
+ gomp_map_vars_internal.
+
+ This is sub-optimal, as device allocations are quite slow, and as writing to
+ device memory is also quite slow.
+
+ On some offload targets, like AMD GCN, the target plugin needs to perform
+ additional allocation anyway in order to be able to launch a kernel.
+
+ It would be useful if the target variable table allocation could be merged
+ with such unavoidable allocations.
+
+ A plugin can request that this happens by implementing the optional
+ function GOMP_OFFLOAD_get_max_host_target_var_table_size (whose return value
+ is presumed zero if not implemented). This function returns the maximum
+ number of bytes that the plugin can merge into its unavoidable allocations
+ for purposes of passing the target variable table.
+
+ For instance, if a plugin declares that it can merge (8 * sizeof (void*))
+ bytes into its unavoidable allocations, and the target variable table is of
+ size (6 * sizeof (void*)), libgomp will not produce new device memory for
+ the host variable table.
+
+ Instead, it will allocate plain host memory of (6 * sizeof (void*)), and
+ pass this allocation to GOMP_OFFLOAD_run and co. rather than the
+ device-side allocation of the target variable table.
+
+ The plugin can distinguish between these two cases based on the fifth
+ argument to GOMP_OFFLOAD_run (named HOST_VARS_SIZE). Iff this value is
+ non-zero, then VARS is a host-side allocation of size HOST_VARS_SIZE. The
+ plugin shall then ensure that the kernel it launches gets a pointer to the
+ table with the contents at VARS of size HOST_VARS_SIZE. In the above
+ example, on AMD GCN, this is done by the table being added to the end of the
+ GCN kernel arguments; the contents of VARS get copied into extra room in
+ the kernel arguments allocation, and then the pointer to this extra room
+ gets substituted for VARS.
+
+ Note that the ownership of the host-allocated VARS is not transferred to the
+ plugin. The caller is responsible for cleaning it up. */
+extern size_t GOMP_OFFLOAD_get_max_host_target_var_table_size (void);
extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
- void **, unsigned *, void *);
+ void **, unsigned *, void *, size_t);
extern void *GOMP_OFFLOAD_openacc_create_thread_data (int);
extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (int);
@@ -205,7 +259,8 @@ extern void GOMP_OFFLOAD_openacc_async_queue_callback
(struct goacc_asyncqueue *
void (*)(void *), void
*);
extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void
**,
void **, unsigned *, void *,
- struct goacc_asyncqueue *);
+ struct goacc_asyncqueue *,
+ size_t);
extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *,
size_t,
struct goacc_asyncqueue *);
extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *,
size_t,
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 42f324392957..0d981e599e57 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -770,7 +770,12 @@ struct gomp_target_task
struct target_mem_desc *tgt;
struct gomp_task *task;
struct gomp_team *team;
+
/* Device-specific target arguments. */
+
+ /* If non-NULL, to be freed upon TASK_FINISH. This pointer points to the
+ host-side target variable table that was given to plugins. */
+ void **host_tvt;
void **args;
void *hostaddrs[];
};
@@ -1410,6 +1415,10 @@ struct gomp_device_descr
/* This is the TYPE of device. */
enum offload_target_type type;
+ /* Maximum size of host-side target variable table the device can support.
+ See Note [Host-side target variable table]. */
+ size_t max_host_tvt_size;
+
/* Function handlers. */
__typeof (GOMP_OFFLOAD_get_name) *get_name_func;
__typeof (GOMP_OFFLOAD_get_uid) *get_uid_func;
@@ -1437,6 +1446,8 @@ struct gomp_device_descr
__typeof (GOMP_OFFLOAD_can_run) *can_run_func;
__typeof (GOMP_OFFLOAD_run) *run_func;
__typeof (GOMP_OFFLOAD_async_run) *async_run_func;
+ __typeof (GOMP_OFFLOAD_get_max_host_target_var_table_size) *
+ get_max_host_target_var_table_size_func;
__typeof (GOMP_OFFLOAD_interop) *interop_func;
__typeof (GOMP_OFFLOAD_get_interop_int) *get_interop_int_func;
__typeof (GOMP_OFFLOAD_get_interop_ptr) *get_interop_ptr_func;
@@ -1493,7 +1504,9 @@ extern struct target_mem_desc *goacc_map_vars (struct
gomp_device_descr *,
struct goacc_asyncqueue *,
size_t, void **, void **,
size_t *, void *, bool,
- enum gomp_map_vars_kind);
+ enum gomp_map_vars_kind,
+ void **);
+extern size_t gomp_get_host_tvt_size_for_dev (struct gomp_device_descr *,
size_t);
extern void goacc_unmap_vars (struct target_mem_desc *, bool,
struct goacc_asyncqueue *);
extern void gomp_init_device (struct gomp_device_descr *);
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 028a5c943b7e..f5b8d74efb98 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -129,7 +129,8 @@ host_host2dev (int n __attribute__ ((unused)),
static void
host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars,
- void **args __attribute__((unused)))
+ void **args __attribute__((unused)),
+ size_t host_tvt_size __attribute__((unused)))
{
void (*fn)(void *) = (void (*)(void *)) fn_ptr;
@@ -142,7 +143,8 @@ host_openacc_exec (void (*fn) (void *),
void **hostaddrs,
void **devaddrs __attribute__ ((unused)),
unsigned *dims __attribute__ ((unused)),
- void *targ_mem_desc __attribute__ ((unused)))
+ void *targ_mem_desc __attribute__ ((unused)),
+ size_t host_tvt_size __attribute__ ((unused)))
{
fn (hostaddrs);
}
@@ -154,7 +156,8 @@ host_openacc_async_exec (void (*fn) (void *),
void **devaddrs __attribute__ ((unused)),
unsigned *dims __attribute__ ((unused)),
void *targ_mem_desc __attribute__ ((unused)),
- struct goacc_asyncqueue *aq __attribute__ ((unused)))
+ struct goacc_asyncqueue *aq __attribute__ ((unused)),
+ size_t host_tvt_size __attribute__ ((unused)))
{
fn (hostaddrs);
}
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 738281f5701c..5601daf13957 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -421,7 +421,7 @@ acc_map_data (void *h, void *d, size_t s)
struct target_mem_desc *tgt
= goacc_map_vars (acc_dev, NULL, mapnum, &hostaddrs, &devaddrs, &sizes,
- &kinds, true, GOMP_MAP_VARS_ENTER_DATA);
+ &kinds, true, GOMP_MAP_VARS_ENTER_DATA, NULL);
assert (tgt);
assert (tgt->list_count == 1);
splay_tree_key n = tgt->list[0].key;
@@ -586,7 +586,7 @@ goacc_enter_datum (void **hostaddrs, size_t *sizes, void
*kinds, int async)
struct target_mem_desc *tgt
= goacc_map_vars (acc_dev, aq, mapnum, hostaddrs, NULL, sizes,
- kinds, true, GOMP_MAP_VARS_ENTER_DATA);
+ kinds, true, GOMP_MAP_VARS_ENTER_DATA, NULL);
assert (tgt);
assert (tgt->list_count == 1);
n = tgt->list[0].key;
@@ -1225,7 +1225,7 @@ goacc_enter_data_internal (struct gomp_device_descr
*acc_dev, size_t mapnum,
struct target_mem_desc *tgt_ __attribute__((unused))
= goacc_map_vars (acc_dev, aq, groupnum, &hostaddrs[i], NULL,
&sizes[i], &kinds[i], true,
- GOMP_MAP_VARS_ENTER_DATA);
+ GOMP_MAP_VARS_ENTER_DATA, NULL);
assert (tgt_ == NULL);
gomp_mutex_lock (&acc_dev->lock);
@@ -1276,7 +1276,7 @@ goacc_enter_data_internal (struct gomp_device_descr
*acc_dev, size_t mapnum,
struct target_mem_desc *tgt
= goacc_map_vars (acc_dev, aq, groupnum, &hostaddrs[i], NULL,
&sizes[i], &kinds[i], true,
- GOMP_MAP_VARS_ENTER_DATA);
+ GOMP_MAP_VARS_ENTER_DATA, NULL);
assert (tgt);
gomp_mutex_lock (&acc_dev->lock);
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 9f48c8b7f644..7fc14ca4bc2b 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -291,9 +291,28 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
goacc_aq aq = get_goacc_asyncqueue (async);
+ /* Prepare the host-side target variable table, if needed. For the logic
+ here, see Note [Host-side target variable table]. */
+ size_t host_tvt_size = gomp_get_host_tvt_size_for_dev (acc_dev, mapnum);
+ void **host_tvt = NULL;
+ if (host_tvt_size)
+ {
+ /* We need a table. */
+ if (aq)
+ /* Async execution. We need to heap-allocate. */
+ host_tvt = malloc (host_tvt_size);
+ else
+ /* We're executing sync. Prefer faster stack allocation. */
+ host_tvt = alloca (host_tvt_size);
+ }
+
+ if (!host_tvt)
+ /* Seems that the allocation failed. Fall back to the usual route. */
+ host_tvt_size = 0;
+
struct target_mem_desc *tgt
= goacc_map_vars (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds, true,
- GOMP_MAP_VARS_TARGET);
+ GOMP_MAP_VARS_TARGET, host_tvt);
if (profiling_p)
{
@@ -304,13 +323,13 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
&api_info);
}
- void **devaddrs = (void **) tgt->tgt_start;
+ void **devaddrs = host_tvt_size ? host_tvt : (void **) tgt->tgt_start;
if (aq == NULL)
acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, dims,
- tgt);
+ tgt, host_tvt_size);
else
acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
- dims, tgt, aq);
+ dims, tgt, aq, host_tvt_size);
if (profiling_p)
{
@@ -324,6 +343,14 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
/* If running synchronously (aq == NULL), this will unmap immediately. */
goacc_unmap_vars (tgt, true, aq);
+ if (host_tvt && aq)
+ {
+ /* Clean up the host-side target variable table if it was allocated on
+ the heap. Note that this implies that the execution of the kernel may
+ not have happened yet, so we must enqueue the cleanup. */
+ acc_dev->openacc.async.queue_callback_func (aq, free, host_tvt);
+ }
+
if (profiling_p)
{
prof_info.event_type = acc_ev_exit_data_end;
@@ -454,7 +481,7 @@ GOACC_data_start (int flags_m, size_t mapnum,
{
prof_info.device_type = acc_device_host;
api_info.device_type = prof_info.device_type;
- tgt = goacc_map_vars (NULL, NULL, 0, NULL, NULL, NULL, NULL, true, 0);
+ tgt = goacc_map_vars (NULL, NULL, 0, NULL, NULL, NULL, NULL, true, 0,
NULL);
tgt->prev = thr->mapped_data;
thr->mapped_data = tgt;
@@ -463,7 +490,7 @@ GOACC_data_start (int flags_m, size_t mapnum,
gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__);
tgt = goacc_map_vars (acc_dev, NULL, mapnum, hostaddrs, NULL, sizes, kinds,
- true, 0);
+ true, 0, NULL);
gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__);
tgt->prev = thr->mapped_data;
thr->mapped_data = tgt;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 3f0577fa82e6..3b94b825cdbd 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -303,6 +303,10 @@ struct kernargs {
/* Output data. */
struct output output_data;
+
+ /* Target variable table. Size determined by gomp_map_vars. See Note
+ [Host-side target variable table]. */
+ void *target_variable_table[];
};
/* A queue entry for a future asynchronous launch. */
@@ -312,6 +316,7 @@ struct kernel_launch
struct kernel_info *kernel;
void *vars;
struct GOMP_kernel_launch_attributes kla;
+ size_t host_vars_size;
};
/* A queue entry for a future callback. */
@@ -2005,11 +2010,14 @@ alloc_by_agent (struct agent_info *agent, size_t size)
}
/* Create kernel dispatch data structure for given KERNEL, along with
- the necessary device signals and memory allocations. */
+ the necessary device signals and memory allocations.
+
+ For HOST_VARS_SIZE, see Note [Host-side target variable table]. */
static struct kernel_dispatch *
create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
- int num_threads)
+ int num_threads,
+ size_t host_vars_size)
{
struct agent_info *agent = kernel->agent;
struct kernel_dispatch *shadow
@@ -2055,7 +2063,7 @@ create_kernel_dispatch (struct kernel_info *kernel, int
num_teams,
}
status = hsa_fns.hsa_memory_allocate_fn (agent->kernarg_region,
- sizeof (struct kernargs),
+ sizeof (struct kernargs) +
host_vars_size,
&shadow->kernarg_address);
if (status != HSA_STATUS_SUCCESS)
hsa_fatal ("Could not allocate memory for GCN kernel arguments", status);
@@ -2268,12 +2276,17 @@ init_kernel (struct kernel_info *kernel)
MODULE_LOCKED indicates that the caller already holds the lock and
run_kernel need not lock it again.
- If AQ is NULL then agent->sync_queue will be used. */
+ If AQ is NULL then agent->sync_queue will be used.
+
+ HOST_VARS_SIZE is the size of the table VARS points to in host memory, that
+ ought to be copied into kernel arguments, or zero, if VARS is in target
+ memory. See Note [Host-side target variable table]. */
static void
run_kernel (struct kernel_info *kernel, void *vars,
struct GOMP_kernel_launch_attributes *kla,
- struct goacc_asyncqueue *aq, bool module_locked)
+ struct goacc_asyncqueue *aq, bool module_locked,
+ size_t host_vars_size)
{
struct agent_info *agent = kernel->agent;
GCN_DEBUG ("SGPRs: %d, VGPRs: %d\n", kernel->description->sgpr_count,
@@ -2394,7 +2407,8 @@ run_kernel (struct kernel_info *kernel, void *vars,
struct kernel_dispatch *shadow
= create_kernel_dispatch (kernel, packet->grid_size_x,
- packet->grid_size_z);
+ packet->grid_size_z,
+ host_vars_size);
shadow->queue = command_q;
if (debug)
@@ -2411,7 +2425,25 @@ run_kernel (struct kernel_info *kernel, void *vars,
s.handle = shadow->signal;
packet->completion_signal = s;
hsa_fns.hsa_signal_store_relaxed_fn (s, 1);
- memcpy (shadow->kernarg_address, &vars, sizeof (vars));
+
+ /* See Note [Host-side target variable table]. */
+ struct kernargs *kernargs = packet->kernarg_address;
+ if (host_vars_size != 0)
+ {
+ /* TODO(arsen): It would be nice to avoid this copy, by somehow letting
+ the caller allocate the kernel arguments along with this table. This
+ would require a large refactor, and doing this first also yields
+ significant improvements, so this is a decent start.
+
+ A design that does so needs to decouple kernel argument allocation
+ from the actual kernel-executing functions. */
+ void *(*tvt_addr)[] = &kernargs->target_variable_table;
+ memcpy (tvt_addr, vars, host_vars_size);
+ memcpy (kernargs, &tvt_addr, sizeof (tvt_addr));
+ }
+ else
+ /* The table is already on the device. No need to copy it. */
+ memcpy (kernargs, &vars, sizeof (vars));
GCN_DEBUG ("Copying kernel runtime pointer to kernarg_address\n");
@@ -2441,7 +2473,6 @@ run_kernel (struct kernel_info *kernel, void *vars,
}
console_output (kernel, shadow->kernarg_address, true);
- struct kernargs *kernargs = shadow->kernarg_address;
unsigned int return_value = (unsigned int)kernargs->output_data.return_value;
release_kernel_dispatch (shadow);
@@ -2756,7 +2787,7 @@ destroy_module (struct module_info *module, bool locked)
if (module->fini_array_func)
{
init_kernel (module->fini_array_func);
- run_kernel (module->fini_array_func, NULL, &kla, NULL, locked);
+ run_kernel (module->fini_array_func, NULL, &kla, NULL, locked, 0);
}
module->constructors_run_p = false;
@@ -2790,7 +2821,8 @@ execute_queue_entry (struct goacc_asyncqueue *aq, int
index)
aq->agent->device_id, aq->id, index);
run_kernel (entry->u.launch.kernel,
entry->u.launch.vars,
- &entry->u.launch.kla, aq, false);
+ &entry->u.launch.kla, aq, false,
+ entry->u.launch.host_vars_size);
if (DEBUG_QUEUES)
GCN_DEBUG ("Async thread %d:%d: Executing launch entry (%d) done\n",
aq->agent->device_id, aq->id, index);
@@ -2948,11 +2980,14 @@ wait_for_queue_nonfull (struct goacc_asyncqueue *aq)
/* Request an asynchronous kernel launch on the specified queue. This
may block if the queue is full, but returns without waiting for the
- kernel to run. */
+ kernel to run.
+
+ For HOST_VARS_SIZE, see Note [Host-side target variable table]. */
static void
queue_push_launch (struct goacc_asyncqueue *aq, struct kernel_info *kernel,
- void *vars, struct GOMP_kernel_launch_attributes *kla)
+ void *vars, struct GOMP_kernel_launch_attributes *kla,
+ size_t host_vars_size)
{
assert (aq->agent == kernel->agent);
@@ -2970,6 +3005,7 @@ queue_push_launch (struct goacc_asyncqueue *aq, struct
kernel_info *kernel,
aq->queue[queue_last].u.launch.kernel = kernel;
aq->queue[queue_last].u.launch.vars = vars;
aq->queue[queue_last].u.launch.kla = *kla;
+ aq->queue[queue_last].u.launch.host_vars_size = host_vars_size;
aq->queue_n++;
@@ -3371,7 +3407,7 @@ managed_heap_create (struct agent_info *agent, size_t
size)
static void
gcn_exec (struct kernel_info *kernel,
void **devaddrs, unsigned *dims, void *targ_mem_desc, bool async,
- struct goacc_asyncqueue *aq)
+ struct goacc_asyncqueue *aq, size_t host_vars_size)
{
if (!GOMP_OFFLOAD_can_run (kernel))
GOMP_PLUGIN_fatal ("OpenACC host fallback unimplemented.");
@@ -3490,9 +3526,9 @@ gcn_exec (struct kernel_info *kernel,
}
if (!async)
- run_kernel (kernel, devaddrs, &kla, NULL, false);
+ run_kernel (kernel, devaddrs, &kla, NULL, false, host_vars_size);
else
- queue_push_launch (aq, kernel, devaddrs, &kla);
+ queue_push_launch (aq, kernel, devaddrs, &kla, host_vars_size);
if (profiling_dispatch_p)
{
@@ -4061,7 +4097,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const
void *target_data,
if (module->init_array_func)
{
init_kernel (module->init_array_func);
- run_kernel (module->init_array_func, NULL, &kla, NULL, false);
+ run_kernel (module->init_array_func, NULL, &kla, NULL, false, 0);
}
module->constructors_run_p = true;
@@ -5183,10 +5219,13 @@ GOMP_OFFLOAD_get_interop_type_desc (struct
interop_obj_t *obj,
/* Run a synchronous OpenMP kernel on DEVICE and pass it an array of pointers
in VARS as a parameter. The kernel is identified by FN_PTR which must point
to a kernel_info structure, and must have previously been loaded to the
- specified device. */
+ specified device.
+
+ For HOST_VARS_SIZE, see Note [Host-side target variable table]. */
void
-GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars, void **args)
+GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars, void **args,
+ size_t host_vars_size)
{
struct agent_info *agent = get_agent_info (device);
struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
@@ -5202,7 +5241,7 @@ GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars,
void **args)
GCN_WARNING ("Will not run GCN kernel because the grid size is zero\n");
return;
}
- run_kernel (kernel, vars, kla, NULL, false);
+ run_kernel (kernel, vars, kla, NULL, false, host_vars_size);
}
/* Run an asynchronous OpenMP kernel on DEVICE. This is similar to
@@ -5211,7 +5250,7 @@ GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars,
void **args)
void
GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
- void **args, void *async_data)
+ void **args, void *async_data, size_t host_vars_size)
{
GCN_DEBUG ("GOMP_OFFLOAD_async_run invoked\n");
struct agent_info *agent = get_agent_info (device);
@@ -5232,11 +5271,25 @@ GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void
*tgt_vars,
maybe_init_omp_async (agent);
if (!agent->omp_async_queue)
GOMP_PLUGIN_fatal ("Asynchronous queue initialization failed");
- queue_push_launch (agent->omp_async_queue, kernel, tgt_vars, kla);
+ queue_push_launch (agent->omp_async_queue, kernel, tgt_vars, kla,
+ host_vars_size);
queue_push_callback (agent->omp_async_queue,
GOMP_PLUGIN_target_task_completion, async_data);
}
+/* See Note [Host-side target variable table]. */
+
+size_t
+GOMP_OFFLOAD_get_max_host_target_var_table_size (void)
+{
+ if (secure_getenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE"))
+ /* Used for testing. */
+ return 0;
+
+ /* We can do any size. */
+ return -1;
+}
+
/* Allocate memory suitable for Managed Memory. */
void *
@@ -5380,11 +5433,12 @@ GOMP_OFFLOAD_openacc_exec (void (*fn_ptr) (void *),
size_t mapnum __attribute__((unused)),
void **hostaddrs __attribute__((unused)),
void **devaddrs, unsigned *dims,
- void *targ_mem_desc)
+ void *targ_mem_desc,
+ size_t host_vars_size)
{
struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
- gcn_exec (kernel, devaddrs, dims, targ_mem_desc, false, NULL);
+ gcn_exec (kernel, devaddrs, dims, targ_mem_desc, false, NULL,
host_vars_size);
}
/* Run an asynchronous OpenACC kernel on the specified queue. */
@@ -5395,11 +5449,12 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn_ptr) (void
*),
void **hostaddrs __attribute__((unused)),
void **devaddrs,
unsigned *dims, void *targ_mem_desc,
- struct goacc_asyncqueue *aq)
+ struct goacc_asyncqueue *aq,
+ size_t host_vars_size)
{
struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
- gcn_exec (kernel, devaddrs, dims, targ_mem_desc, true, aq);
+ gcn_exec (kernel, devaddrs, dims, targ_mem_desc, true, aq, host_vars_size);
}
/* Create a new asynchronous thread and queue for running future kernels. */
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index a540e9d4cce8..457ddc959d5b 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -2957,8 +2957,12 @@ GOMP_OFFLOAD_get_interop_type_desc (struct interop_obj_t
*obj,
}
void
-GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
+GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args,
+ size_t host_tgt_vars_size)
{
+ /* Not supported for NVPTX, at least not currently, as we don't provide
+ GOMP_OFFLOAD_get_max_host_target_var_table_size. */
+ assert (host_tgt_vars_size == 0);
struct targ_fn_descriptor *tgt_fn_desc
= (struct targ_fn_descriptor *) tgt_fn;
CUfunction function = tgt_fn_desc->fn;
diff --git a/libgomp/target.c b/libgomp/target.c
index 29e9a2c6367f..7308d912d681 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1173,14 +1173,31 @@ gomp_present_fatal (void *addr, size_t size, struct
gomp_device_descr *devicep)
#endif
}
+/* Get size of region required for target variable table for MAPNUM
+ mappings. */
+
+static inline size_t
+gomp_get_tvt_size (size_t mapnum)
+{
+ return mapnum * sizeof (void *);
+}
+
static inline __attribute__((always_inline)) struct target_mem_desc *
gomp_map_vars_internal (struct gomp_device_descr *devicep,
struct goacc_asyncqueue *aq, size_t mapnum,
void **hostaddrs, void **devaddrs, size_t *sizes,
void *kinds, bool short_mapkind,
htab_t *refcount_set,
- enum gomp_map_vars_kind pragma_kind)
+ enum gomp_map_vars_kind pragma_kind,
+ /* See Note [Host-side target variable table]. This
+ is non-NULL when there's a host-side target
+ variable table allocation. Must be sized according
+ to gomp_get_tvt_size. */
+ void **target_variable_table)
{
+ assert (/* target_variable_table implies target mapping. */
+ !target_variable_table
+ || (pragma_kind & GOMP_MAP_VARS_TARGET));
size_t i, tgt_align, tgt_size, not_found_cnt = 0;
bool has_firstprivate = false;
bool has_always_ptrset = false;
@@ -1202,6 +1219,12 @@ gomp_map_vars_internal (struct gomp_device_descr
*devicep,
tgt->prev = NULL;
struct gomp_coalesce_buf cbuf, *cbufp = NULL;
+ /* Non-zero if a new allocation for the target variable table is needed,
+ i.e. when we're mapping for a target offload region and haven't already
+ allocated it elsewhere. */
+ bool need_tvt_alloc = ((pragma_kind & GOMP_MAP_VARS_TARGET)
+ && !target_variable_table);
+
if (mapnum == 0)
{
tgt->tgt_start = 0;
@@ -1209,23 +1232,28 @@ gomp_map_vars_internal (struct gomp_device_descr
*devicep,
return tgt;
}
+ /* Initialize size tracking variables. */
tgt_align = sizeof (void *);
tgt_size = 0;
+
+ /* Prepare coalesce buffer. */
cbuf.chunks = NULL;
cbuf.chunk_cnt = -1;
cbuf.use_cnt = 0;
cbuf.buf = NULL;
- if (mapnum > 1 || (pragma_kind & GOMP_MAP_VARS_TARGET))
+ if (mapnum > 1 || need_tvt_alloc)
{
size_t chunks_size = (mapnum + 1) * sizeof (struct gomp_coalesce_chunk);
cbuf.chunks = (struct gomp_coalesce_chunk *) gomp_alloca (chunks_size);
cbuf.chunk_cnt = 0;
}
- if (pragma_kind & GOMP_MAP_VARS_TARGET)
+
+ if (need_tvt_alloc)
{
+ /* Prepare for allocating the target variable table. */
size_t align = 4 * sizeof (void *);
tgt_align = align;
- tgt_size = mapnum * sizeof (void *);
+ tgt_size = gomp_get_tvt_size (mapnum);
cbuf.chunk_cnt = 1;
cbuf.use_cnt = 1 + (mapnum > 1);
cbuf.chunks[0].start = 0;
@@ -1496,7 +1524,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
tgt->tgt_start = (uintptr_t) tgt->to_free;
tgt->tgt_end = tgt->tgt_start + sizes[0];
}
- else if (not_found_cnt || (pragma_kind & GOMP_MAP_VARS_TARGET))
+ else if (not_found_cnt || need_tvt_alloc || has_firstprivate)
{
/* Allocate tgt_align aligned tgt_size block of memory. */
/* FIXME: Perhaps change interface to allocate properly aligned
@@ -1534,7 +1562,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
}
tgt_size = 0;
- if (pragma_kind & GOMP_MAP_VARS_TARGET)
+ if (need_tvt_alloc)
tgt_size = mapnum * sizeof (void *);
tgt->array = NULL;
@@ -2080,10 +2108,16 @@ gomp_map_vars_internal (struct gomp_device_descr
*devicep,
if (!iterator_count || iterator_count[i] <= 1)
{
cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
- gomp_copy_host2dev (devicep, aq,
- (void *) (tgt->tgt_start + map_num * sizeof
(void *)),
- (void *) &cur_node.tgt_offset, sizeof (void *),
- true, cbufp);
+ if (need_tvt_alloc)
+ /* In this case, the table is on the device. */
+ gomp_copy_host2dev (devicep, aq,
+ (void *) (tgt->tgt_start + map_num * sizeof
(void *)),
+ (void *) &cur_node.tgt_offset, sizeof (void
*),
+ true, cbufp);
+ else
+ /* Otherwise, it's on the host. */
+ memcpy (&target_variable_table[map_num], &cur_node.tgt_offset,
+ sizeof (void *));
map_num++;
}
}
@@ -2133,7 +2167,8 @@ static struct target_mem_desc *
gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
bool short_mapkind, htab_t *refcount_set,
- enum gomp_map_vars_kind pragma_kind)
+ enum gomp_map_vars_kind pragma_kind,
+ void **target_variable_table)
{
/* This management of a local refcount_set is for convenience of callers
who do not share a refcount_set over multiple map/unmap uses. */
@@ -2147,7 +2182,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t
mapnum,
struct target_mem_desc *tgt;
tgt = gomp_map_vars_internal (devicep, NULL, mapnum, hostaddrs, devaddrs,
sizes, kinds, short_mapkind, refcount_set,
- pragma_kind);
+ pragma_kind, target_variable_table);
if (local_refcount_set)
htab_free (local_refcount_set);
@@ -2159,11 +2194,13 @@ goacc_map_vars (struct gomp_device_descr *devicep,
struct goacc_asyncqueue *aq, size_t mapnum,
void **hostaddrs, void **devaddrs, size_t *sizes,
void *kinds, bool short_mapkind,
- enum gomp_map_vars_kind pragma_kind)
+ enum gomp_map_vars_kind pragma_kind,
+ void **target_variable_table)
{
return gomp_map_vars_internal (devicep, aq, mapnum, hostaddrs, devaddrs,
sizes, kinds, short_mapkind, NULL,
- GOMP_MAP_VARS_OPENACC | pragma_kind);
+ GOMP_MAP_VARS_OPENACC | pragma_kind,
+ target_variable_table);
}
static void
@@ -3175,6 +3212,23 @@ gomp_get_target_fn_addr (struct gomp_device_descr
*devicep,
}
}
+/* Return number of char units that must be allocated for the host-size target
+ variable table for MAPNUM maps, if such a thing is supported by the device,
+ and if the amount would not exceed capabilities of the device. Returns zero
+ otherwise, in which case, the target variable table must be allocated on the
+ device.
+
+ See Note [Host-side target variable table]. */
+attribute_hidden size_t
+gomp_get_host_tvt_size_for_dev (struct gomp_device_descr *devicep,
+ size_t mapnum)
+{
+ size_t tvt_real_size = gomp_get_tvt_size (mapnum);
+ if (devicep->max_host_tvt_size < tvt_real_size)
+ return 0;
+ return tvt_real_size;
+}
+
/* Called when encountering a target directive. If DEVICE
is GOMP_DEVICE_ICV, it means use device-var ICV. If it is
GOMP_DEVICE_HOST_FALLBACK (or any value
@@ -3200,12 +3254,18 @@ GOMP_target (int device, void (*fn) (void *), const
void *unused,
|| !(fn_addr = gomp_get_target_fn_addr (devicep, fn)))
return gomp_target_fallback (fn, hostaddrs, devicep, NULL);
+ /* See Note [Host-side target variable table]. */
+ void **host_tvt = NULL;
+ size_t tvt_sz = gomp_get_host_tvt_size_for_dev (devicep, mapnum);
+ if (tvt_sz != 0)
+ host_tvt = alloca (tvt_sz);
+
htab_t refcount_set = htab_create (mapnum);
struct target_mem_desc *tgt_vars
= gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
- &refcount_set, GOMP_MAP_VARS_TARGET);
+ &refcount_set, GOMP_MAP_VARS_TARGET, host_tvt);
devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start,
- NULL);
+ NULL, tvt_sz);
htab_clear (refcount_set);
gomp_unmap_vars (tgt_vars, true, &refcount_set);
htab_free (refcount_set);
@@ -3524,6 +3584,8 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t
mapnum,
struct target_mem_desc *tgt_vars;
htab_t refcount_set = NULL;
+ void **host_tvt = NULL;
+ size_t tvt_sz = 0;
if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
{
if (!fpc_done)
@@ -3541,13 +3603,22 @@ GOMP_target_ext (int device, void (*fn) (void *),
size_t mapnum,
}
else
{
+ /* See Note [Host-side target variable table]. */
+ tvt_sz = gomp_get_host_tvt_size_for_dev (devicep, mapnum);
+ if (tvt_sz)
+ host_tvt = alloca (tvt_sz);
+
refcount_set = htab_create (mapnum);
tgt_vars = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds,
- true, &refcount_set, GOMP_MAP_VARS_TARGET);
+ true, &refcount_set, GOMP_MAP_VARS_TARGET,
+ host_tvt);
}
devicep->run_func (devicep->target_id, fn_addr,
- tgt_vars ? (void *) tgt_vars->tgt_start : hostaddrs,
- new_args);
+ host_tvt ? host_tvt
+ : tgt_vars ? (void *) tgt_vars->tgt_start
+ : hostaddrs,
+ new_args,
+ tvt_sz);
if (tgt_vars)
{
htab_clear (refcount_set);
@@ -4146,7 +4217,7 @@ gomp_target_data_fallback (struct gomp_device_descr
*devicep)
would get out of sync. */
struct target_mem_desc *tgt
= gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, false,
- NULL, GOMP_MAP_VARS_DATA);
+ NULL, GOMP_MAP_VARS_DATA, NULL);
tgt->prev = icv->target_data;
icv->target_data = tgt;
}
@@ -4165,7 +4236,7 @@ GOMP_target_data (int device, const void *unused, size_t
mapnum,
struct target_mem_desc *tgt
= gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
- NULL, GOMP_MAP_VARS_DATA);
+ NULL, GOMP_MAP_VARS_DATA, NULL);
struct gomp_task_icv *icv = gomp_icv (true);
tgt->prev = icv->target_data;
icv->target_data = tgt;
@@ -4184,7 +4255,7 @@ GOMP_target_data_ext (int device, size_t mapnum, void
**hostaddrs,
struct target_mem_desc *tgt
= gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
- NULL, GOMP_MAP_VARS_DATA);
+ NULL, GOMP_MAP_VARS_DATA, NULL);
struct gomp_task_icv *icv = gomp_icv (true);
tgt->prev = icv->target_data;
icv->target_data = tgt;
@@ -4497,7 +4568,7 @@ GOMP_target_enter_exit_data (int device, size_t mapnum,
void **hostaddrs,
{
gomp_map_vars (devicep, sizes[i] + 1, &hostaddrs[i], NULL, &sizes[i],
&kinds[i], true, &refcount_set,
- GOMP_MAP_VARS_ENTER_DATA);
+ GOMP_MAP_VARS_ENTER_DATA, NULL);
i += sizes[i];
}
else if ((kinds[i] & 0xff) == GOMP_MAP_TO_PSET)
@@ -4508,7 +4579,7 @@ GOMP_target_enter_exit_data (int device, size_t mapnum,
void **hostaddrs,
break;
gomp_map_vars (devicep, j-i, &hostaddrs[i], NULL, &sizes[i],
&kinds[i], true, &refcount_set,
- GOMP_MAP_VARS_ENTER_DATA);
+ GOMP_MAP_VARS_ENTER_DATA, NULL);
i += j - i - 1;
}
else if (i + 1 < mapnum
@@ -4519,12 +4590,12 @@ GOMP_target_enter_exit_data (int device, size_t mapnum,
void **hostaddrs,
/* An attach operation must be processed together with the mapped
base-pointer list item. */
gomp_map_vars (devicep, 2, &hostaddrs[i], NULL, &sizes[i], &kinds[i],
- true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA);
+ true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA, NULL);
i += 1;
}
else
gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i], &kinds[i],
- true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA);
+ true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA, NULL);
else
gomp_exit_data (devicep, mapnum, hostaddrs, sizes, kinds, &refcount_set);
htab_free (refcount_set);
@@ -4554,10 +4625,16 @@ gomp_target_task_fn (void *data)
{
if (ttask->tgt)
gomp_unmap_vars (ttask->tgt, true, NULL);
+
+ /* Allocated below. */
+ if (ttask->host_tvt)
+ free (ttask->host_tvt);
return false;
}
void *actual_arguments;
+ void **host_tvt = NULL;
+ size_t tvt_sz = 0;
if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
{
ttask->tgt = NULL;
@@ -4565,16 +4642,30 @@ gomp_target_task_fn (void *data)
}
else
{
+ /* See Note [Host-side target variable table]. */
+ tvt_sz = gomp_get_host_tvt_size_for_dev (devicep, ttask->mapnum);
+ if (tvt_sz)
+ ttask->host_tvt = host_tvt = malloc (tvt_sz);
+
ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
NULL, ttask->sizes, ttask->kinds, true,
- NULL, GOMP_MAP_VARS_TARGET);
- actual_arguments = (void *) ttask->tgt->tgt_start;
+ NULL, GOMP_MAP_VARS_TARGET,
+ host_tvt);
+ if (host_tvt)
+ actual_arguments = host_tvt;
+ else
+ {
+ actual_arguments = (void *) ttask->tgt->tgt_start;
+ /* Maybe the allocation failed. Reset the size, just in
+ case. */
+ tvt_sz = 0;
+ }
}
ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
assert (devicep->async_run_func);
devicep->async_run_func (devicep->target_id, fn_addr, actual_arguments,
- ttask->args, (void *) ttask);
+ ttask->args, (void *) ttask, tvt_sz);
return true;
}
else if (devicep == NULL
@@ -4596,13 +4687,13 @@ gomp_target_task_fn (void *data)
{
gomp_map_vars (devicep, ttask->sizes[i] + 1, &ttask->hostaddrs[i],
NULL, &ttask->sizes[i], &ttask->kinds[i], true,
- &refcount_set, GOMP_MAP_VARS_ENTER_DATA);
+ &refcount_set, GOMP_MAP_VARS_ENTER_DATA, NULL);
i += ttask->sizes[i];
}
else
gomp_map_vars (devicep, 1, &ttask->hostaddrs[i], NULL,
&ttask->sizes[i],
&ttask->kinds[i], true, &refcount_set,
- GOMP_MAP_VARS_ENTER_DATA);
+ GOMP_MAP_VARS_ENTER_DATA, NULL);
else
gomp_exit_data (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
ttask->kinds, &refcount_set);
@@ -6090,6 +6181,12 @@ gomp_load_plugin_for_device (struct gomp_device_descr
*device,
goto fail;
}
}
+
+ if (DLSYM_OPT (get_max_host_target_var_table_size,
+ get_max_host_target_var_table_size))
+ device->max_host_tvt_size =
device->get_max_host_target_var_table_size_func ();
+ else
+ device->max_host_tvt_size = 0;
#undef DLSYM
#undef DLSYM_OPT
diff --git a/libgomp/task.c b/libgomp/task.c
index cbba28516e3f..6e1559a84c76 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -951,6 +951,7 @@ gomp_create_target_task (struct gomp_device_descr *devicep,
ttask->devicep = devicep;
ttask->fn = fn;
ttask->mapnum = mapnum;
+ ttask->host_tvt = NULL;
memcpy (ttask->hostaddrs, hostaddrs, mapnum * sizeof (void *));
if (args_cnt)
{
diff --git
a/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
new file mode 100644
index 000000000000..7494c5a5f4c8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
@@ -0,0 +1,51 @@
+/* { dg-do run } */
+/* { dg-require-effective-target offload_device_gcn } */
+#include <assert.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int
+main ()
+{
+ setenv ("GCN_DEBUG", "1", true);
+#ifdef INHIBIT_KERNARGS_MERGE
+ /* See gcn-kernel-launch-tvt-alloc.c */
+ setenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE", "1", true);
+#endif
+
+ int i[1] = { 0 };
+
+#pragma omp target data map(tofrom: i[0:1])
+ {
+ fprintf (stderr, "================\n");
+
+#pragma omp target
+ { i[0] = 1; }
+ }
+
+ assert (i[0] == 1);
+}
+
+/* Here, we want to ensure that we have no allocations after the point
+ delimited by ===...
+
+ Past that point, the only data to map onto the device is the target
+ variable table, which should be passed as kernel arguments. The GCN plugin
+ currently does not log allocating those. We rely on that here.
+
+ So, dg-output lets us match the entire output with a regex. Multiple
+ dg-output invocations will have their regexes concatenated in order. The
+ following is that regex, broken down by function:
+
+ Ignore ===... marker and everything before it.
+ { dg-output {^.*================[\r\n]+} }
+ Then, each further line is either...
+ { dg-output {((} }
+ ... a line not starting with "GCN debug: "...
+ { dg-output {(?!GCN debug:)[^\r\n]+} }
+ ... or a "GCN debug: ..." line that is not an allocation:
+ { dg-output {|GCN debug: (?!Allocating )[^\r\n]*} }
+ ... followed by a line terminator, of course.
+ { dg-output {)[\r\n]+)*} }
+ There should be nothing left.
+ { dg-output {$} } */
diff --git
a/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
new file mode 100644
index 000000000000..ab5ed2dc4336
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-require-effective-target offload_device_gcn } */
+
+/* Test that the no-merge case still works. */
+
+#define INHIBIT_KERNARGS_MERGE
+#include "./gcn-kernel-launch-no-tvt-alloc.c"
+
+/* See commentary in <gcn-kernel-launch-no-tvt-alloc.c>.
+
+ Ignore ===... marker and everything before it.
+ { dg-output {^.*================[\r\n]+} }
+ We expect at least "GCN debug: Allocating \d+ bytes..."
+ { dg-output {.*[\r\n]+GCN debug: Allocating \d+ bytes.*} }
+ There should be nothing left.
+ { dg-output {$} } */
--
2.53.0