Arsen Arsenović <[email protected]> writes:

> In my examination of BabelStream results on AMD GCN, I've found that,
> for each BabelStream kernel execution, we spend significant time
> allocating and initializing memory in gomp_map_vars (~55µs, whereas the
> actual BabelStream code executes in ~746µs, meaning we increase the time
> BabelStream measures by 7% just on that).
>
> Upon further examination, I've found that the only reason gomp_map_vars
> decides to allocate and map any memory in the first place is because it
> is constructing the table of pointers to variables on the target, which
> I've taken to calling the "target variable table".  Given that the GCN
> plugin already must perform some memory allocation before starting up a
> kernel, namely to allocate kernel arguments, it would be beneficial if
> we could merge this allocation with the kernel arguments allocation.
>
> In addition, since the kernel arguments live in host memory, populating
> them can be performed using string functions, without any need to call
> for expensive host2dev copies.
>
> This patch introduces an opaque type for "offload sessions".  This type
> is defined by each plugin and allows it to store data related to a
> single offload job.  The sessions are allocated and managed by libgomp,
> and initialized and utilized by the plugin.  Their lifetime starts with
> a call to GOMP_OFFLOAD_session_start, and ends with
> GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.
>
> The patch then uses this framework to make management of the target
> variable table more flexible: the plugin may elect to implement
> GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
> to attempt to allocate the target variable table in host memory.
>
> If it fails, or if the plugin does not provide this function, libgomp
> will perform this allocation as it does today - in target memory - and
> tell the session about it using
> GOMP_OFFLOAD_session_set_target_var_table.
>
> In the case of AMD GCN, upon a call to
> GOMP_OFFLOAD_session_allocate_target_var_table, the plugin will
> immediately allocate kernel arguments with enough space for the target
> variable table, no matter what size libgomp asks for[1], and return
> that pointer to libgomp.
>
> This results in the runtime of gomp_map_vars effectively disappearing
> from traces.
>
> [1] It may be beneficial to limit this to some fixed amount, so that a
>     future allocation cache has a higher hit rate.  It may also depend
>     on whether hsa_memory_allocate for kernel arguments takes runtime
>     proportional to the number of bytes it needs to allocate.
>
> include/ChangeLog:
>
>       * gomp-constants.h (GOMP_VERSION): Bump.  Signature of
>       GOMP_OFFLOAD_run et al. changed.
>
> libgomp/ChangeLog:
>
>       * libgomp-plugin.h (GOMP_OFFLOAD_run, GOMP_OFFLOAD_exec)
>       (GOMP_OFFLOAD_async_run, GOMP_OFFLOAD_openacc_async_exec): Pass
>       session in place of target variable table and devices.
>       (struct gomp_offload_session): New.
>       (GOMP_OFFLOAD_session_size): New.
>       (GOMP_OFFLOAD_check_session_struct): New.
>       (GOMP_OFFLOAD_session_boilerplate): New.
>       (GOMP_OFFLOAD_session_start): New.
>       (GOMP_OFFLOAD_session_allocate_target_var_table): New.
>       (GOMP_OFFLOAD_session_set_target_var_table): New.
>       * libgomp.h (struct gomp_target_task): Add offload_session
>       field.
>       (struct gomp_device_descr): Add offload session management
>       functions.
>       (gomp_offload_session_new): New.
>       (goacc_map_vars): Add SESSION to signature.
>       * oacc-host.c (struct gomp_offload_session): Define, for host
>       offload fallback case.
>       (host_session_size): New.  Implements GOMP_OFFLOAD_session_size.
>       (host_session_start): New.  Implements
>       GOMP_OFFLOAD_session_start.
>       (host_session_set_target_var_table): New.  Implements
>       GOMP_OFFLOAD_session_set_target_var_table.
>       (host_run): Adjust to match GOMP_OFFLOAD_run.
>       (host_openacc_exec): Adjust to match GOMP_OFFLOAD_openacc_exec.
>       (host_openacc_async_exec): Adjust to match
>       GOMP_OFFLOAD_openacc_async_exec.
>       * oacc-mem.c (acc_map_data): Adjust call to goacc_map_vars.
>       (goacc_enter_datum): Ditto.
>       (goacc_enter_data_internal): Ditto.
>       * oacc-parallel.c (GOACC_parallel_keyed): Allocate and pass
>       offload session.
>       (GOACC_data_start): Adjust call to goacc_map_vars.
>       * plugin/plugin-gcn.c (struct kernel_dispatch): Remove
>       kernarg_cache_node.
>       (struct kernargs): Add a flexible array member for the target
>       variable table.
>       (struct kernel_launch): Store an offload session rather than
>       target var. table pointer.
>       (print_kernel_dispatch): Receive kernargs as parameter.
>       (struct gomp_offload_session): Define.
>       (init_session): New.
>       (GOMP_OFFLOAD_session_start): Implement, using init_session.
>       (release_session): New.
>       (alloc_kernargs_on_agent): Rename to...
>       (allocate_session_kernargs): ... this, store result in
>       passed-in SESSION, and allocate extra room for target variable
>       table (rounding it up to nearest multiple of 64 pointers).
>       (GOMP_OFFLOAD_session_allocate_target_var_table): Implement
>       using the previous function.
>       (GOMP_OFFLOAD_session_set_target_var_table): Ditto.
>       (create_kernel_dispatch): Remove kernarg allocation, instead
>       receiving it as an argument.
>       (release_kernel_dispatch): Receive kernargs as an argument,
>       don't release them.
>       (run_kernel): Adjust to use sessions.
>       (destroy_module): Ditto.
>       (GOMP_OFFLOAD_load_image): Ditto.
>       (execute_queue_entry): Adjust to match changed struct
>       kernel_launch.
>       (queue_push_launch): Ditto.
>       (gcn_exec): Receive and pass along session.
>       (GOMP_OFFLOAD_run): Ditto.
>       (GOMP_OFFLOAD_async_run): Ditto.
>       (GOMP_OFFLOAD_openacc_exec): Ditto.
>       (GOMP_OFFLOAD_openacc_async_exec): Ditto.
>       * plugin/plugin-nvptx.c (struct gomp_offload_session): Define.
>       (GOMP_OFFLOAD_session_start): Implement.
>       (GOMP_OFFLOAD_session_set_target_var_table): Implement.
>       (GOMP_OFFLOAD_openacc_exec): Adjust to receive session.
>       (GOMP_OFFLOAD_openacc_async_exec): Ditto.
>       (GOMP_OFFLOAD_run): Ditto.
>       * target.c (gomp_get_tvt_size): Extract helper from...
>       (gomp_map_vars_internal): ... here.  Receive SESSION, iff doing
>       target offload.  Use a target variable table on the host
>       allocated by GOMP_OFFLOAD_session_allocate_target_var_table if
>       possible, or call GOMP_OFFLOAD_session_set_target_var_table with
>       an allocated device pointer otherwise.
>       (gomp_map_vars): Update to pass along session.
>       (goacc_map_vars): Ditto.
>       (GOMP_target): Allocate and pass along session.
>       (GOMP_target_ext): Ditto.
>       (gomp_target_data_fallback): Adjust call to gomp_map_vars.
>       (GOMP_target_data): Ditto.
>       (GOMP_target_data_ext): Ditto.
>       (GOMP_target_enter_exit_data): Ditto.
>       (gomp_target_task_fn): Start and pass along session, the storage
>       for which is allocated by gomp_create_target_task.
>       (DLSYM2): Rename from DLSYM, adding a new parameter for the
>       variable to populate, akin to DLSYM_OPT.
>       (DLSYM): Delegate to DLSYM2.
>       (gomp_load_plugin_for_device): Populate session-related fields.
>       * task.c (gomp_create_target_task): Allocate enough storage for
>       an offload session.
>       * testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c:
>       New test.
>       * testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c:
>       New test.
> ---
>  include/gomp-constants.h                      |   2 +-
>  libgomp/libgomp-plugin.h                      |  81 +++++-
>  libgomp/libgomp.h                             |  27 +-
>  libgomp/oacc-host.c                           |  63 ++++-
>  libgomp/oacc-mem.c                            |   8 +-
>  libgomp/oacc-parallel.c                       |  24 +-
>  libgomp/plugin/plugin-gcn.c                   | 254 ++++++++++++------
>  libgomp/plugin/plugin-nvptx.c                 |  45 +++-
>  libgomp/target.c                              | 191 ++++++++-----
>  libgomp/task.c                                |  33 ++-
>  .../gcn-kernel-launch-no-tvt-alloc.c          |  51 ++++
>  .../gcn-kernel-launch-tvt-alloc.c             |  16 ++
>  12 files changed, 604 insertions(+), 191 deletions(-)
>  create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
>  create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c

Ping.
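
For anyone skimming the thread, here is a rough sketch of the fallback
flow described in the quoted message.  It is not code from the patch:
every name prefixed with sketch_ is hypothetical, plain calloc stands in
for both the kernarg allocation and the device allocation, and the
rounding to a multiple of 64 pointers only mirrors what the ChangeLog
says about allocate_session_kernargs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a plugin-defined offload session.  */
struct sketch_session
{
  void **tvt;        /* Target variable table.  */
  bool tvt_in_host;  /* True if the plugin hook provided the table.  */
};

/* Hypothetical optional plugin hook; NULL means "not implemented".  */
typedef void **(*sketch_alloc_tvt_fn) (struct sketch_session *, size_t);

/* Round a pointer count up to the next multiple of 64.  */
static size_t
sketch_round_up_64 (size_t n)
{
  return (n + 63) / 64 * 64;
}

/* Plugin side: allocate room for the table together with (here,
   imaginary) kernel arguments.  calloc stands in for the real
   allocator.  */
static void **
sketch_plugin_alloc_tvt (struct sketch_session *s, size_t n_ptrs)
{
  (void) s;
  assert (n_ptrs > 0);
  return calloc (sketch_round_up_64 (n_ptrs), sizeof (void *));
}

/* Stand-in for a device allocation done by the runtime.  */
static void **
sketch_device_alloc (size_t n_ptrs)
{
  return calloc (n_ptrs, sizeof (void *));
}

/* Runtime side: try the optional hook first, otherwise fall back to a
   device allocation and tell the session about it.  */
static bool
sketch_setup_tvt (struct sketch_session *s, sketch_alloc_tvt_fn hook,
                  size_t n_ptrs)
{
  assert (n_ptrs > 0);
  void **t = hook ? hook (s, n_ptrs) : NULL;
  if (t)
    {
      s->tvt = t;
      s->tvt_in_host = true;
      return true;
    }
  t = sketch_device_alloc (n_ptrs);
  if (!t)
    return false;
  s->tvt = t;           /* Analogous to "set target var table".  */
  s->tvt_in_host = false;
  return true;
}
```

Passing sketch_plugin_alloc_tvt as the hook takes the host path, and
passing NULL takes the fallback path; in both cases the caller frees
s->tvt when the launch is done.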
-- 
Arsen Arsenović
