On 05/05/2026 14:14, Arsen Arsenović wrote:
In my examination of BabelStream results on AMD GCN, I've found that,
for each BabelStream kernel execution, we spend significant time
allocating and initializing memory in gomp_map_vars (~55µs, whereas the
actual BabelStream code executes in ~746µs, meaning this alone inflates
the time BabelStream measures by ~7%).
Upon further examination, I've found that the only reason gomp_map_vars
decides to allocate and map any memory in the first place is because it
is constructing the table of pointers to variables on the target, which
I've taken to calling the "target variable table". Given that the GCN
plugin already must perform some memory allocation before starting up a
kernel, namely to allocate kernel arguments, it would be beneficial if
we could merge this allocation with the kernel arguments allocation.
In addition, since the kernel arguments live in host memory, populating
them can be performed using string functions, without any need to call
for expensive host2dev copies.
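To illustrate the idea of merging the two allocations, here is a minimal
sketch, not the patch's actual code. All demo_* names and the struct
layout are hypothetical, and calloc merely stands in for the plugin's
kernel argument allocator. It shows the table living in the same
host-visible block as the kernel arguments and being filled with memcpy:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: fixed kernel arguments followed by the target
   variable table in one host-visible allocation.  */
struct demo_kernargs
{
  uint64_t fixed_args[4];   /* Stand-in for the real kernel arguments.  */
  void *var_table[];        /* Target variable table.  */
};

/* Allocate one block for both parts.  calloc is an assumption standing
   in for the plugin's kernarg allocator.  */
static struct demo_kernargs *
demo_alloc_kernargs (size_t n_vars)
{
  return calloc (1, sizeof (struct demo_kernargs)
                    + n_vars * sizeof (void *));
}

/* Fill the table with a plain memcpy; no host-to-device copy is needed
   because the block lives in host memory.  */
static void
demo_fill_var_table (struct demo_kernargs *ka, void *const *addrs,
                     size_t n_vars)
{
  memcpy (ka->var_table, addrs, n_vars * sizeof (void *));
}
```

The point of the single allocation is that the per-kernel cost reduces to
one allocator call plus a memcpy.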
This patch introduces an opaque type for "offload sessions". This type
is defined by each plugin and allows it to store data related to a
single offload job. The sessions are allocated and managed by libgomp,
and initialized and utilized by the plugin. Their lifetime starts with
a call to GOMP_OFFLOAD_session_start, and ends with
GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.
The patch then uses this framework to make management of the target
variable table more flexible: the plugin may elect to implement
GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
to attempt to allocate the target variable table in host memory.
Even though this patch is part of a "GCN" patch series, I think this
patch needs review by a libgomp maintainer (CC'd).
However, see my comment below.
If it fails, or if the plugin does not provide this function, libgomp
will perform this allocation as it does today - in target memory - and
tell the session about it using
GOMP_OFFLOAD_session_set_target_var_table.
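The fallback described above can be sketched roughly as follows. This is
a simplified illustration under stated assumptions: the demo_* types and
hook signatures are hypothetical and need not match the real
GOMP_OFFLOAD_* prototypes, and malloc stands in for a target-memory
allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified session: only records the table address.  */
struct demo_session
{
  void *var_table;
};

/* Hypothetical hook table.  The allocate hook is optional (may be NULL)
   and may itself return NULL on failure.  */
struct demo_plugin
{
  void *(*allocate_target_var_table) (struct demo_session *, size_t);
  void (*set_target_var_table) (struct demo_session *, void *);
};

/* Example allocate hook that always fails, to exercise the fallback.  */
static void *
demo_failing_allocate (struct demo_session *session, size_t size)
{
  (void) session;
  (void) size;
  return NULL;
}

/* Example set hook: remember the table libgomp allocated.  */
static void
demo_set_table (struct demo_session *session, void *table)
{
  session->var_table = table;
}

/* Ask the plugin first; fall back to our own allocation otherwise and
   tell the session about it.  */
static void *
demo_get_var_table (const struct demo_plugin *plugin,
                    struct demo_session *session, size_t size)
{
  if (plugin->allocate_target_var_table)
    {
      void *table = plugin->allocate_target_var_table (session, size);
      if (table)
        return table;
    }

  /* Assumption: malloc stands in for the existing target allocation.  */
  void *table = malloc (size);
  if (table)
    plugin->set_target_var_table (session, table);
  return table;
}
```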
In the case of AMD GCN, upon a call to
GOMP_OFFLOAD_session_allocate_target_var_table, the plugin will
immediately allocate kernel arguments with enough space for the target
variable table, no matter what size libgomp asks for[1], and return
that pointer to libgomp.
This results in the runtime of gomp_map_vars effectively disappearing
from traces.
[1] It may be beneficial to cap this at some fixed amount, so that the
future allocation cache has a higher hit rate. It may also depend on
whether hsa_memory_allocate for kernel arguments takes time
proportional to the number of bytes it needs to allocate.
include/ChangeLog:
+/* Get new kernargs for SESSION such that it can store TABLE_SIZE char units of
+ target variable table, reusing cached kernargs allocations, if possible. */
+
+static inline struct kernargs *
+allocate_session_kernargs (struct gomp_offload_session *session,
+ size_t table_size)
+{
+ GCN_DEBUG ("Session %p asked for allocation of kernargs+%zu...\n", session,
+ table_size);
+ struct agent_info *agent = session->agent;
+ assert (!session->kernarg_cache_node);
+
+ /* To increase chance of cache hit, round up size of the target variable
+ table to a multiple of (64*sizeof(void*)), and ensure that this size is
+ nonzero. */
+ if (!table_size)
+ table_size++;
+
+ {
+ constexpr size_t rounding_factor = 64 * sizeof (void*);
+ table_size += rounding_factor - 1;
+ table_size = (table_size / rounding_factor) * table_size;
+ }
This looks wrong. You probably meant to multiply by rounding_factor,
not table_size, there.
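For reference, a minimal sketch of the rounding I assume the comment
describes (demo_round_up is a hypothetical helper name, not something in
the patch; the factor is passed in rather than fixed at
64 * sizeof (void *)):

```c
#include <stddef.h>

/* Round SIZE up to the next multiple of FACTOR, treating zero as one so
   that the result is never zero.  FACTOR must be nonzero.  */
static inline size_t
demo_round_up (size_t size, size_t factor)
{
  if (!size)
    size++;
  size += factor - 1;
  /* Divide and multiply by the same factor to drop the remainder.  */
  return (size / factor) * factor;
}
```

With a factor of 512, sizes 0, 1 and 512 all round to 512, and 513
rounds to 1024.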
Andrew