Now pushed as r16-3925-gda5803c794d16d.
There is some follow-up cleanup needed to at least free the memory on
the host, but possibly also to handle multiple image loads/unloads.
See PRs for detail: PR114445, PR119857 – and PR114690.
I will close PRs and update the description, such that likely only the
third one will track this.
Tobias
Tobias Burnus wrote:
Minor update (as attached): Fixed two bugs in plugin-gcn.c.
The patch has now also been tested with AMD GPUs (MI300) offloading
and on a PowerPC (powerpc64le) system with offloading
to a Volta GPU.
Tobias Burnus wrote:
The main motivation for this patch is that on MI300,
libgomp.c-c++-common/declare-target-indirect-2.c always fails
because of the known fragile/racy code - as noticed before
(PR114445 and PR119857), albeit triggering the fail has become
more reliably.
'indirect' means that such-tagged functions are added to a
lookup table on the device - and when invoking a function
through a function pointer, a lookup is done. If matched,
the looked-up address is used - otherwise, the original one.
That permits to call the device function associated with a
host function.
Currently, the check whether the hash table has to be build
happens every time in a target region - but only for the first
team. However, that's not multi-team safe: Either by launching
multiple teams in one kernel (+ doing a lookup early in
a team != first one) – or by launching one with nowait, quickly
followed by another one.
The new approach is to instead create the hash table on the host
during GOMP_OFFLOAD_load_image - which the attached patch
implements. (Also on the pro side, it also avoids one
device-side allocation call and uses a bit less memory on
the device.)
The current patch is backward compatible (both ways); I would
like to remove the fallback on the device side (i.e. binary with
new libgomp on the device side running with older host libgomp.so),
but I guess we can't do this, can we?
RFC — Any comments, remarks or concerns?
* * *
Notes:
* The patch has only been tested on x86-64-gnu-linux with
an Nvidia GPU. Testing with an AMD GPU and offloading
from PowerPC or Aarch64 (Grace) to an Nvidia GPU still
needs to be done.
* Currently, the hash is created on the host, copied to the
device and - done. There is for sure a memory leak of the
hash table itself on the host as the pointer to it is
neither freed nor stored somewhere.
[And related:]
* This patch also does not address PR114690 - namely,
loading multiple images - or unloading images.
Loading multiple kernels is fine if at most one has
an indirect function table. (Same as before.)
Technically, replacing the hash table by a new one
with additional entries - or just adding more if there
is still enough space - relatively simple. (But requires,
see above, storing some data on the host - at least the
pointer to the current host hash table.)
It is less clear to me how unloading an image is supposed
to be handled - both one image (but not all) and all. In
most programs, there is only one image - and in particular,
images only get unloaded at the very end. - Thus, just
leaving the mess on the device is fine - most of the time.
The question is how to deal with the corner cases?
Knowing the addresses, one can remove all indirect entries
one by one (or all, if known that there is only one image
that added it) - but doing so takes a while and, if not
required, I really would like to avoid it!
Currently, nearly all code will have no indirect functions.
With OpenMP 5.2 support for virtual functions in C++, I think
the number of code which will have them will increase, but
still it is expected that there is only one image - and not
multiple.
Thoughts how to handle this part best? Or general comments?