(I have not fully thought about the 2/6, 3/6 and 4/6 patches, but I think, except for some patch-apply issues, 1/6 + this 5/6 can both be committed without needing 2-4.)
On 23.08.23 16:14, Andrew Stubbs wrote:
> Use Cuda to pin memory, instead of Linux mlock, when available. There are
> two advantages: firstly, this gives a significant speed boost for NVPTX
> offloading, and secondly, it side-steps the usual OS ulimit/rlimit setting.
I think both the ulimit issue for the 1/6 patch and the non-issue for this variant should/could be mentioned in libgomp.texi (in the Memory Management section and in the nvptx section, respectively).
> The design adds a device independent plugin API for allocating pinned
> memory, and then implements it for NVPTX. At present, the other supported
> devices do not have equivalent capabilities (or requirements).
Note beforehand: Starting with TR11 alias OpenMP 6.0, OpenMP supports handling multiple devices for allocation. It seems as if, after using:

  my_memspace = omp_get_device_and_host_memspace (5, omp_default_mem_space);
  my_alloc = omp_init_allocator (my_memspace, my_traits_with_pinning);

the pinning should be done via device '5', if possible.

* * *

However, I believe that it shouldn't really matter for now, given that CUDA has no special handling of the NUMA hierarchy on the host nor for specific devices, and GCN has none. It only becomes interesting if mmap/mlock'ed memory is (measurably) faster than CUDA-allocated memory when accessed from the host or, for USM, from GCN.
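* * *

For contrast, with today's OpenMP 5.x interface - which is all this patch series can rely on - the request is device-agnostic, via the 'pinned' allocator trait. A minimal sketch, using only standard 5.x API; with this patch series the allocation should end up in the new page-locked path:

  #include <omp.h>

  int
  main (void)
  {
    omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
    omp_allocator_handle_t al
      = omp_init_allocator (omp_default_mem_space, 1, traits);
    /* With a CUDA device available, this should use cuMemHostAlloc
       under the hood instead of mlock.  */
    double *p = (double *) omp_alloc (1024 * sizeof (double), al);
    omp_free (p, al);
    omp_destroy_allocator (al);
    return 0;
  }

* * *

Let's start with the patch itself: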
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> ...
> +static struct gomp_device_descr *
> +get_device_for_page_locked (void)
> +{
> +  gomp_debug (0, "%s\n",
> +              __FUNCTION__);
> +
> +  struct gomp_device_descr *device;
> +#ifdef HAVE_SYNC_BUILTINS
> +  device
> +    = __atomic_load_n (&device_for_page_locked, MEMMODEL_RELAXED);
> +  if (device == (void *) -1)
> +    {
> +      gomp_debug (0, "  init\n");
> +
> +      gomp_init_targets_once ();
> +
> +      device = NULL;
> +      for (int i = 0; i < num_devices; ++i)
Given that this function just sets a single variable based on whether the page_locked_host_alloc_func function pointer exists, wouldn't it be much simpler to do all this handling in gomp_target_init?
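Concretely, I mean something like the following - just a sketch, reusing the patch's names and folding in the "first device wins" rule proposed further below:

  /* At the end of gomp_target_init, once all plugins are loaded:  */
  device_for_page_locked = NULL;
  for (int i = 0; i < num_devices; ++i)
    if (devices[i].target_id == 0
        && devices[i].page_locked_host_alloc_func)
      {
        device_for_page_locked = &devices[i];
        break;
      }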
> +      for (int i = 0; i < num_devices; ++i)
> ...
> +          /* We consider only the first device of potentially several of the
> +             same type as this functionality is not specific to an individual
> +             offloading device, but instead relates to the host-side
> +             implementation of the respective offloading implementation.  */
> +          if (devices[i].target_id != 0)
> +            continue;
> +
> +          if (!devices[i].page_locked_host_alloc_func)
> +            continue;
> ...
> +          if (device)
> +            gomp_fatal ("Unclear how %s and %s libgomp plugins may"
> +                        " simultaneously provide functionality"
> +                        " for page-locked memory",
> +                        device->name, devices[i].name);
> +          else
> +            device = &devices[i];
I find this a bit inconsistent: if - let's say - GCN does not provide its own pinning, the code assumes that CUDA pinning is just fine. However, if both support it, CUDA pinning suddenly is not fine for GCN.

Additionally, all wording suggests that it does not matter for CUDA for which device's access we want to optimize the pinning. But the code above also fails if I have a system with two Nvidia cards. From the wording, it sounds as if just checking whether device->type differs would do.

All in all, I wonder whether it wouldn't be much simpler to state something like the following (where applicable):

  The first device that provides pinning support is used; the assumption is
  that all other devices and the host can access this memory without a
  measurable performance penalty compared to a normal page lock, and that
  neither multiple device types nor host/device NUMA-aware pinning support
  in the plugin is available.

  NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
  device-set memory spaces, this might need to be revisited.

(The NOTE is only meant for the *.c code / this review; the first sentence, up to the ';', should also go in some form into libgomp.texi.)

And document in libgomp.texi that the first device for which device-specific host-memory pinning support is available is used - to be added in
https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

The Nvidia-specific part - i.e. that it is supported, and possibly more details - can then be added to:
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html

I think the @ref to 'Offload-Target Specifics', added for the third time by this patch series, will be sufficient - if not, add some referring words as well.

* * *
> +gomp_page_locked_host_alloc (void **ptr, size_t size)
> +{
> +  gomp_debug (0, "%s: ptr=%p, size=%llu\n",
> +              __FUNCTION__, ptr, (unsigned long long) size);
> +
> +  struct gomp_device_descr *device = get_device_for_page_locked ();
With the proposed changes above, we could just read device_for_page_locked directly, i.e. access the global variable non-atomically.

BTW: I think the global variable needs a _mem(ory) suffix; I find device_for_page_locked somewhat unclear. I also wonder whether ..._pinned_mem(ory) is clearer than ..._page_locked_mem(ory), but I think either works.

* * *

I was wondering whether there should be a very noisy failure (error printed + gomp_fatal by the caller) or not:
> +++ b/libgomp/plugin/plugin-nvptx.c
> +GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
> ...
> +  r = CUDA_CALL_NOCHECK (cuMemHostAlloc, ptr, size, flags);
> +  if (r == CUDA_ERROR_OUT_OF_MEMORY)
> +    *ptr = NULL;
> +  else if (r != CUDA_SUCCESS)
> +    {
> +      GOMP_PLUGIN_error ("cuMemHostAlloc error: %s", cuda_error (r));
The possible return values are:

  CUDA_SUCCESS
  CUDA_ERROR_DEINITIALIZED
  CUDA_ERROR_NOT_INITIALIZED
  CUDA_ERROR_INVALID_CONTEXT
  CUDA_ERROR_INVALID_VALUE
  CUDA_ERROR_OUT_OF_MEMORY

=> I think the current handling is okay - if and only if 'size == 0' is handled gracefully by returning a NULL pointer (either by this function or by all its callers). For omp_alloc, OpenMP states: "If size is 0, omp_alloc and omp_aligned_alloc will return NULL." (I have the feeling that CUDA might return CUDA_ERROR_INVALID_VALUE in that case.)
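I.e. a guard along the following lines in the plugin would do. (Just a sketch; I am assuming the function returns true on success, as the quoted error path suggests.)

  bool
  GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
  {
    if (size == 0)
      {
        /* omp_alloc must return NULL for size == 0; avoid cuMemHostAlloc,
           which presumably fails with CUDA_ERROR_INVALID_VALUE.  */
        *ptr = NULL;
        return true;
      }

    unsigned int flags = 0;  /* The patch's actual flags are elided above.  */
    CUresult r = CUDA_CALL_NOCHECK (cuMemHostAlloc, ptr, size, flags);
    if (r == CUDA_ERROR_OUT_OF_MEMORY)
      *ptr = NULL;  /* NULL tells the caller to use a fallback.  */
    else if (r != CUDA_SUCCESS)
      {
        GOMP_PLUGIN_error ("cuMemHostAlloc error: %s", cuda_error (r));
        return false;
      }
    return true;
  }

* * *

Side remark: omp_alloc (my_alloc, 0) should return NULL - and not use any fallback. I am pretty sure that is explicitly handled in omp_alloc. I think omp_alloc is also called in all relevant cases for other code paths, except for omp_realloc, which has/needs some extra handling; but I think that's done in the generic code in libgomp/allocator.c.

* * *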
> --- a/libgomp/config/linux/allocator.c
> +++ b/libgomp/config/linux/allocator.c
> ...
> -linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
> +linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin,
> +                      bool init0)
>  {
> ...
>    else
> -    return malloc (size);
> +    addr = malloc (size);
> +
> +  if (addr && init0)
> +    {
> +      gomp_debug (0, "  init0\n");
> +      memset (addr, 0, size);
> +    }
This code looks very puzzling - I think it is best to create an auxiliary function, e.g. linux_memspace_alloc_pinned, and call that from both linux_memspace_alloc and linux_memspace_calloc.

That avoids the 'if (pin)' in the aux function and the odd case that (for 'if (pin)') calloc calls into a function that at the end might (for '!pin') call malloc instead of calloc - followed by an init0.
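Roughly like the following sketch; the hypothetical helper keeps the details of the pinned path (mmap/mlock from 1/6 and/or gomp_page_locked_host_alloc from this patch) out of the callers:

  static void *
  linux_memspace_alloc_pinned (omp_memspace_handle_t memspace, size_t size,
                               bool init0)
  {
    void *addr = NULL;
    /* ... set addr via mmap/mlock (1/6) or gomp_page_locked_host_alloc
       (this patch), as before ...  */
    if (addr && init0)
      /* Needed as the pinned path, unlike calloc, might not zero.  */
      memset (addr, 0, size);
    return addr;
  }

  static void *
  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
  {
    return pin ? linux_memspace_alloc_pinned (memspace, size, false)
               : malloc (size);
  }

  static void *
  linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
  {
    return pin ? linux_memspace_alloc_pinned (memspace, size, true)
               : calloc (1, size);
  }

* * *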
> --- a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
> +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
> +/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
I think this is sufficiently fine. But for completeness: I think this is not 100% reliable. The check is for the default device, which is by default the first available device; for the tested-for feature, it is sufficient if *any* available device is an nvptx device.

- If you have a GCN and an nvptx GPU and the GCN one comes first, CUDA's allocator will be used - but OFFLOAD_DEVICE_NVPTX is unset.
- Likewise for GCN + nvptx and some OMP_DEFAULT_DEVICE=... trickery.

I think a host with both nvptx + GCN is rare - and if it becomes an issue, we just need to extend the effective-target checks.

* * *
> @@ -67,10 +69,15 @@ verify0 (char *p, size_t s)
>  int
>  main ()
>  {
> +#ifdef OFFLOAD_DEVICE_NVPTX
> +  /* Go big or go home.  */
> +  const int SIZE = 40 * 1024 * 1024;
> +#else
I wonder whether it makes sense to add a comment that the 'ulimit' does not affect nvptx's pinned-memory allocator - just for completeness and as a service to the test-case reader.

Otherwise, I have not spotted anything.

Tobias

PS: Next comes "[PATCH v2 6/6] libgomp: fine-grained pinned memory allocator" to complete the library-only bit of this patch series.