(I have not fully thought about the 2/6, 3/6 and 4/6 patches, but I think, except for some patch-apply issues, 1/6 + this 5/6 can both be committed without needing 2-4.)
On 23.08.23 16:14, Andrew Stubbs wrote:
> Use Cuda to pin memory, instead of Linux mlock, when available. There are
> two advantages: firstly, this gives a significant speed boost for NVPTX
> offloading, and secondly, it side-steps the usual OS ulimit/rlimit setting.
I think both the ulimit issue for the 1/6 patch and the non-issue for this variant should/could be mentioned in libgomp.texi (in the Memory Management section and in the nvptx section, respectively).
> The design adds a device independent plugin API for allocating pinned
> memory, and then implements it for NVPTX. At present, the other supported
> devices do not have equivalent capabilities (or requirements).
Note beforehand: Starting with TR11 alias OpenMP 6.0, OpenMP supports handling multiple devices for allocation. It seems as if, after using:

  my_memspace = omp_get_device_and_host_memspace (5, omp_default_mem_space);
  my_alloc = omp_init_allocator (my_memspace, my_traits_with_pinning);

the pinning should be done via device '5', if possible.

* * *

However, I believe that it shouldn't really matter for now, given that CUDA has no special handling of the NUMA hierarchy on the host nor for specific devices, and GCN has none. It only becomes interesting if mmap/mlock'ed memory is (measurably) faster than CUDA-allocated memory when accessed from the host or, for USM, from GCN.
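* * *

For contrast, with today's OpenMP 5.x interface - which is all this patch series can rely on - the request is device-agnostic, via the 'pinned' allocator trait. A minimal sketch, using only standard 5.x API; with this patch series the allocation should end up in the new page-locked path:

  #include <omp.h>

  int
  main (void)
  {
    omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
    omp_allocator_handle_t al
      = omp_init_allocator (omp_default_mem_space, 1, traits);
    /* With a CUDA device available, this should use cuMemHostAlloc
       under the hood instead of mlock.  */
    double *p = (double *) omp_alloc (1024 * sizeof (double), al);
    omp_free (p, al);
    omp_destroy_allocator (al);
    return 0;
  }

* * *

Let's start with the patch itself: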
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> ...
> +static struct gomp_device_descr *
> +get_device_for_page_locked (void)
> +{
> +  gomp_debug (0, "%s\n",
> +              __FUNCTION__);
> +
> +  struct gomp_device_descr *device;
> +#ifdef HAVE_SYNC_BUILTINS
> +  device
> +    = __atomic_load_n (&device_for_page_locked, MEMMODEL_RELAXED);
> +  if (device == (void *) -1)
> +    {
> +      gomp_debug (0, "  init\n");
> +
> +      gomp_init_targets_once ();
> +
> +      device = NULL;
> +      for (int i = 0; i < num_devices; ++i)
Given that this function just sets a single variable based on whether the page_locked_host_alloc_func function pointer exists, wouldn't it be much simpler to do all this handling in gomp_target_init?
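Concretely, I mean something like the following - just a sketch, reusing the patch's names and folding in the "first device wins" rule proposed further below:

  /* At the end of gomp_target_init, once all plugins are loaded:  */
  device_for_page_locked = NULL;
  for (int i = 0; i < num_devices; ++i)
    if (devices[i].target_id == 0
        && devices[i].page_locked_host_alloc_func)
      {
        device_for_page_locked = &devices[i];
        break;
      }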
> +      for (int i = 0; i < num_devices; ++i)
> ...
> +          /* We consider only the first device of potentially several of the
> +             same type as this functionality is not specific to an individual
> +             offloading device, but instead relates to the host-side
> +             implementation of the respective offloading implementation.  */
> +          if (devices[i].target_id != 0)
> +            continue;
> +
> +          if (!devices[i].page_locked_host_alloc_func)
> +            continue;
> ...
> +          if (device)
> +            gomp_fatal ("Unclear how %s and %s libgomp plugins may"
> +                        " simultaneously provide functionality"
> +                        " for page-locked memory",
> +                        device->name, devices[i].name);
> +          else
> +            device = &devices[i];
I find this a bit inconsistent: if - let's say - GCN does not provide its own pinning, the code assumes that CUDA pinning is just fine. However, if both support it, CUDA pinning suddenly is not fine for GCN.

Additionally, all wording suggests that it does not matter for CUDA for which device's access we want to optimize the pinning. But the code above also fails if I have a system with two Nvidia cards. From the wording, it sounds as if just checking whether device->type differs would do.

All in all, I wonder whether it wouldn't be much simpler to state something like the following (where applicable):

  The first device that provides pinning support is used; the assumption is
  that all other devices and the host can access this memory without a
  measurable performance penalty compared to a normal page lock, and that
  neither multiple device types nor host/device NUMA-aware pinning support
  in the plugin is available.

  NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
  device-set memory spaces, this might need to be revisited.

(The NOTE is only meant for the *.c code / this review; the first sentence, up to the ';', should also go in some form into libgomp.texi.)

And document in libgomp.texi that the first device for which device-specific host-memory pinning support is available is used - to be added in
https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

The Nvidia-specific part - i.e. that it is supported, and possibly more details - can then be added to:
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html

I think the @ref to 'Offload-Target Specifics', added for the third time by this patch series, will be sufficient - if not, add some referring words as well.

* * *
> +gomp_page_locked_host_alloc (void **ptr, size_t size)
> +{
> +  gomp_debug (0, "%s: ptr=%p, size=%llu\n",
> +              __FUNCTION__, ptr, (unsigned long long) size);
> +
> +  struct gomp_device_descr *device = get_device_for_page_locked ();
With the proposed changes above, we could just read device_for_page_locked directly, i.e. access the global variable non-atomically.

BTW: I think the global variable needs a _mem(ory) suffix; I find device_for_page_locked somewhat unclear. I also wonder whether ..._pinned_mem(ory) is clearer than ..._page_locked_mem(ory), but I think either works.

* * *

I was wondering whether there should be a very noisy failure (error printed + gomp_fatal by the caller) or not:
> +++ b/libgomp/plugin/plugin-nvptx.c
> +GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
> ...
> +  r = CUDA_CALL_NOCHECK (cuMemHostAlloc, ptr, size, flags);
> +  if (r == CUDA_ERROR_OUT_OF_MEMORY)
> +    *ptr = NULL;
> +  else if (r != CUDA_SUCCESS)
> +    {
> +      GOMP_PLUGIN_error ("cuMemHostAlloc error: %s", cuda_error (r));
The possible return values are:

  CUDA_SUCCESS
  CUDA_ERROR_DEINITIALIZED
  CUDA_ERROR_NOT_INITIALIZED
  CUDA_ERROR_INVALID_CONTEXT
  CUDA_ERROR_INVALID_VALUE
  CUDA_ERROR_OUT_OF_MEMORY

=> I think the current handling is okay - if and only if 'size == 0' is handled gracefully by returning a NULL pointer (either by this function or by all its callers). For omp_alloc, OpenMP states: "If size is 0, omp_alloc and omp_aligned_alloc will return NULL." (I have the feeling that CUDA might return CUDA_ERROR_INVALID_VALUE in that case.)
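I.e. a guard along the following lines in the plugin would do. (Just a sketch; I am assuming the function returns true on success, as the quoted error path suggests.)

  bool
  GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
  {
    if (size == 0)
      {
        /* omp_alloc must return NULL for size == 0; avoid cuMemHostAlloc,
           which presumably fails with CUDA_ERROR_INVALID_VALUE.  */
        *ptr = NULL;
        return true;
      }

    unsigned int flags = 0;  /* The patch's actual flags are elided above.  */
    CUresult r = CUDA_CALL_NOCHECK (cuMemHostAlloc, ptr, size, flags);
    if (r == CUDA_ERROR_OUT_OF_MEMORY)
      *ptr = NULL;  /* NULL tells the caller to use a fallback.  */
    else if (r != CUDA_SUCCESS)
      {
        GOMP_PLUGIN_error ("cuMemHostAlloc error: %s", cuda_error (r));
        return false;
      }
    return true;
  }

* * *

Side remark: omp_alloc (my_alloc, 0) should return NULL - and not use any fallback. I am pretty sure that is explicitly handled in omp_alloc. I think omp_alloc is also called in all relevant cases for other code paths, except for omp_realloc, which has/needs some extra handling; but I think that's done in the generic code in libgomp/allocator.c.

* * *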
> --- a/libgomp/config/linux/allocator.c
> +++ b/libgomp/config/linux/allocator.c
> ...
> -linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
> +linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin,
> +                      bool init0)
>  {
> ...
>    else
> -    return malloc (size);
> +    addr = malloc (size);
> +
> +  if (addr && init0)
> +    {
> +      gomp_debug (0, "  init0\n");
> +      memset (addr, 0, size);
> +    }
This code looks very puzzling - I think it is best to create an auxiliary function, e.g. linux_memspace_alloc_pinned, and call that from both linux_memspace_alloc and linux_memspace_calloc.

That avoids the 'if (pin)' in the aux function and the odd case that (for 'if (pin)') calloc calls into a function that at the end might (for '!pin') call malloc instead of calloc - followed by an init0.
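Roughly like the following sketch; the hypothetical helper keeps the details of the pinned path (mmap/mlock from 1/6 and/or gomp_page_locked_host_alloc from this patch) out of the callers:

  static void *
  linux_memspace_alloc_pinned (omp_memspace_handle_t memspace, size_t size,
                               bool init0)
  {
    void *addr = NULL;
    /* ... set addr via mmap/mlock (1/6) or gomp_page_locked_host_alloc
       (this patch), as before ...  */
    if (addr && init0)
      /* Needed as the pinned path, unlike calloc, might not zero.  */
      memset (addr, 0, size);
    return addr;
  }

  static void *
  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
  {
    return pin ? linux_memspace_alloc_pinned (memspace, size, false)
               : malloc (size);
  }

  static void *
  linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
  {
    return pin ? linux_memspace_alloc_pinned (memspace, size, true)
               : calloc (1, size);
  }

* * *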
> --- a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
> +++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
> +/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
I think this is sufficiently fine. But for completeness: I think this is not 100% reliable. The check is for the default device, which is by default the first available device; for the tested-for feature, it is sufficient if *any* available device is an nvptx device.

- If you have a GCN and an nvptx GPU and the GCN one comes first, CUDA's allocator will be used - but OFFLOAD_DEVICE_NVPTX is unset.
- Likewise for GCN + nvptx and some OMP_DEFAULT_DEVICE=... trickery.

I think a host with both nvptx + GCN is rare - and if it becomes an issue, we just need to extend the effective-target checks.

* * *
> @@ -67,10 +69,15 @@ verify0 (char *p, size_t s)
>  int
>  main ()
>  {
> +#ifdef OFFLOAD_DEVICE_NVPTX
> +  /* Go big or go home.  */
> +  const int SIZE = 40 * 1024 * 1024;
> +#else
I wonder whether it makes sense to add a comment that the 'ulimit' does not affect nvptx's pinned-memory allocator - just for completeness and as a service to the test-case reader.

Otherwise, I have not spotted anything.

Tobias

PS: Next comes "[PATCH v2 6/6] libgomp: fine-grained pinned memory allocator" to complete the library-only bit of this patch series.