Hi Tom, this patch adjusts nvptx_free() in libgomp/plugin/plugin-nvptx.c to avoid a "GOMP_PLUGIN_acc_thread() == NULL" check that was causing problems under OpenMP offloading.
This check was originally used to determine if nvptx_free() was running under CUDA callback context, when freeing resources from an OpenACC asynchronous compute region. Since CUDA API calls are not allowed inside callback context, we have to save the freed block to ptx_dev->free_blocks, and cuMemFree it later. The check to see if GOMP_PLUGIN_acc_thread() exists to determine normal host thread vs. callback thread worked under -fopenacc, but since the OpenACC per-thread data is not created under -fopenmp, and always returned NULL, we have a leak situation where OpenMP offloading kept accumulating freed device memory blocks until failing; nvptx_free() never reaches the part where cuMemFree() is actually called. I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is returned for such CUDA calls inside callback context, and it appears to be enough to replace the current check, so the new code sees if this error is returned on the first cuMemGetAddressRange call to determine callback context. This should now work for both OpenACC/OpenMP. (Tobias, Catherine, the earlier internal patch to re-organize this callback context checking did not work in general, since OpenACC also uses the .queue_callback functionality to free the struct target_mem_desc asynchronously, so in general we have to ensure nvptx_free() could be used under both normal/callback context) This patch has been libgomp tested for x86_64-linux with nvptx offloading without regressions, and should be applied for mainline and GCC10. Is this okay? Thanks, Chung-Lin 2020-08-20 Chung-Lin Tang <clt...@codesourcery.com> libgomp/ * plugin/plugin-nvptx.c (nvptx_free): Change "GOMP_PLUGIN_acc_thread () == NULL" test into check of CUDA_ERROR_NOT_PERMITTED status for cuMemGetAddressRange. Adjust comments.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index ec103a2f40b..188a34f1d04 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -1038,27 +1038,34 @@ goacc_profiling_acc_ev_free (struct goacc_thread *thr, void *p) } static bool nvptx_free (void *p, struct ptx_device *ptx_dev) { - /* Assume callback context if this is null. */ - if (GOMP_PLUGIN_acc_thread () == NULL) + CUdeviceptr pb; + size_t ps; + + CUresult r = CUDA_CALL_NOCHECK (cuMemGetAddressRange, &pb, &ps, + (CUdeviceptr) p); + if (r == CUDA_ERROR_NOT_PERMITTED) { + /* We assume that this error indicates we are in a CUDA callback context, + where all CUDA calls are not allowed. Arrange to free this piece of + device memory later. */ struct ptx_free_block *n = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block)); n->ptr = p; pthread_mutex_lock (&ptx_dev->free_blocks_lock); n->next = ptx_dev->free_blocks; ptx_dev->free_blocks = n; pthread_mutex_unlock (&ptx_dev->free_blocks_lock); return true; } - - CUdeviceptr pb; - size_t ps; - - CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) p); + else if (r != CUDA_SUCCESS) + { + GOMP_PLUGIN_error ("cuMemGetAddressRange error: %s", cuda_error (r)); + return false; + } if ((CUdeviceptr) p != pb) { GOMP_PLUGIN_error ("invalid device address"); return false; }