Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2021-01-05 Thread Jakub Jelinek via Gcc-patches
On Tue, Jan 05, 2021 at 12:13:59PM +, Julian Brown wrote:
> Just to check, does my reply below address your concerns --
> particularly with regards to the current usage of CUDA streams
> serializing kernel executions from different host threads? Given that
> situation, and the observed speed improvement with OpenMP offloading to
> NVPTX with the patch, I'm not sure how much sense it makes to do
> anything more sophisticated than this -- especially without a test case
> that demonstrates a performance regression (or an exacerbated
> out-of-memory condition) with the patch.

I guess I can live with it for GCC 11, but would like this to be
reconsidered for GCC 12; people do run OpenMP offloading code from multiple,
often concurrent, threads and we shouldn't serialize it unnecessarily.

Jakub



Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2021-01-05 Thread Julian Brown
Hi Jakub,

Just to check, does my reply below address your concerns --
particularly with regards to the current usage of CUDA streams
serializing kernel executions from different host threads? Given that
situation, and the observed speed improvement with OpenMP offloading to
NVPTX with the patch, I'm not sure how much sense it makes to do
anything more sophisticated than this -- especially without a test case
that demonstrates a performance regression (or an exacerbated
out-of-memory condition) with the patch.

Thanks,

Julian

On Tue, 15 Dec 2020 23:16:48 +
Julian Brown  wrote:

> On Tue, 15 Dec 2020 18:00:36 +0100
> Jakub Jelinek  wrote:
> 
> > On Tue, Dec 15, 2020 at 04:49:38PM +, Julian Brown wrote:  
> > > > Do you need to hold the omp_stacks.lock across the entire
> > > > offloading? Doesn't that serialize all offloading kernels to the
> > > > same device? I mean, can't the lock be taken just shortly at the
> > > > start to either acquire the cached stacks or allocate a fresh
> > > > stack, and then at the end to put the stack back into the
> > > > cache?
> > > 
> > > I think you're suggesting something like what Alexander mentioned
> > > -- a pool of cached stacks blocks in case the single, locked block
> > > is contested. Obviously at present kernel launches are serialised
> > > on the target anyway, so it's a question of whether having the
> > > device wait for the host to unlock the stacks block (i.e. a
> > > context switch, FSVO context switch), or allocating a new stacks
> > > block, is quicker. I think the numbers posted in the parent email
> > > show that memory allocation is so slow that just waiting for the
> > > lock wins. I'm wary of adding unnecessary complication,
> > > especially if it'll only be exercised in already hard-to-debug
> > > cases (i.e. lots of threads)!
> > 
> > I'm not suggesting to have multiple stacks, on the contrary.  I've
> > suggested to do the caching only if at most one host thread is
> > offloading to the device.
> > 
> > If one uses
> > #pragma omp parallel num_threads(3)
> > {
> >   #pragma omp target
> >   ...
> > }
> > then I don't see what would previously prevent the concurrent
> > offloading, yes, we take the device lock during gomp_map_vars and
> > again during gomp_unmap_vars, but don't hold it across the
> > offloading in between.  
> 
> I still don't think I quite understand what you're getting at.
> 
> We only implement synchronous launches for OpenMP on NVPTX at present,
> and those all use the default CUDA runtime driver stream. Only one
> kernel executes on the hardware at once, even if launched from
> different host threads. The serialisation isn't due to the device lock
> being held, but by the queueing semantics of the underlying API.
> 
> > > Does target-side memory allocation call back into the plugin's
> > > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not,
> > > target-side memory allocation shouldn't be affected, I don't
> > > think?
> > 
> > Again, I'm not suggesting that it should, but what I'm saying is
> > that if a target region ends while some other host tasks are doing
> > target regions to the same device concurrently, or if there are
> > async targets in flight, we shouldn't try to cache the stack but
> > free it right away, because the other target regions might need to
> > malloc larger amounts of memory and could fail because of the
> > caching.  
> 
> I'm assuming you're not suggesting fundamentally changing APIs or
> anything to determine if we're launching target regions from multiple
> threads at once, but instead that we try to detect the condition
> dynamically in the plugin?
> 
> So, would kernel launch look something like this? (Excuse
> pseudo-code-isms!)
> 
> void GOMP_OFFLOAD_run (...)
> {
>   bool used_cache;
> 
>   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>   if (ptx_dev->omp_stacks.usage_count > 0)
>   {
>     cuCtxSynchronize ();
>     nvptx_stacks_free (ptx_dev);
>     ...allocate fresh stack, no caching...
>     used_cache = false;
>   }
>   else
>   {
>     /* Allocate or re-use cached stacks, and then... */
>     ptx_dev->omp_stacks.usage_count++;
>     used_cache = true;
>   }
>   pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> 
>   /* Launch kernel */
> 
>   if (used_cache) {
>     cuStreamAddCallback (/* ...registering a callback that does: */
>       pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>       ptx_dev->omp_stacks.usage_count--;
>       pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>     );
>   } else {
>     pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>     /* Free uncached stack */
>     pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>   }
> }
> 
> This seems like it'd be rather fragile to me, and would offer some
> benefit perhaps only if a previous cached stacks block was much larger
> than the one required for some given later launch. It wouldn't allow
> any additional parallelism on the target I don't think.
> 
> Is that sort-of what you meant?
> 
> Oh, or perhaps something more like checking cuStreamQuery at 

Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-15 Thread Julian Brown
On Tue, 15 Dec 2020 18:00:36 +0100
Jakub Jelinek  wrote:

> On Tue, Dec 15, 2020 at 04:49:38PM +, Julian Brown wrote:
> > > Do you need to hold the omp_stacks.lock across the entire
> > > offloading? Doesn't that serialize all offloading kernels to the
> > > same device? I mean, can't the lock be taken just shortly at the
> > > start to either acquire the cached stacks or allocate a fresh
> > > stack, and then at the end to put the stack back into the cache?  
> > 
> > I think you're suggesting something like what Alexander mentioned
> > -- a pool of cached stacks blocks in case the single, locked block
> > is contested. Obviously at present kernel launches are serialised
> > on the target anyway, so it's a question of whether having the
> > device wait for the host to unlock the stacks block (i.e. a context
> > switch, FSVO context switch), or allocating a new stacks block, is
> > quicker. I think the numbers posted in the parent email show that
> > memory allocation is so slow that just waiting for the lock wins.
> > I'm wary of adding unnecessary complication, especially if it'll
> > only be exercised in already hard-to-debug cases (i.e. lots of
> > threads)!  
> 
> I'm not suggesting to have multiple stacks, on the contrary.  I've
> suggested to do the caching only if at most one host thread is
> offloading to the device.
> 
> If one uses
> #pragma omp parallel num_threads(3)
> {
>   #pragma omp target
>   ...
> }
> then I don't see what would previously prevent the concurrent
> offloading, yes, we take the device lock during gomp_map_vars and
> again during gomp_unmap_vars, but don't hold it across the offloading
> in between.

I still don't think I quite understand what you're getting at.

We only implement synchronous launches for OpenMP on NVPTX at present,
and those all use the default CUDA runtime driver stream. Only one
kernel executes on the hardware at once, even if launched from
different host threads. The serialisation isn't due to the device lock
being held, but by the queueing semantics of the underlying API.
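
For reference, a synchronous launch boils down to roughly the following
(a simplified sketch, not the plugin code verbatim; it assumes <cuda.h>
and the plugin's usual cuda_error/GOMP_PLUGIN_fatal helpers):

static void
launch_and_wait (CUfunction fn, void **kernel_args, unsigned teams,
                 unsigned threads)
{
  /* NULL stream: all work submitted to it executes in submission order,
     regardless of which host thread does the submitting.  */
  CUresult r = cuLaunchKernel (fn, teams, 1, 1, threads, 1, 1,
                               0 /* sharedMemBytes */, NULL /* stream */,
                               kernel_args, NULL);
  if (r != CUDA_SUCCESS)
    GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));

  /* Synchronous target semantics: block until the kernel has finished.  */
  r = cuCtxSynchronize ();
  if (r != CUDA_SUCCESS)
    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
}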

> > Does target-side memory allocation call back into the plugin's
> > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> > memory allocation shouldn't be affected, I don't think?  
> 
> Again, I'm not suggesting that it should, but what I'm saying is that
> if a target region ends while some other host tasks are doing target
> regions to the same device concurrently, or if there are async targets
> in flight, we shouldn't try to cache the stack but free it right away,
> because the other target regions might need to malloc larger amounts
> of memory and could fail because of the caching.

I'm assuming you're not suggesting fundamentally changing APIs or
anything to determine if we're launching target regions from multiple
threads at once, but instead that we try to detect the condition
dynamically in the plugin?

So, would kernel launch look something like this? (Excuse
pseudo-code-isms!)

void GOMP_OFFLOAD_run (...)
{
  bool used_cache;

  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  if (ptx_dev->omp_stacks.usage_count > 0)
  {
    cuCtxSynchronize ();
    nvptx_stacks_free (ptx_dev);
    ...allocate fresh stack, no caching...
    used_cache = false;
  }
  else
  {
    /* Allocate or re-use cached stacks, and then... */
    ptx_dev->omp_stacks.usage_count++;
    used_cache = true;
  }
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);

  /* Launch kernel */

  if (used_cache) {
    cuStreamAddCallback (/* ...registering a callback that does: */
      pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
      ptx_dev->omp_stacks.usage_count--;
      pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
    );
  } else {
    pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
    /* Free uncached stack */
    pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
  }
}

This seems like it'd be rather fragile to me, and would offer some
benefit perhaps only if a previous cached stacks block was much larger
than the one required for some given later launch. It wouldn't allow
any additional parallelism on the target I don't think.

Is that sort-of what you meant?

Oh, or perhaps something more like checking cuStreamQuery at the end of
the kernel launch to see if more work (...from other threads) is
outstanding on the same queue? I think that only usefully returns
CUDA_SUCCESS/CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.
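
If it did help, I imagine it would look something like this (hypothetical
helper, not part of the posted patch; error checking on cuMemFree omitted),
called once the launch's cuCtxSynchronize has returned:

static void
maybe_drop_stacks_cache (struct ptx_device *ptx_dev)
{
  /* CUDA_SUCCESS means the default stream is idle; CUDA_ERROR_NOT_READY
     means some other host thread has already queued more work behind us.  */
  CUresult r = cuStreamQuery (NULL);

  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  if (r == CUDA_ERROR_NOT_READY && ptx_dev->omp_stacks.ptr)
    {
      /* Other offload regions are pending: give the memory back so their
         own allocations aren't squeezed by our cache.  */
      cuMemFree (ptx_dev->omp_stacks.ptr);
      ptx_dev->omp_stacks.ptr = 0;
      ptx_dev->omp_stacks.size = 0;
    }
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
}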

Thanks for clarification (& apologies for being slow!),

Julian


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-15 Thread Jakub Jelinek via Gcc-patches
On Tue, Dec 15, 2020 at 04:49:38PM +, Julian Brown wrote:
> > Do you need to hold the omp_stacks.lock across the entire offloading?
> > Doesn't that serialize all offloading kernels to the same device?
> > I mean, can't the lock be taken just shortly at the start to either
> > acquire the cached stacks or allocate a fresh stack, and then at the
> > end to put the stack back into the cache?
> 
> I think you're suggesting something like what Alexander mentioned -- a
> pool of cached stacks blocks in case the single, locked block is
> contested. Obviously at present kernel launches are serialised on the
> target anyway, so it's a question of whether having the device wait for
> the host to unlock the stacks block (i.e. a context switch, FSVO context
> switch), or allocating a new stacks block, is quicker. I think the
> numbers posted in the parent email show that memory allocation is so
> slow that just waiting for the lock wins. I'm wary of adding
> unnecessary complication, especially if it'll only be exercised in
> already hard-to-debug cases (i.e. lots of threads)!

I'm not suggesting to have multiple stacks, on the contrary.  I've suggested
to do the caching only if at most one host thread is offloading to the
device.

If one uses
#pragma omp parallel num_threads(3)
{
  #pragma omp target
  ...
}
then I don't see what would previously prevent the concurrent offloading,
yes, we take the device lock during gomp_map_vars and again during
gomp_unmap_vars, but don't hold it across the offloading in between.

> Does target-side memory allocation call back into the plugin's
> GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> memory allocation shouldn't be affected, I don't think?

Again, I'm not suggesting that it should, but what I'm saying is that
if a target region ends while some other host tasks are doing target regions to
the same device concurrently, or if there are async targets in flight,
we shouldn't try to cache the stack but free it right away, because
the other target regions might need to malloc larger amounts of memory
and could fail because of the caching.

Jakub



Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-15 Thread Julian Brown
On Tue, 15 Dec 2020 14:49:40 +0100
Jakub Jelinek  wrote:

> On Tue, Dec 15, 2020 at 01:39:13PM +, Julian Brown wrote:
> > @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void
> > *tgt_vars, void **args)
> >   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
> > 
> >   size_t stack_size = nvptx_stacks_size ();
> > -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> > +
> > +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> > +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
> >   void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
> >   size_t fn_args_size = sizeof fn_args;
> >   void *config[] = {
> > @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void
> > *tgt_vars, void **args)
> >      maybe_abort_msg);
> >   else if (r != CUDA_SUCCESS)
> >     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> > -  nvptx_stacks_free (stacks, teams * threads);
> > +
> > +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> >  }  
> 
> Do you need to hold the omp_stacks.lock across the entire offloading?
> Doesn't that serialize all offloading kernels to the same device?
> I mean, can't the lock be taken just shortly at the start to either
> acquire the cached stacks or allocate a fresh stack, and then at the
> end to put the stack back into the cache?

I think you're suggesting something like what Alexander mentioned -- a
pool of cached stacks blocks in case the single, locked block is
contested. Obviously at present kernel launches are serialised on the
target anyway, so it's a question of whether having the device wait for
the host to unlock the stacks block (i.e. a context switch, FSVO context
switch), or allocating a new stacks block, is quicker. I think the
numbers posted in the parent email show that memory allocation is so
slow that just waiting for the lock wins. I'm wary of adding
unnecessary complication, especially if it'll only be exercised in
already hard-to-debug cases (i.e. lots of threads)!

Just ignoring the cache if it's "in use" (and doing an allocation/free
of another stacks block, as at present) is something I'd not quite
considered. Indeed that might work, but I'm not sure if it'll be
any faster in practice.
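
Concretely, something like the following sketch (the nvptx_stacks_acquire
and CUDA_CALL_ERET names follow the patch, but this fallback path is not
implemented there):

static void *
stacks_acquire_or_alloc (struct ptx_device *ptx_dev, size_t size, int num,
                         bool *used_cache)
{
  /* Only use the cached block if nobody else currently holds it.  */
  if (pthread_mutex_trylock (&ptx_dev->omp_stacks.lock) == 0)
    {
      *used_cache = true;
      /* The lock stays held until the caller's cuCtxSynchronize, as in
         the patch.  */
      return nvptx_stacks_acquire (ptx_dev, size, num);
    }

  /* Cache busy: fall back to a throwaway allocation, freed by the caller
     straight after the kernel completes.  */
  *used_cache = false;
  CUdeviceptr stacks;
  CUDA_CALL_ERET (NULL, cuMemAlloc, &stacks, size * num);
  return (void *) stacks;
}

The caller would then unlock omp_stacks.lock after its cuCtxSynchronize when
*used_cache is set, and cuMemFree the throwaway block otherwise.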

> Also, how will this caching interact with malloc etc. performed in
> target regions?  Shall we do the caching only if there is no other
> concurrent offloading to the device, because the newlib malloc will
> not be able to figure out that it could free this and let the host
> know it has freed it?

Does target-side memory allocation call back into the plugin's
GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
memory allocation shouldn't be affected, I don't think?

Thanks,

Julian


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-15 Thread Jakub Jelinek via Gcc-patches
On Tue, Dec 15, 2020 at 01:39:13PM +, Julian Brown wrote:
> @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void 
> *tgt_vars, void **args)
>   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>  
>size_t stack_size = nvptx_stacks_size ();
> -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> +
> +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
>void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
>size_t fn_args_size = sizeof fn_args;
>void *config[] = {
> @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void 
> *tgt_vars, void **args)
>  maybe_abort_msg);
>else if (r != CUDA_SUCCESS)
>  GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +
> +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>  }

Do you need to hold the omp_stacks.lock across the entire offloading?
Doesn't that serialize all offloading kernels to the same device?
I mean, can't the lock be taken just shortly at the start to either acquire
the cached stacks or allocate a fresh stack, and then at the end to put the
stack back into the cache?

Also, how will this caching interact with malloc etc. performed in target
regions?  Shall we do the caching only if there is no other concurrent
offloading to the device, because the newlib malloc will not be able to
figure out that it could free this and let the host know it has freed it?

Jakub



Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-15 Thread Julian Brown
On Tue, 8 Dec 2020 20:11:38 +0300
Alexander Monakov  wrote:

> On Tue, 8 Dec 2020, Julian Brown wrote:
> 
> > Ping?  
> 
> This has addressed my concerns, thanks.

Jakub, Tom -- just to confirm, is this OK for trunk now?

I noticed a slight bugfix myself in the no-stacks/out-of-memory case --
i.e. for OpenACC, in nvptx_stacks_free. The attached version of the
patch includes that fix.

Thanks,

Julian

commit 2e4160f0f0532890f6a2b405222f02c7ee1bb0ac
Author: Julian Brown 
Date:   Wed Oct 21 10:00:19 2020 -0700

nvptx: Cache stacks block for OpenMP kernel launch

2020-11-13  Julian Brown  

libgomp/
* plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
(struct ptx_device): Add omp_stacks struct.
(nvptx_open_device): Initialise cached-stacks housekeeping info.
(nvptx_close_device): Free cached stacks block and mutex.
(nvptx_stacks_free): New function.
(nvptx_alloc): Add SUPPRESS_ERRORS parameter.
(GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks block.
(nvptx_stacks_alloc): Rename to...
(nvptx_stacks_acquire): This.  Cache stacks block between runs if same
size or smaller is required.
(nvptx_stacks_free): Remove.
(GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks block
during kernel execution.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..e08c36094098 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -49,6 +49,15 @@
 #include 
 #include 
 
+/* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
+   block to cache between kernel invocations.  For soft-stacks blocks bigger
+   than this, we will free the block before attempting another GPU memory
+   allocation (i.e. in GOMP_OFFLOAD_alloc).  Otherwise, if an allocation fails,
+   we will free the cached soft-stacks block anyway then retry the
+   allocation.  If that fails too, we lose.  */
+
+#define SOFTSTACK_CACHE_LIMIT 134217728
+
 #if CUDA_VERSION < 6000
 extern CUresult cuGetErrorString (CUresult, const char **);
 #define CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR 82
@@ -307,6 +316,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+{
+  CUdeviceptr ptr;
+  size_t size;
+  pthread_mutex_t lock;
+} omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +531,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +555,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
 CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -999,12 +1025,40 @@ goacc_profiling_acc_ev_alloc (struct goacc_thread *thr, void *dp, size_t s)
   GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, api_info);
 }
 
+/* Free the cached soft-stacks block if it is above the SOFTSTACK_CACHE_LIMIT
+   size threshold, or if FORCE is true.  */
+
+static void
+nvptx_stacks_free (struct ptx_device *ptx_dev, bool force)
+{
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+  if (ptx_dev->omp_stacks.ptr
+  && (force || ptx_dev->omp_stacks.size > SOFTSTACK_CACHE_LIMIT))
+{
+  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+  if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+}
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
+}
+
 static void *
-nvptx_alloc (size_t s)
+nvptx_alloc (size_t s, bool suppress_errors)
 {
   CUdeviceptr d;
 
-  CUDA_CALL_ERET (NULL, cuMemAlloc, &d, s);
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &d, s);
+  if (suppress_errors && r == CUDA_ERROR_OUT_OF_MEMORY)
+return NULL;
+  else if (r != CUDA_SUCCESS)
+{
+  GOMP_PLUGIN_error ("nvptx_alloc error: %s", cuda_error (r));
+  return NULL;
+}
+
+  /* NOTE: We only do profiling stuff if the memory allocation succeeds.  */
   struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
   bool profiling_p
 = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
@@ -1352,6 +1406,8 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
 
+  nvptx_stacks_free (ptx_dev, false);
+
   while (blocks)
 {
   tmp = blocks->next;
@@ -1360,7 +1416,16 @@ 

Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-08 Thread Alexander Monakov via Gcc-patches


On Tue, 8 Dec 2020, Julian Brown wrote:

> Ping?

This has addressed my concerns, thanks.

Alexander

> On Fri, 13 Nov 2020 20:54:54 +
> Julian Brown  wrote:
> 
> > Hi Alexander,
> > 
> > Thanks for the review! Comments below.
> > 
> > On Tue, 10 Nov 2020 00:32:36 +0300
> > Alexander Monakov  wrote:
> > 
> > > On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> > >   
> > > > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > > > > This patch adds caching for the stack block allocated for
> > > > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > > > optimisation -- we observed an average 11% or so performance
> > > > > improvement with this patch across a set of accelerated GPU
> > > > > benchmarks on one machine (results vary according to individual
> > > > > benchmark and with hardware used).
> > > 
> > > In this patch you're folding two changes together: reuse of
> > > allocated stacks and removing one host-device synchronization.  Why
> > > is that? Can you report performance change separately for each
> > > change (and split out the patches)?  
> > 
> > An accident of the development process of the patch, really -- the
> > idea for removing the post-kernel-launch synchronisation came from the
> > OpenACC side, and adapting it to OpenMP meant the stacks had to remain
> > allocated after the return of the GOMP_OFFLOAD_run function.
> > 
> > > > > A given kernel launch will reuse the stack block from the
> > > > > previous launch if it is large enough, else it is freed and
> > > > > reallocated. A slight caveat is that memory will not be freed
> > > > > until the device is closed, so e.g. if code is using highly
> > > > > variable launch geometries and large amounts of GPU RAM, you
> > > > > might run out of resources slightly quicker with this patch.
> > > > > 
> > > > > Another way this patch gains performance is by omitting the
> > > > > synchronisation at the end of an OpenMP offload kernel launch --
> > > > > it's safe for the GPU and CPU to continue executing in parallel
> > > > > at that point, because e.g. copies-back from the device will be
> > > > > synchronised properly with kernel completion anyway.
> > > 
> > > I don't think this explanation is sufficient. My understanding is
> > > that OpenMP forbids the host to proceed asynchronously after the
> > > target construct unless it is a 'target nowait' construct. This may
> > > be observable if there's a printf in the target region for example
> > > (or if it accesses memory via host pointers).
> > > 
> > > So this really needs to be a separate patch with more explanation
> > > why this is okay (if it is okay).  
> > 
> > As long as the offload kernel only touches GPU memory and does not
> > have any CPU-visible side effects (like the printf you mentioned -- I
> > hadn't really considered that, oops!), it's probably OK.
> > 
> > But anyway, the benefit obtained on OpenMP code (the same set of
> > benchmarks run before) of omitting the synchronisation at the end of
> > GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
> > stacks caching, and miss out the synchronisation removal for now. (It
> > might still be something worth considering later, perhaps, as long as
> > we can show some given kernel doesn't use printf or access memory via
> > host pointers -- I guess the former might be easier than the latter. I
> > have observed the equivalent OpenACC patch provide a significant boost
> > on some benchmarks, so there's probably something that could be gained
> > on the OpenMP side too.)
> > 
> > The benefit with the attached patch -- just stacks caching, no
> > synchronisation removal -- is about 12% on the same set of benchmarks
> > as before. Results are a little noisy on the machine I'm benchmarking
> > on, so this isn't necessarily proof that the synchronisation removal
> > is harmful for performance!
> > 
> > > > > In turn, the last part necessitates a change to the way
> > > > > "(perhaps abort was called)" errors are detected and reported.
> > > > >   
> > > 
> > > As already mentioned using callbacks is problematic. Plus, I'm sure
> > > the way you lock out other threads is a performance loss when
> > > multiple threads have target regions: even though they will not run
> > > concurrently on the GPU, you still want to allow host threads to
> > > submit GPU jobs while the GPU is occupied.
> > > 
> > > I would suggest to have a small pool (up to 3 entries perhaps) of
> > > stacks. Then you can arrange reuse without totally serializing host
> > > threads on target regions.  
> > 
> > I'm really wary of the additional complexity of adding a stack pool,
> > and the memory allocation/freeing code paths in CUDA appear to be so
> > slow that we get a benefit with this patch even when the GPU stream
> > has to wait for the CPU to unlock the stacks block. Also, for large
> > GPU launches, the size of the soft-stacks block isn't really trivial
> > (I've seen something like 50MB on the 

Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-12-07 Thread Julian Brown
Ping?

Thanks,

Julian

On Fri, 13 Nov 2020 20:54:54 +
Julian Brown  wrote:

> Hi Alexander,
> 
> Thanks for the review! Comments below.
> 
> On Tue, 10 Nov 2020 00:32:36 +0300
> Alexander Monakov  wrote:
> 
> > On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> >   
> > > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > > > This patch adds caching for the stack block allocated for
> > > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > > optimisation -- we observed an average 11% or so performance
> > > > improvement with this patch across a set of accelerated GPU
> > > > benchmarks on one machine (results vary according to individual
> > > > benchmark and with hardware used).
> > 
> > In this patch you're folding two changes together: reuse of
> > allocated stacks and removing one host-device synchronization.  Why
> > is that? Can you report performance change separately for each
> > change (and split out the patches)?  
> 
> An accident of the development process of the patch, really -- the
> idea for removing the post-kernel-launch synchronisation came from the
> OpenACC side, and adapting it to OpenMP meant the stacks had to remain
> allocated after the return of the GOMP_OFFLOAD_run function.
> 
> > > > A given kernel launch will reuse the stack block from the
> > > > previous launch if it is large enough, else it is freed and
> > > > reallocated. A slight caveat is that memory will not be freed
> > > > until the device is closed, so e.g. if code is using highly
> > > > variable launch geometries and large amounts of GPU RAM, you
> > > > might run out of resources slightly quicker with this patch.
> > > > 
> > > > Another way this patch gains performance is by omitting the
> > > > synchronisation at the end of an OpenMP offload kernel launch --
> > > > it's safe for the GPU and CPU to continue executing in parallel
> > > > at that point, because e.g. copies-back from the device will be
> > > > synchronised properly with kernel completion anyway.
> > 
> > I don't think this explanation is sufficient. My understanding is
> > that OpenMP forbids the host to proceed asynchronously after the
> > target construct unless it is a 'target nowait' construct. This may
> > be observable if there's a printf in the target region for example
> > (or if it accesses memory via host pointers).
> > 
> > So this really needs to be a separate patch with more explanation
> > why this is okay (if it is okay).  
> 
> As long as the offload kernel only touches GPU memory and does not
> have any CPU-visible side effects (like the printf you mentioned -- I
> hadn't really considered that, oops!), it's probably OK.
> 
> But anyway, the benefit obtained on OpenMP code (the same set of
> benchmarks run before) of omitting the synchronisation at the end of
> GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
> stacks caching, and miss out the synchronisation removal for now. (It
> might still be something worth considering later, perhaps, as long as
> we can show some given kernel doesn't use printf or access memory via
> host pointers -- I guess the former might be easier than the latter. I
> have observed the equivalent OpenACC patch provide a significant boost
> on some benchmarks, so there's probably something that could be gained
> on the OpenMP side too.)
> 
> The benefit with the attached patch -- just stacks caching, no
> synchronisation removal -- is about 12% on the same set of benchmarks
> as before. Results are a little noisy on the machine I'm benchmarking
> on, so this isn't necessarily proof that the synchronisation removal
> is harmful for performance!
> 
> > > > In turn, the last part necessitates a change to the way
> > > > "(perhaps abort was called)" errors are detected and reported.
> > > >   
> > 
> > As already mentioned using callbacks is problematic. Plus, I'm sure
> > the way you lock out other threads is a performance loss when
> > multiple threads have target regions: even though they will not run
> > concurrently on the GPU, you still want to allow host threads to
> > submit GPU jobs while the GPU is occupied.
> > 
> > I would suggest to have a small pool (up to 3 entries perhaps) of
> > stacks. Then you can arrange reuse without totally serializing host
> > threads on target regions.  
> 
> I'm really wary of the additional complexity of adding a stack pool,
> and the memory allocation/freeing code paths in CUDA appear to be so
> slow that we get a benefit with this patch even when the GPU stream
> has to wait for the CPU to unlock the stacks block. Also, for large
> GPU launches, the size of the soft-stacks block isn't really trivial
> (I've seen something like 50MB on the hardware I'm using, with default
> options), and multiplying that by 3 could start to eat into the GPU
> heap memory for "useful data" quite significantly.
> 
> Consider the attached (probably not amazingly-written) microbenchmark.
> It spawns 8 threads which each 

Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-11-13 Thread Julian Brown
Hi Alexander,

Thanks for the review! Comments below.

On Tue, 10 Nov 2020 00:32:36 +0300
Alexander Monakov  wrote:

> On Mon, 26 Oct 2020, Jakub Jelinek wrote:
> 
> > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:  
> > > This patch adds caching for the stack block allocated for
> > > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > > optimisation -- we observed an average 11% or so performance
> > > improvement with this patch across a set of accelerated GPU
> > > benchmarks on one machine (results vary according to individual
> > > benchmark and with hardware used).  
> 
> In this patch you're folding two changes together: reuse of allocated
> stacks and removing one host-device synchronization.  Why is that?
> Can you report performance change separately for each change (and
> split out the patches)?

An accident of the development process of the patch, really -- the idea
for removing the post-kernel-launch synchronisation came from the
OpenACC side, and adapting it to OpenMP meant the stacks had to remain
allocated after the return of the GOMP_OFFLOAD_run function.

> > > A given kernel launch will reuse the stack block from the
> > > previous launch if it is large enough, else it is freed and
> > > reallocated. A slight caveat is that memory will not be freed
> > > until the device is closed, so e.g. if code is using highly
> > > variable launch geometries and large amounts of GPU RAM, you
> > > might run out of resources slightly quicker with this patch.
> > > 
> > > Another way this patch gains performance is by omitting the
> > > synchronisation at the end of an OpenMP offload kernel launch --
> > > it's safe for the GPU and CPU to continue executing in parallel
> > > at that point, because e.g. copies-back from the device will be
> > > synchronised properly with kernel completion anyway.  
> 
> I don't think this explanation is sufficient. My understanding is
> that OpenMP forbids the host to proceed asynchronously after the
> target construct unless it is a 'target nowait' construct. This may
> be observable if there's a printf in the target region for example
> (or if it accesses memory via host pointers).
> 
> So this really needs to be a separate patch with more explanation why
> this is okay (if it is okay).

As long as the offload kernel only touches GPU memory and does not have
any CPU-visible side effects (like the printf you mentioned -- I hadn't
really considered that, oops!), it's probably OK.

But anyway, the benefit obtained on OpenMP code (the same set of
benchmarks run before) of omitting the synchronisation at the end of
GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
stacks caching, and miss out the synchronisation removal for now. (It
might still be something worth considering later, perhaps, as long as
we can show some given kernel doesn't use printf or access memory via
host pointers -- I guess the former might be easier than the latter. I
have observed the equivalent OpenACC patch provide a significant boost
on some benchmarks, so there's probably something that could be gained
on the OpenMP side too.)

The benefit with the attached patch -- just stacks caching, no
synchronisation removal -- is about 12% on the same set of benchmarks
as before. Results are a little noisy on the machine I'm benchmarking
on, so this isn't necessarily proof that the synchronisation removal is
harmful for performance!

> > > In turn, the last part necessitates a change to the way "(perhaps
> > > abort was called)" errors are detected and reported.  
> 
> As already mentioned using callbacks is problematic. Plus, I'm sure
> the way you lock out other threads is a performance loss when
> multiple threads have target regions: even though they will not run
> concurrently on the GPU, you still want to allow host threads to
> submit GPU jobs while the GPU is occupied.
> 
> I would suggest to have a small pool (up to 3 entries perhaps) of
> stacks. Then you can arrange reuse without totally serializing host
> threads on target regions.

I'm really wary of the additional complexity of adding a stack pool,
and the memory allocation/freeing code paths in CUDA appear to be so
slow that we get a benefit with this patch even when the GPU stream has
to wait for the CPU to unlock the stacks block. Also, for large GPU
launches, the size of the soft-stacks block isn't really trivial (I've
seen something like 50MB on the hardware I'm using, with default
options), and multiplying that by 3 could start to eat into the GPU
heap memory for "useful data" quite significantly.

Consider the attached (probably not amazingly-written) microbenchmark.
It spawns 8 threads which each launch lots of OpenMP kernels
performing some trivial work, then joins the threads and checks the
results. As a baseline, with the "FEWER_KERNELS" parameters set (256
kernel launches over 8 threads), this gives us over 5 runs:

real3m55.375s
user7m14.192s
sys 0m30.148s

real

Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-11-09 Thread Alexander Monakov via Gcc-patches
On Mon, 26 Oct 2020, Jakub Jelinek wrote:

> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation --
> > we observed an average 11% or so performance improvement with this patch
> > across a set of accelerated GPU benchmarks on one machine (results vary
> > according to individual benchmark and with hardware used).

In this patch you're folding two changes together: reuse of allocated stacks
and removing one host-device synchronization.  Why is that?  Can you report
performance change separately for each change (and split out the patches)?

> > A given kernel launch will reuse the stack block from the previous launch
> > if it is large enough, else it is freed and reallocated. A slight caveat
> > is that memory will not be freed until the device is closed, so e.g. if
> > code is using highly variable launch geometries and large amounts of
> > GPU RAM, you might run out of resources slightly quicker with this patch.
> > 
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch -- it's
> > safe for the GPU and CPU to continue executing in parallel at that point,
> > because e.g. copies-back from the device will be synchronised properly
> > with kernel completion anyway.

I don't think this explanation is sufficient. My understanding is that OpenMP
forbids the host to proceed asynchronously after the target construct unless
it is a 'target nowait' construct. This may be observable if there's a printf
in the target region for example (or if it accesses memory via host pointers).
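
For example (a plain illustration only, nothing to do with the patch itself):

#include <stdio.h>

int
main (void)
{
  #pragma omp target
  printf ("printed from the device\n");

  /* Synchronous target: the host must not get here before the region
     above has completed.  */
  printf ("printed on the host, after the target region\n");

  #pragma omp target nowait
  printf ("printed from the device, possibly later\n");

  /* With nowait the host may run ahead; ordering is only re-established
     at a task synchronisation point such as this one.  */
  #pragma omp taskwait
  return 0;
}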

So this really needs to be a separate patch with more explanation why this is
okay (if it is okay).

> > In turn, the last part necessitates a change to the way "(perhaps abort
> > was called)" errors are detected and reported.

As already mentioned using callbacks is problematic. Plus, I'm sure the way
you lock out other threads is a performance loss when multiple threads have
target regions: even though they will not run concurrently on the GPU, you
still want to allow host threads to submit GPU jobs while the GPU is occupied.

I would suggest to have a small pool (up to 3 entries perhaps) of stacks. Then
you can arrange reuse without totally serializing host threads on target
regions.
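
Roughly along these lines -- a hypothetical sketch only, with invented names
and a fixed entry count, assuming <stdbool.h>, <pthread.h> and <cuda.h>:

#define OMP_STACKS_POOL_ENTRIES 3

struct omp_stacks_entry
{
  CUdeviceptr ptr;
  size_t size;
  bool in_use;
};

struct omp_stacks_pool
{
  pthread_mutex_t lock;
  struct omp_stacks_entry entry[OMP_STACKS_POOL_ENTRIES];
};

/* Hand out a free cached block of at least SIZE bytes, holding the pool
   lock only briefly; NULL means the caller should fall back to a one-off
   cuMemAlloc/cuMemFree.  */

static struct omp_stacks_entry *
stacks_pool_get (struct omp_stacks_pool *pool, size_t size)
{
  struct omp_stacks_entry *found = NULL;

  pthread_mutex_lock (&pool->lock);
  for (int i = 0; i < OMP_STACKS_POOL_ENTRIES; i++)
    if (!pool->entry[i].in_use && pool->entry[i].size >= size)
      {
        pool->entry[i].in_use = true;
        found = &pool->entry[i];
        break;
      }
  pthread_mutex_unlock (&pool->lock);

  return found;
}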

Alexander


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-28 Thread Julian Brown
On Wed, 28 Oct 2020 15:25:56 +0800
Chung-Lin Tang  wrote:

> On 2020/10/27 9:17 PM, Julian Brown wrote:
> >> And, in which context are cuStreamAddCallback registered callbacks
> >> run? E.g. if it is inside of an asynchronous interrupt, using locking
> >> in there might not be the best thing to do.  
> > The cuStreamAddCallback API is documented here:
> > 
> > https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> > 
> > We're quite limited in what we can do in the callback function since
> > "Callbacks must not make any CUDA API calls". So what*can*  a
> > callback function do? It is mentioned that the callback function's
> > execution will "pause" the stream it is logically running on. So
> > can we get deadlock, e.g. if multiple host threads are launching
> > offload kernels simultaneously? I don't think so, but I don't know
> > how to prove it!  
> 
> I think it's not deadlock that's a problem here, but that the lock
> acquisition in nvptx_stacks_acquire will effectively serialize GPU
> kernel execution to just one host thread (since you're holding it
> till kernel completion). Also in that case, why do you need to use a
> CUDA callback? You can just call the unlock directly afterwards.

IIUC, there's a single GPU queue used for synchronous launches no
matter which host thread initiates the operation, and kernel execution
is serialised anyway, so that shouldn't be a problem. The only way to
get different kernels executing simultaneously is to use different CUDA
streams -- but I think that's still TBD for OpenMP ("TODO: Implement
GOMP_OFFLOAD_async_run").

> I think a better way is to use a list of stack blocks in ptx_dev, and
> quickly retrieve/unlock it in nvptx_stacks_acquire, as we do in
> GOMP_OFFLOAD_alloc for general device memory allocation.

If it weren't for the serialisation, we could also keep a stack cache
per-host-thread in nvptx_thread. But as it is, I don't think we need the
extra complication. When we do OpenMP async support, maybe a stack
cache can be put per-stream in goacc_asyncqueue or the OpenMP
equivalent.

Thanks,

Julian


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-28 Thread Chung-Lin Tang




On 2020/10/27 9:17 PM, Julian Brown wrote:

And, in which context are cuStreamAddCallback registered callbacks
run? E.g. if it is inside of an asynchronous interrupt, using locking in
there might not be the best thing to do.

The cuStreamAddCallback API is documented here:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483

We're quite limited in what we can do in the callback function since
"Callbacks must not make any CUDA API calls". So what*can*  a callback
function do? It is mentioned that the callback function's execution will
"pause" the stream it is logically running on. So can we get deadlock,
e.g. if multiple host threads are launching offload kernels
simultaneously? I don't think so, but I don't know how to prove it!


I think it's not deadlock that's a problem here, but that the lock acquisition
in nvptx_stacks_acquire will effectively serialize GPU kernel execution to just
one host thread (since you're holding it till kernel completion).
Also in that case, why do you need to use a CUDA callback? You can just call the
unlock directly afterwards.

I think a better way is to use a list of stack blocks in ptx_dev, and quickly
retrieve/unlock it in nvptx_stacks_acquire, as we do in GOMP_OFFLOAD_alloc
for general device memory allocation.

Chung-Lin


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-27 Thread Julian Brown
(Apologies if threading is broken, for some reason I didn't receive
this reply directly!)

On Mon Oct 26 14:26:34 GMT 2020, Jakub Jelinek wrote:
> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation
> > -- we observed an average 11% or so performance improvement with
> > this patch across a set of accelerated GPU benchmarks on one
> > machine (results vary according to individual benchmark and with
> > hardware used).
> > 
> > A given kernel launch will reuse the stack block from the previous
> > launch if it is large enough, else it is freed and reallocated. A
> > slight caveat is that memory will not be freed until the device is
> > closed, so e.g. if code is using highly variable launch geometries
> > and large amounts of GPU RAM, you might run out of resources
> > slightly quicker with this patch.
> > 
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch --
> > it's safe for the GPU and CPU to continue executing in parallel at
> > that point, because e.g. copies-back from the device will be
> > synchronised properly with kernel completion anyway.
> > 
> > In turn, the last part necessitates a change to the way "(perhaps
> > abort was called)" errors are detected and reported.
> > 
> > Tested with offloading to NVPTX. OK for mainline?
> 
> I'm afraid I don't know the plugin nor CUDA well enough to review this
> properly (therefore I'd like to hear from Thomas, Tom and/or
> Alexander). Anyway, just two questions: wouldn't it make sense to add
> some upper bound limit over which it wouldn't cache the stacks, so
> that it would cache most of the time for normal programs, but a really
> excessive kernel followed by many normal ones wouldn't result in
> memory allocation failures?

Yes, that might work -- another idea is to free the stacks then retry
if a memory allocation fails, though that might lead to worse
fragmentation, perhaps. For the upper bound idea we'd need to pick a
sensible maximum limit. Something like 16MB maybe? Or,
user-controllable or some fraction of the GPU's total memory?
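
The retry part could be as simple as the following sketch, where nvptx_alloc
and nvptx_stacks_free stand for suitably extended versions of the existing
helpers (roughly what a later revision of this patch ends up doing in
GOMP_OFFLOAD_alloc):

static void *
alloc_with_stacks_fallback (struct ptx_device *ptx_dev, size_t size)
{
  /* First attempt, with out-of-memory errors suppressed.  */
  void *d = nvptx_alloc (size, true);
  if (d)
    return d;

  /* The cached soft-stacks block may be what is pinning the memory:
     free it unconditionally and try once more.  */
  nvptx_stacks_free (ptx_dev, true);
  return nvptx_alloc (size, false);
}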

> And, in which context are cuStreamAddCallback registered callbacks
> run? E.g. if it is inside of an asynchronous interrupt, using locking in
> there might not be the best thing to do.

The cuStreamAddCallback API is documented here:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483

We're quite limited in what we can do in the callback function since
"Callbacks must not make any CUDA API calls". So what *can* a callback
function do? It is mentioned that the callback function's execution will
"pause" the stream it is logically running on. So can we get deadlock,
e.g. if multiple host threads are launching offload kernels
simultaneously? I don't think so, but I don't know how to prove it!
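
Because of that restriction, the callback registered by this patch
essentially just drops the lock taken before the launch. A sketch of its
shape (the name stacks_release_cb is illustrative; the real function in the
patch is nvptx_stacks_release):

static void
stacks_release_cb (CUstream stream, CUresult res, void *ptr)
{
  struct ptx_device *ptx_dev = (struct ptx_device *) ptr;

  /* No CUDA API calls allowed in here: just drop the lock that
     GOMP_OFFLOAD_run took before launching the kernel.  */
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
}

/* Registered on the default stream right after the launch:

     CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, stacks_release_cb,
                       (void *) ptx_dev, 0);  */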

Thanks,

Julian


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-26 Thread Jakub Jelinek via Gcc-patches
On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> This patch adds caching for the stack block allocated for offloaded
> OpenMP kernel launches on NVPTX. This is a performance optimisation --
> we observed an average 11% or so performance improvement with this patch
> across a set of accelerated GPU benchmarks on one machine (results vary
> according to individual benchmark and with hardware used).
> 
> A given kernel launch will reuse the stack block from the previous launch
> if it is large enough, else it is freed and reallocated. A slight caveat
> is that memory will not be freed until the device is closed, so e.g. if
> code is using highly variable launch geometries and large amounts of
> GPU RAM, you might run out of resources slightly quicker with this patch.
> 
> Another way this patch gains performance is by omitting the
> synchronisation at the end of an OpenMP offload kernel launch -- it's
> safe for the GPU and CPU to continue executing in parallel at that point,
> because e.g. copies-back from the device will be synchronised properly
> with kernel completion anyway.
> 
> In turn, the last part necessitates a change to the way "(perhaps abort
> was called)" errors are detected and reported.
> 
> Tested with offloading to NVPTX. OK for mainline?

I'm afraid I don't know the plugin nor CUDA well enough to review this
properly (therefore I'd like to hear from Thomas, Tom and/or Alexander).
Anyway, just two questions: wouldn't it make sense to add some upper bound
limit over which it wouldn't cache the stacks, so that it would cache
most of the time for normal programs, but a really excessive kernel followed
by many normal ones wouldn't result in memory allocation failures?

And, in which context are cuStreamAddCallback registered callbacks run?
E.g. if it is inside of an asynchronous interrupt, using locking in there might
not be the best thing to do.

> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> -  if (r == CUDA_ERROR_LAUNCH_FAILED)
> -GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
> -maybe_abort_msg);
> -  else if (r != CUDA_SUCCESS)
> -GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
> + (void *) ptx_dev, 0);
>  }
>  
>  /* TODO: Implement GOMP_OFFLOAD_async_run. */
> -- 
> 2.28.0

Jakub



[PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-26 Thread Julian Brown
Hi,

This patch adds caching for the stack block allocated for offloaded
OpenMP kernel launches on NVPTX. This is a performance optimisation --
we observed an average 11% or so performance improvement with this patch
across a set of accelerated GPU benchmarks on one machine (results vary
according to individual benchmark and with hardware used).

A given kernel launch will reuse the stack block from the previous launch
if it is large enough, else it is freed and reallocated. A slight caveat
is that memory will not be freed until the device is closed, so e.g. if
code is using highly variable launch geometries and large amounts of
GPU RAM, you might run out of resources slightly quicker with this patch.

Another way this patch gains performance is by omitting the
synchronisation at the end of an OpenMP offload kernel launch -- it's
safe for the GPU and CPU to continue executing in parallel at that point,
because e.g. copies-back from the device will be synchronised properly
with kernel completion anyway.

In turn, the last part necessitates a change to the way "(perhaps abort
was called)" errors are detected and reported.

Tested with offloading to NVPTX. OK for mainline?

Thanks,

Julian

2020-10-26  Julian Brown  

libgomp/
* plugin/plugin-nvptx.c (maybe_abort_message): Add function.
(CUDA_CALL_ERET, CUDA_CALL_ASSERT): Use above function.
(struct ptx_device): Add omp_stacks struct.
(nvptx_open_device): Initialise cached-stacks housekeeping info.
(nvptx_close_device): Free cached stacks block and mutex.
(nvptx_stacks_alloc): Rename to...
(nvptx_stacks_acquire): This.  Cache stacks block between runs if same
size or smaller is required.
(nvptx_stacks_free): Rename to...
(nvptx_stacks_release): This.  Do not free stacks block, but release
mutex.
(GOMP_OFFLOAD_run): Adjust for changes to above functions, and remove
special-case "abort" error handling and synchronisation after kernel
launch.
---
 libgomp/plugin/plugin-nvptx.c | 91 ++-
 1 file changed, 68 insertions(+), 23 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..e7ff5d5213e0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -137,6 +137,15 @@ init_cuda_lib (void)
 #define MIN(X,Y) ((X) < (Y) ? (X) : (Y))
 #define MAX(X,Y) ((X) > (Y) ? (X) : (Y))
 
+static const char *
+maybe_abort_message (unsigned errmsg)
+{
+  if (errmsg == CUDA_ERROR_LAUNCH_FAILED)
+return " (perhaps abort was called)";
+  else
+return "";
+}
+
 /* Convenience macros for the frequently used CUDA library call and
error handling sequence as well as CUDA library calls that
do the error checking themselves or don't do it at all.  */
@@ -147,8 +156,9 @@ init_cuda_lib (void)
   = CUDA_CALL_PREFIX FN (__VA_ARGS__); \
 if (__r != CUDA_SUCCESS)   \
   {\
-   GOMP_PLUGIN_error (#FN " error: %s",\
-  cuda_error (__r));   \
+   GOMP_PLUGIN_error (#FN " error: %s%s",  \
+  cuda_error (__r),\
+  maybe_abort_message (__r));  \
return ERET;\
   }\
   } while (0)
@@ -162,8 +172,9 @@ init_cuda_lib (void)
   = CUDA_CALL_PREFIX FN (__VA_ARGS__); \
 if (__r != CUDA_SUCCESS)   \
   {\
-   GOMP_PLUGIN_fatal (#FN " error: %s",\
-  cuda_error (__r));   \
+   GOMP_PLUGIN_fatal (#FN " error: %s%s",  \
+  cuda_error (__r),\
+  maybe_abort_message (__r));  \
   }\
   } while (0)
 
@@ -307,6 +318,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+{
+  CUdeviceptr ptr;
+  size_t size;
+  pthread_mutex_t lock;
+} omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +533,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +557,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
 CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -1866,26 +1894,49 @@ nvptx_stacks_size ()
   return