Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v4)

2019-01-08 Thread Chung-Lin Tang

On 2019/1/7 10:15 AM, Thomas Schwinge wrote:

Well, the "Properly handle wait clause with no arguments" changes still
need to be completed and go in first (to avoid introducing regressions),
and then I will have to see your whole set of changes that you intend to
commit: the bits you've incrementally posted still don't include several
of the changes I suggested and provided patches for (again, to avoid
introducing regressions).


I'll look at that state again.


But GCC now is in "regression and documentation fixes mode", so I fear
that it's too late now?


Maybe...I don't know.


--- oacc-async.c(revision 267507)
+++ oacc-async.c(working copy)
@@ -62,12 +158,10 @@ acc_wait (int async)
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);

Have to check the result here?  Like you're doing here, for example:


  acc_wait_async (int async1, int async2)
  {
+  if (!thr->dev->openacc.async.synchronize_func (aq1))
+gomp_fatal ("wait on %d failed", async1);
+  if (!thr->dev->openacc.async.serialize_func (aq1, aq2))
+gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);
--- oacc-parallel.c (revision 267507)
+++ oacc-parallel.c (working copy)
@@ -521,17 +500,22 @@ goacc_wait (int async, int num_waits, va_list *ap)
if (async == acc_async_sync)
-   acc_wait (qid);
+   acc_dev->openacc.async.synchronize_func (aq);

Likewise?


Oh okay, I forgot about those sites.



Also, I had to apply additional changes as attached, to make this build.



Oh I had those changes, but forgot to update the other patches. I'll resend 
those later too.

Thanks,
Chung-Lin


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v4)

2019-01-07 Thread Thomas Schwinge
Hi Chung-Lin!

On Sat, 5 Jan 2019 17:47:10 +0800, Chung-Lin Tang  
wrote:
> this is the current version of the oacc-* parts of the Async Re-work patch.
> 
> I have reverted away from the earlier mentioned attempt of using lockless
> techniques to manage the asyncqueues; it is really hard to do in a 100% 
> correct
> manner, unless we only use something like simple lists to manage them,
> which probably makes lookup unacceptably slow.
> 
> For now, I have changed to use the conventional locking and success/fail 
> return
> codes for the synchronize/serialize hooks.

OK, thanks.


> I hope this is enough to pass
> and get committed.

Well, the "Properly handle wait clause with no arguments" changes still
need to be completed and go in first (to avoid introducing regressions),
and then I will have to see your whole set of changes that you intend to
commit: the bits you've incrementally posted still don't include several
of the changes I suggested and provided patches for (again, to avoid
introducing regressions).


But GCC now is in "regression and documentation fixes mode", so I fear
that it's too late now?


> --- oacc-async.c  (revision 267507)
> +++ oacc-async.c  (working copy)

> @@ -62,12 +158,10 @@ acc_wait (int async)

> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> +  thr->dev->openacc.async.synchronize_func (aq);

Have to check the result here?  Like you're doing here, for example:

>  acc_wait_async (int async1, int async2)
>  {

> +  if (!thr->dev->openacc.async.synchronize_func (aq1))
> +gomp_fatal ("wait on %d failed", async1);
> +  if (!thr->dev->openacc.async.serialize_func (aq1, aq2))
> +gomp_fatal ("ordering of async ids %d and %d failed", async1, async2);

> --- oacc-parallel.c   (revision 267507)
> +++ oacc-parallel.c   (working copy)

> @@ -521,17 +500,22 @@ goacc_wait (int async, int num_waits, va_list *ap)

>if (async == acc_async_sync)
> - acc_wait (qid);
> + acc_dev->openacc.async.synchronize_func (aq);

Likewise?

>else if (qid == async)
> - ;/* If we're waiting on the same asynchronous queue as we're
> - launching on, the queue itself will order work as
> - required, so there's no need to wait explicitly.  */
> + /* If we're waiting on the same asynchronous queue as we're
> +launching on, the queue itself will order work as
> +required, so there's no need to wait explicitly.  */
> + ;
>else
> - acc_dev->openacc.async_wait_async_func (qid, async);
> + {
> +   goacc_aq aq2 = get_goacc_asyncqueue (async);
> +   acc_dev->openacc.async.synchronize_func (aq);
> +   acc_dev->openacc.async.serialize_func (aq, aq2);
> + }

Likewise?


Also, I had to apply additional changes as attached, to make this build.


Grüße
 Thomas


>From e4c187a4be46682a989165c38bc6a8d8324554b9 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Mon, 7 Jan 2019 13:25:18 +0100
Subject: [PATCH] [WIP] into async re-work: complete
 GOMP_OFFLOAD_openacc_async_synchronize, GOMP_OFFLOAD_openacc_async_serialize
 interface changes

---
 libgomp/libgomp-plugin.h  |  4 ++--
 libgomp/plugin/plugin-nvptx.c | 29 +
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index e3c031a282a1..ce3ae125e208 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -115,8 +115,8 @@ extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
 extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
 extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
-extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
-extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
+extern bool GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
+extern bool GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
 		  struct goacc_asyncqueue *);
 extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
 		   void (*)(void *), void *);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index f42cbf488a79..12f87ba7be4d 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1395,22 +1395,35 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
   return -1;
 }
 
-void
+bool
 GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
 {
-  //TODO Is this safe to call, or might this cause deadlock if something's locked?
-  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, aq->cuda_stream);
+  return r == CUDA_SUCCESS;
 }
 
-void
+bool
 GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
   struct goacc_asyncqueue *aq2)
 {
+  CUresult r;
   

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v2)

2019-01-07 Thread Thomas Schwinge
Hi Chung-Lin!

On Wed, 2 Jan 2019 20:46:12 +0800, Chung-Lin Tang  
wrote:
> Hi Thomas, Happy New Year,

Thanks!  If I remember right, you still have a few weeks until "your" New
Year/Spring Festival, right?


> On 2018/12/19 5:03 AM, Thomas Schwinge wrote:
> >> +
> >> +  if (!dev->openacc.async.asyncqueue[async])
> >> +{
> >> +  dev->openacc.async.asyncqueue[async] = 
> >> dev->openacc.async.construct_func ();
> >> +
> >> +  if (!dev->openacc.async.asyncqueue[async])
> >> +  {
> >> +gomp_mutex_unlock (>openacc.async.lock);
> >> +gomp_fatal ("async %d creation failed", async);
> >> +  }
> > That will now always fail for host fallback, where
> > "host_openacc_async_construct" just always does "return NULL".
> > 
> > Actually, if the device doesn't support asyncqueues, this whole function
> > should turn into some kind of no-op, so that we don't again and again try
> > to create a new one for every call to "lookup_goacc_asyncqueue".
> > 
> > I'm attaching one possible solution.  I think it's fine to assume that
> > the majority of devices will support asyncqueues, and for those that
> > don't, this is just a one-time overhead per async-argument.  So, no
> > special handling required in "lookup_goacc_asyncqueue".
> 
> > --- a/libgomp/oacc-host.c
> > +++ b/libgomp/oacc-host.c
> > @@ -212,7 +212,8 @@ host_openacc_async_queue_callback (struct 
> > goacc_asyncqueue *aq
> >   static struct goacc_asyncqueue *
> >   host_openacc_async_construct (void)
> >   {
> > -  return NULL;
> > +  /* We have to return non-NULL here, but it's OK to use a dummy.  */
> > +  return (struct goacc_asyncqueue *) -1;
> >   }
> 
> I'm not sure I understand the meaning of this? Is there any use to 
> segfaulting somewhere else
> due to this 0x... pointer?

There will be no such dereferencing (and thus no such segfault), as you
(quite nicely!) made this is an opaque data type to the generic code.
The concrete type is specific to, and only ever dereferenced inside each
plugin, and the "host plugin" never dereferences it, so returning minus
one here only serves as a non-NULL value/identifier to the generic code.

> A feature of a NULL asyncqueue should mean that it is simply synchronous

OK, then that should be documented, and as I mentioned above, the
"lookup" code be adjusted so that it doesn't again and again try to
create an asyncqueue when the "construct" function returns NULL.

> however this does somewhat
> conflict with the case of async.construct_func() returning NULL on error...
> 
> Perhaps, again using an explicit success code as the return value (and return 
> asyncqueue using
> an out parameter)?

Sure, that's also fine.  I just did it as presented above, because of its
simplicity, and to avoid adjusting the "lookup" code, as mentioned above.


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v4)

2019-01-05 Thread Chung-Lin Tang

Hi Thomas,
this is the current version of the oacc-* parts of the Async Re-work patch.

I have reverted away from the earlier mentioned attempt of using lockless
techniques to manage the asyncqueues; it is really hard to do in a 100% correct
manner, unless we only use something like simple lists to manage them,
which probably makes lookup unacceptably slow.

For now, I have changed to use the conventional locking and success/fail return
codes for the synchronize/serialize hooks. I hope this is enough to pass
and get committed.

Thanks,
Chung-Lin

Index: oacc-async.c
===
--- oacc-async.c(revision 267507)
+++ oacc-async.c(working copy)
@@ -27,10 +27,99 @@
.  */
 
 #include 
+#include 
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);
+
+  struct goacc_asyncqueue *ret_aq = NULL;
+  struct gomp_device_descr *dev = thr->dev;
+
+  gomp_mutex_lock (>openacc.async.lock);
+
+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+goto end;
+
+  if (async >= dev->openacc.async.nasyncqueue)
+{
+  int diff = async + 1 - dev->openacc.async.nasyncqueue;
+  dev->openacc.async.asyncqueue
+   = gomp_realloc (dev->openacc.async.asyncqueue,
+   sizeof (goacc_aq) * (async + 1));
+  memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+ 0, sizeof (goacc_aq) * diff);
+  dev->openacc.async.nasyncqueue = async + 1;
+}
+
+  if (!dev->openacc.async.asyncqueue[async])
+{
+  dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func 
();
+
+  if (!dev->openacc.async.asyncqueue[async])
+   {
+ gomp_mutex_unlock (>openacc.async.lock);
+ gomp_fatal ("async %d creation failed", async);
+   }
+  
+  /* Link new async queue into active list.  */
+  goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+  n->aq = dev->openacc.async.asyncqueue[async];
+  n->next = dev->openacc.async.active;
+  dev->openacc.async.active = n;
+}
+
+  ret_aq = dev->openacc.async.asyncqueue[async];
+
+ end:
+  gomp_mutex_unlock (>openacc.async.lock);
+  return ret_aq;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return lookup_goacc_asyncqueue (thr, true, async);
+}
+
 int
 acc_async_test (int async)
 {
@@ -42,18 +131,25 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
 gomp_fatal ("no device active");
 
-  return thr->dev->openacc.async_test_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  return thr->dev->openacc.async.test_func (aq);
 }
 
 int
 acc_async_test_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-gomp_fatal ("no device active");
-
-  return thr->dev->openacc.async_test_all_func ();
+  int ret = 1;
+  gomp_mutex_lock (>dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+if (!thr->dev->openacc.async.test_func (l->aq))
+  {
+   ret = 0;
+   break;
+  }
+  gomp_mutex_unlock (>dev->openacc.async.lock);
+  return ret;
 }
 
 void
@@ -62,12 +158,10 @@ acc_wait (int async)
   if (!async_valid_p (async))
 gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-gomp_fatal ("no device active");
-
-  thr->dev->openacc.async_wait_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -84,23 +178,34 @@ acc_async_wait (int async)
 void
 acc_wait_async (int async1, int async2)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-gomp_fatal ("no device 

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v2)

2019-01-02 Thread Chung-Lin Tang

Hi Thomas, Happy New Year,

On 2018/12/19 5:03 AM, Thomas Schwinge wrote:

+
+  if (!dev->openacc.async.asyncqueue[async])
+{
+  dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func 
();
+
+  if (!dev->openacc.async.asyncqueue[async])
+   {
+ gomp_mutex_unlock (>openacc.async.lock);
+ gomp_fatal ("async %d creation failed", async);
+   }

That will now always fail for host fallback, where
"host_openacc_async_construct" just always does "return NULL".

Actually, if the device doesn't support asyncqueues, this whole function
should turn into some kind of no-op, so that we don't again and again try
to create a new one for every call to "lookup_goacc_asyncqueue".

I'm attaching one possible solution.  I think it's fine to assume that
the majority of devices will support asyncqueues, and for those that
don't, this is just a one-time overhead per async-argument.  So, no
special handling required in "lookup_goacc_asyncqueue".



--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -212,7 +212,8 @@ host_openacc_async_queue_callback (struct goacc_asyncqueue 
*aq
  static struct goacc_asyncqueue *
  host_openacc_async_construct (void)
  {
-  return NULL;
+  /* We have to return non-NULL here, but it's OK to use a dummy.  */
+  return (struct goacc_asyncqueue *) -1;
  }


I'm not sure I understand the meaning of this? Is there any use to segfaulting 
somewhere else
due to this 0x... pointer?

A feature of a NULL asyncqueue should mean that it is simply synchronous, 
however this does somewhat
conflict with the case of async.construct_func() returning NULL on error...

Perhaps, again using an explicit success code as the return value (and return 
asyncqueue using
an out parameter)?

Chung-Lin


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v3)

2018-12-28 Thread Thomas Schwinge
Hi Chung-Lin!

On Sat, 22 Dec 2018 00:04:56 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/19 5:03 AM, Thomas Schwinge wrote:
> > On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang 
> >  wrote:
> >> this part includes some of the lookup_goacc_asyncqueue fixes we talked 
> >> about.
> >> I am still thinking about how the queue lock problem should really be 
> >> solved, so regard
> >> this patch as just fixing some of the problems.
> 
> This is my solution to the queue lock stuff we talked about. I've only 
> attached a diff to
> be applied atop of the existing changes, though that may actually be easier 
> to review.
> 
> Note that this is still in testing, which means it hasn't been tested :P, but 
> I'm
> posting to see if you have time to give it a look before the holidays.
> 
> Having the need for a lock on the async queues is quite irritating, especially
> when the structure needed for managing them is quite simple.
>
> Therefore, lets do away the need for locks entirely.

OK, if that's easily possible, fine.  Though, I won't object to the
current "lock" version, if done properly in all the relevant places, as
pointed out.

(And, as that was an open issue, as far as I remember: I think its fine
to have "aq"s not protected by the async lock, because they will only be
destroyed by device shutdown, and at that point, all async operations
must be synchronized with the local (host) thread, so the are then "aq"s
idle.)

> This patch makes the asyncqueue part of the device->openacc.async managed by 
> lock-free
> atomic operations; almost all of the complexity is contained in 
> lookup_goacc_asyncqueue(),
> so it should be not too complex. A descriptor and the queue array is 
> allocated/exchanged
> atomically when more storage is required, while in the common case a simple 
> lookup is enough.
> The fact that we manage asyncqueues by only expanding and never destroying 
> asyncqueues
> during the device lifetime also simplifies many things.
> 
> The current implementation may be not that optimized and clunky in some 
> cases, but I think
> this should be the way to implement what should be a fairly simple asyncqueue 
> array and active
> list.

OK conceptually, but from the (very quick only!) look that I had, I did
not completely understand the data structure you're adding there, and
don't you also have to use atomic instructions for reading the lock-free
data structure ("aqdesc" etc.)?

> I'll update more on the status as testing proceeds.
> 
> (and about the other corners you noticed in the last mail, I'll get to that 
> later...)

OK, thanks.


Grüße
 Thomas


> >> --- libgomp/oacc-async.c   (revision 267226)
> >> +++ libgomp/oacc-async.c   (working copy)
> > 
> >> +attribute_hidden struct goacc_asyncqueue *
> >> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
> >> +{
> >> +[...]
> >> +  /* Link new async queue into active list.  */
> >> +  goacc_aq_list n = gomp_malloc (sizeof (struct 
> >> goacc_asyncqueue_list));
> >> +  n->aq = dev->openacc.async.asyncqueue[async];
> >> +  n->next = dev->openacc.async.active;
> >> +  dev->openacc.async.active = n;
> >> +}
> >> +  gomp_mutex_unlock (>openacc.async.lock);
> > 
> > You still need to keep "async" locked during...
> > 
> >> +  return dev->openacc.async.asyncqueue[async];
> > 
> > ... this dereference.
> > 
> >> +}

> --- trunk-orig/libgomp/libgomp.h  2018-12-20 23:24:20.040375724 +0800
> +++ trunk-work/libgomp/libgomp.h  2018-12-21 23:32:47.938628954 +0800
> @@ -937,6 +937,13 @@
>  
>  #include "splay-tree.h"
>  
> +struct goacc_asyncqueue_desc
> +{
> +  int nasyncqueue;
> +  struct goacc_asyncqueue **asyncqueue;
> +  struct goacc_asyncqueue_list *active;
> +};
> +
>  typedef struct acc_dispatch_t
>  {
>/* This is a linked list of data mapped using the
> @@ -955,10 +962,7 @@
>  *destroy_thread_data_func;
>
>struct {
> -gomp_mutex_t lock;
> -int nasyncqueue;
> -struct goacc_asyncqueue **asyncqueue;
> -struct goacc_asyncqueue_list *active;
> +struct goacc_asyncqueue_desc *aqdesc;
>  
>  __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;
>  __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;
> diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c
> --- trunk-orig/libgomp/oacc-async.c   2018-12-18 22:19:51.923102938 +0800
> +++ trunk-work/libgomp/oacc-async.c   2018-12-21 23:40:36.088528890 +0800
> @@ -70,45 +70,84 @@
>  
>struct gomp_device_descr *dev = thr->dev;
>  
> -  gomp_mutex_lock (>openacc.async.lock);
> +  struct goacc_asyncqueue_desc *aqdesc = dev->openacc.async.aqdesc;
>  
> -  if (!create
> -  && (async >= dev->openacc.async.nasyncqueue
> -   || !dev->openacc.async.asyncqueue[async]))
> -{
> -  gomp_mutex_unlock (>openacc.async.lock);
> -  return NULL;
> -}
> +  if (!create)
> +return async < aqdesc->nasyncqueue ? aqdesc->asyncqueue[async] : NULL;
>  
> -  if (async >= 

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v3)

2018-12-21 Thread Chung-Lin Tang

On 2018/12/19 5:03 AM, Thomas Schwinge wrote:

Hi Chung-Lin!

On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang  
wrote:

this part includes some of the lookup_goacc_asyncqueue fixes we talked about.
I am still thinking about how the queue lock problem should really be solved, 
so regard
this patch as just fixing some of the problems.


Hi Thomas,
This is my solution to the queue lock stuff we talked about. I've only attached 
a diff to
be applied atop of the existing changes, though that may actually be easier to 
review.

Note that this is still in testing, which means it hasn't been tested :P, but 
I'm
posting to see if you have time to give it a look before the holidays.

Having the need for a lock on the async queues is quite irritating, especially
when the structure needed for managing them is quite simple.

Therefore, lets do away the need for locks entirely.

This patch makes the asyncqueue part of the device->openacc.async managed by 
lock-free
atomic operations; almost all of the complexity is contained in 
lookup_goacc_asyncqueue(),
so it should be not too complex. A descriptor and the queue array is 
allocated/exchanged
atomically when more storage is required, while in the common case a simple 
lookup is enough.
The fact that we manage asyncqueues by only expanding and never destroying 
asyncqueues
during the device lifetime also simplifies many things.

The current implementation may be not that optimized and clunky in some cases, 
but I think
this should be the way to implement what should be a fairly simple asyncqueue 
array and active
list. I'll update more on the status as testing proceeds.

(and about the other corners you noticed in the last mail, I'll get to that 
later...)

Thanks,
Chung-Lin








Sure, thanks.

Two comments, though:


--- libgomp/oacc-async.c(revision 267226)
+++ libgomp/oacc-async.c(working copy)



+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  gomp_mutex_lock (>openacc.async.lock);
+
+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+{
+  gomp_mutex_unlock (>openacc.async.lock);
+  return NULL;
+}
+
+  if (async >= dev->openacc.async.nasyncqueue)
+{
+  int diff = async + 1 - dev->openacc.async.nasyncqueue;
+  dev->openacc.async.asyncqueue
+   = gomp_realloc (dev->openacc.async.asyncqueue,
+   sizeof (goacc_aq) * (async + 1));
+  memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+ 0, sizeof (goacc_aq) * diff);
+  dev->openacc.async.nasyncqueue = async + 1;
+}
+
+  if (!dev->openacc.async.asyncqueue[async])
+{
+  dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func 
();
+
+  if (!dev->openacc.async.asyncqueue[async])
+   {
+ gomp_mutex_unlock (>openacc.async.lock);
+ gomp_fatal ("async %d creation failed", async);
+   }


That will now always fail for host fallback, where
"host_openacc_async_construct" just always does "return NULL".

Actually, if the device doesn't support asyncqueues, this whole function
should turn into some kind of no-op, so that we don't again and again try
to create a new one for every call to "lookup_goacc_asyncqueue".

I'm attaching one possible solution.  I think it's fine to assume that
the majority of devices will support asyncqueues, and for those that
don't, this is just a one-time overhead per async-argument.  So, no
special handling required in "lookup_goacc_asyncqueue".


+  /* Link new async queue into active list.  */
+  goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+  n->aq = dev->openacc.async.asyncqueue[async];
+  n->next = dev->openacc.async.active;
+  dev->openacc.async.active = n;
+}
+  gomp_mutex_unlock (>openacc.async.lock);


You still need to keep "async" locked during...


+  return dev->openacc.async.asyncqueue[async];


... this dereference.


+}



Oh, and:


--- libgomp/oacc-plugin.c   (revision 267226)
+++ libgomp/oacc-plugin.c   (working copy)
@@ -31,14 +31,10 @@
  #include "oacc-int.h"
  
  void

-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
+GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
+ int async __attribute__((unused)))
  {
-  struct target_mem_desc *tgt = ptr;
-  struct gomp_device_descr *devicep = tgt->device_descr;
-
-  devicep->openacc.async_set_async_func (async);
-  gomp_unmap_vars (tgt, true);
-  devicep->openacc.async_set_async_func (acc_async_sync);

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v2)

2018-12-18 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 18 Dec 2018 23:06:38 +0800, Chung-Lin Tang  
wrote:
> this part includes some of the lookup_goacc_asyncqueue fixes we talked about.
> I am still thinking about how the queue lock problem should really be solved, 
> so regard
> this patch as just fixing some of the problems.

Sure, thanks.

Two comments, though:

> --- libgomp/oacc-async.c  (revision 267226)
> +++ libgomp/oacc-async.c  (working copy)

> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
> +{
> +  /* The special value acc_async_noval (-1) maps to the thread-specific
> + default async stream.  */
> +  if (async == acc_async_noval)
> +async = thr->default_async;
> +
> +  if (async == acc_async_sync)
> +return NULL;
> +
> +  if (async < 0)
> +gomp_fatal ("bad async %d", async);
> +
> +  struct gomp_device_descr *dev = thr->dev;
> +
> +  gomp_mutex_lock (>openacc.async.lock);
> +
> +  if (!create
> +  && (async >= dev->openacc.async.nasyncqueue
> +   || !dev->openacc.async.asyncqueue[async]))
> +{
> +  gomp_mutex_unlock (>openacc.async.lock);
> +  return NULL;
> +}
> +
> +  if (async >= dev->openacc.async.nasyncqueue)
> +{
> +  int diff = async + 1 - dev->openacc.async.nasyncqueue;
> +  dev->openacc.async.asyncqueue
> + = gomp_realloc (dev->openacc.async.asyncqueue,
> + sizeof (goacc_aq) * (async + 1));
> +  memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
> +   0, sizeof (goacc_aq) * diff);
> +  dev->openacc.async.nasyncqueue = async + 1;
> +}
> +
> +  if (!dev->openacc.async.asyncqueue[async])
> +{
> +  dev->openacc.async.asyncqueue[async] = 
> dev->openacc.async.construct_func ();
> +
> +  if (!dev->openacc.async.asyncqueue[async])
> + {
> +   gomp_mutex_unlock (>openacc.async.lock);
> +   gomp_fatal ("async %d creation failed", async);
> + }

That will now always fail for host fallback, where
"host_openacc_async_construct" just always does "return NULL".

Actually, if the device doesn't support asyncqueues, this whole function
should turn into some kind of no-op, so that we don't again and again try
to create a new one for every call to "lookup_goacc_asyncqueue".

I'm attaching one possible solution.  I think it's fine to assume that
the majority of devices will support asyncqueues, and for those that
don't, this is just a one-time overhead per async-argument.  So, no
special handling required in "lookup_goacc_asyncqueue".

> +  /* Link new async queue into active list.  */
> +  goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
> +  n->aq = dev->openacc.async.asyncqueue[async];
> +  n->next = dev->openacc.async.active;
> +  dev->openacc.async.active = n;
> +}
> +  gomp_mutex_unlock (>openacc.async.lock);

You still need to keep "async" locked during...

> +  return dev->openacc.async.asyncqueue[async];

... this dereference.

> +}


Oh, and:

> --- libgomp/oacc-plugin.c (revision 267226)
> +++ libgomp/oacc-plugin.c (working copy)
> @@ -31,14 +31,10 @@
>  #include "oacc-int.h"
>  
>  void
> -GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
> +GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
> +   int async __attribute__((unused)))
>  {
> -  struct target_mem_desc *tgt = ptr;
> -  struct gomp_device_descr *devicep = tgt->device_descr;
> -
> -  devicep->openacc.async_set_async_func (async);
> -  gomp_unmap_vars (tgt, true);
> -  devicep->openacc.async_set_async_func (acc_async_sync);
> +  gomp_fatal ("invalid plugin function");
>  }

Please add a comment here, something like: "Obsolete entry point, no
longer used."


Grüße
 Thomas


>From 4cb99c3691f95b6b299e7cb2603af36f723f9e8e Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Tue, 18 Dec 2018 21:58:41 +0100
Subject: [PATCH] into async re-work: adjust host_openacc_async_construct

---
 libgomp/oacc-host.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 727f8866f45c..cfd8a24f0674 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -212,7 +212,8 @@ host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
 static struct goacc_asyncqueue *
 host_openacc_async_construct (void)
 {
-  return NULL;
+  /* We have to return non-NULL here, but it's OK to use a dummy.  */
+  return (struct goacc_asyncqueue *) -1;
 }
 
 static bool
-- 
2.17.1



Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts (revised, v2)

2018-12-18 Thread Chung-Lin Tang

On 2018/9/25 9:10 PM, Chung-Lin Tang wrote:

Hi Thomas,
These are the OpenACC specific changes, mostly the re-implementation of 
async-related acc_* runtime
library API functions to use the new backend plugin interfaces, in a non-target 
specific way.



Hi Thomas,
this part includes some of the lookup_goacc_asyncqueue fixes we talked about.
I am still thinking about how the queue lock problem should really be solved, 
so regard
this patch as just fixing some of the problems.


diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c
--- trunk-orig/libgomp/oacc-async.c 2018-12-14 22:11:29.252251925 +0800
+++ trunk-work/libgomp/oacc-async.c 2018-12-18 22:19:51.923102938 +0800
@@ -70,12 +70,16 @@
 
   struct gomp_device_descr *dev = thr->dev;
 
+  gomp_mutex_lock (>openacc.async.lock);
+
   if (!create
   && (async >= dev->openacc.async.nasyncqueue
  || !dev->openacc.async.asyncqueue[async]))
-return NULL;
+{
+  gomp_mutex_unlock (>openacc.async.lock);
+  return NULL;
+}
 
-  gomp_mutex_lock (>openacc.async.lock);
   if (async >= dev->openacc.async.nasyncqueue)
 {
   int diff = async + 1 - dev->openacc.async.nasyncqueue;
@@ -91,6 +95,12 @@
 {
   dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func 
();
 
+  if (!dev->openacc.async.asyncqueue[async])
+   {
+ gomp_mutex_unlock (>openacc.async.lock);
+ gomp_fatal ("async %d creation failed", async);
+   }
+  
   /* Link new async queue into active list.  */
   goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
   n->aq = dev->openacc.async.asyncqueue[async];
diff -ru trunk-orig/libgomp/oacc-host.c trunk-work/libgomp/oacc-host.c
--- trunk-orig/libgomp/oacc-host.c  2018-12-14 18:31:07.487203770 +0800
+++ trunk-work/libgomp/oacc-host.c  2018-12-18 22:23:26.771807667 +0800
@@ -266,6 +266,9 @@
 
   .exec_func = host_openacc_exec,
 
+  .create_thread_data_func = host_openacc_create_thread_data,
+  .destroy_thread_data_func = host_openacc_destroy_thread_data,
+
   .async = {
.construct_func = host_openacc_async_construct,
.destruct_func = host_openacc_async_destruct,
@@ -278,9 +281,6 @@
.host2dev_func = host_openacc_async_host2dev,
   },
 
-  .create_thread_data_func = host_openacc_create_thread_data,
-  .destroy_thread_data_func = host_openacc_destroy_thread_data,
-
   .cuda = {
.get_current_device_func = NULL,
.get_current_context_func = NULL,
diff -ru trunk-orig/libgomp/oacc-plugin.c trunk-work/libgomp/oacc-plugin.c
--- trunk-orig/libgomp/oacc-plugin.c2018-12-14 18:31:07.491203745 +0800
+++ trunk-work/libgomp/oacc-plugin.c2018-12-18 22:27:46.047722004 +0800
@@ -30,6 +30,13 @@
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 
+void
+GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
+ int async __attribute__((unused)))
+{
+  gomp_fatal ("invalid plugin function");
+}
+
 /* Return the target-specific part of the TLS data for the current thread.  */
 
 void *
diff -ru trunk-orig/libgomp/plugin/plugin-nvptx.c 
trunk-work/libgomp/plugin/plugin-nvptx.c
Index: libgomp/oacc-async.c
===
--- libgomp/oacc-async.c(revision 267226)
+++ libgomp/oacc-async.c(working copy)
@@ -27,10 +27,97 @@
.  */
 
 #include 
+#include 
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  gomp_mutex_lock (>openacc.async.lock);
+
+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+{
+  gomp_mutex_unlock (>openacc.async.lock);
+  return NULL;
+}
+
+  if (async >= dev->openacc.async.nasyncqueue)
+{
+  int diff = async + 1 - dev->openacc.async.nasyncqueue;
+  dev->openacc.async.asyncqueue
+   = gomp_realloc (dev->openacc.async.asyncqueue,
+   sizeof (goacc_aq) * (async + 1));
+  memset (dev->openacc.async.asyncqueue + 

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-18 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 18 Dec 2018 18:02:54 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/17 10:32 PM, Thomas Schwinge wrote:
> >> The reason there are deadlocks from inside the plugin on 
> >> GOMP_PLUGIN_fatal() is when we hold the
> >> struct gomp_device_descr's*device*  lock, which is also acquired when we 
> >> execute atexit device shutdown handlers, hence the deadlock.
> >>
> >> I don't think this is the case for the OpenACC entry points that grab at 
> >> the openacc.async.* hooks,
> > Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
> > some structural changes, to have plugin functions call the
> > non-terminating "GOMP_PLUGIN_error" and return some error, instead of
> > calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's
> > another TODO item for later, separately...  Or, if that's actually the
> > case, that this has been fixed in the way I described, then should these
> > functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"
> > call "GOMP_PLUGIN_error", and then return an error code?)
> 
> You remembered correctly, although...
> 
> >> though I can audit them again if deemed so.
> > My understanding had been that deadlock may happen if we're inside some
> > of these async/wait/serialize/synchronize functions, with "async" locked,
> > then run into an error, then libgomp prepares to abort, and at that time
> > shuts down the device, which will shut down the asyncqueues
> > ("goacc_fini_asyncqueues"), which will again try to lock "async" -- which
> > it actually doesn't.  My misunderstanding, I guess?
> 
> ...at least now, you can see that goacc_fini_asyncqueues() does not attempt to
> acquire devicep->openacc.async.lock when doing cleanup.
> 
> Come to think of it, that might be a bug there. :P

Heh, I wondered about that, too.  ;-)

An asyncqueue as returned by "lookup_goacc_asyncqueue" itself is not
locked (and I suppose it shouldn't be, because that would be "too
much"?), so it may -- at least (only?) in a specially constructed test
case -- happen that an asyncqueue gets destructed
("goacc_fini_asyncqueues") while it's still in use?  (Don't know how the
CUDA Driver library thinks of that, for example.  Though, probably, that
scenario can only happen if the device used by a certain host thread is
shut down while an "async" operation is still running.

But, can we easily avoid that issue by calling
"openacc.async.synchronize_func" before "openacc.async.destruct_func"
(or, have the latter do that internally)?  Just have to make sure that
any such synchonization then doesn't raise (potentially) nested
"GOMP_PLUGIN_fatal" calls.  Hence the TODO comment I added in my "into
async re-work: locking concerns" commit, before the
"openacc.async.destruct_func" call: "Can/should/must we "synchronize"
here (how?), so as to make sure that no other operation on this
asyncqueue is going on while/after we've destructed it here?"

Probably an error-ignoring "cuStreamSynchronize" call before
"cuStreamDestroy" would be reasonable?


Oh, and don't we have another problem...  Consider an "acc_shutdown" run
by host thread 1, while another host thread 2 continues to use the
device-wide queues.  That "acc_shutdown" will call "gomp_fini_device",
which will call "goacc_fini_asyncqueues", which will happily destruct the
whole device-wide "async" data.  Just taking the "async" lock before
doing that won't solve the problem, as host thread 2 is supposed to
continue using the existing queues.  Reference counting required?

Anyway: I'm not asking you to fix that now.  "Fortunately", we're not at
all properly implementing OpenACC usage in context of multiple host
threads (such as created by OpenMP or the pthreads interface), so I'm
just noting that issue now, to be resolved later (as part of our internal
tracker issues CSTS-110 or CSTS-115).


Anyway:

> >> "If there are two or more host threads executing and sharing the same 
> >> accelerator device,
> >> two asynchronous operations with the same async-value will be enqueued on 
> >> the same activity queue"
> > Right, but then, in the very next sentence, it goes on to state: "If the
> > threads are not synchronized with respect to each other, the operations
> > may be enqueued in either order and therefore may execute on the device
> > in either order".  So this, and given that:
> 
> I actually didn't care much about that next sentence, since it's just stating 
> the obvious :)

;-)

> It also seem to imply that the multiple host threads are enqueuing operations 
> to the same async queue, hence further
> corroborating that queues are device-wide, not thread.

OK, that's your (certainly valid) interpretation; mine was to make our
life simpler:

> >> That said, I recall most (if not all) of the synchronization operations 
> >> and behavior are all
> >> defined to be with respect to operations of the local host thread only, so 
> >> the spec mentioning interaction with
> >> other host threads here may be moot, as there's no 

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-18 Thread Chung-Lin Tang

On 2018/12/17 10:32 PM, Thomas Schwinge wrote:

The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is 
when we hold the
struct gomp_device_descr's*device*  lock, which is also acquired when we 
execute atexit device shutdown handlers, hence the deadlock.

I don't think this is the case for the OpenACC entry points that grab at the 
openacc.async.* hooks,

Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
some structural changes, to have plugin functions call the
non-terminating "GOMP_PLUGIN_error" and return some error, instead of
calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's
another TODO item for later, separately...  Or, if that's actually the
case, that this has been fixed in the way I described, then should these
functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"
call "GOMP_PLUGIN_error", and then return an error code?)


You remembered correctly, although...


though I can audit them again if deemed so.

My understanding had been that deadlock may happen if we're inside some
of these async/wait/serialize/synchronize functions, with "async" locked,
then run into an error, then libgomp prepares to abort, and at that time
shuts down the device, which will shut down the asyncqueues
("goacc_fini_asyncqueues"), which will again try to lock "async" -- which
it actually doesn't.  My misunderstanding, I guess?


...at least now, you can see that goacc_fini_asyncqueues() does not attempt to
acquire devicep->openacc.async.lock when doing cleanup.

Come to think of it, that might be a bug there. :P


"If there are two or more host threads executing and sharing the same 
accelerator device,
two asynchronous operations with the same async-value will be enqueued on the same 
activity queue"

Right, but then, in the very next sentence, it goes on to state: "If the
threads are not synchronized with respect to each other, the operations
may be enqueued in either order and therefore may execute on the device
in either order".  So this, and given that:


I actually didn't care much about that next sentence, since it's just stating 
the obvious :)

It also seem to imply that the multiple host threads are enqueuing operations 
to the same async queue, hence further
corroborating that queues are device-wide, not thread.


That said, I recall most (if not all) of the synchronization operations and 
behavior are all
defined to be with respect to operations of the local host thread only, so the 
spec mentioning interaction with
other host threads here may be moot, as there's no way meaningful way to 
synchronize with
the OpenACC activity of other host threads (though correct me if I forgot some 
API)

..., I concluded something must be wrong in the OpenACC 2.6,
2.16.1. "async clause" text, and no such (host-side) inter-thread
synchronization can be expected from OpenACC "async"/"wait".  I've also
already got that on my list of things to clarify with the OpenACC
technical committee, later on.


I just remembered, there does seem to be one area where device vs. thread-wide 
interpretation will be visible:
when using acc_get/set_cuda_stream(). Supposedly, given the specification's 
device-wide queue/stream model,
different host-threads should access the same CUDA stream when using 
acc_get/set_cuda_stream().
This will break if we made async queues to be thread-local.


Also, CUDA streams do not seem to support local-thread-operation-only 
synchronization.
I remember this was an issue in the old implementation inside the nvptx plugin 
as well, and we
had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" 
threaded)

Right.


Well, another issue that we might want to bring up to the OpenACC committee:)
I agree that if async queues spaces were simply thread-local then things would 
be much simpler.

OK, so you agree with that, good.

And, no problem foreseeable about simply moving the asyncqueues into
"goacc_thread" -- and removing the "async" lock?


I think we should still try to solve the potential deadlock problems, and stay 
close to the current
implementation just for now. We can ask the committee for further guidance 
later.

Chung-Lin


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-18 Thread Chung-Lin Tang

On 2018/12/17 9:52 PM, Thomas Schwinge wrote:

Hi Chung-Lin!

On Fri, 14 Dec 2018 22:52:44 +0800, Chung-Lin Tang  
wrote:

On 2018/12/14 10:17 PM, Thomas Schwinge wrote:

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:

--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c



+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);


To make this "resolve" part more obvious, that is, the translation from
the "async" argument to an "asyncqueue" array index:


+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+return NULL;
+[...]


..., I propose adding a "async2id" function for that, and then rename all
"asyncqueue[async]" to "asyncqueue[id]".


I don't think this is needed. This is the only place in the entire runtime that
does asyncqueue indexing, adding more conceptual layers of re-directed indexing
seems unneeded.


It makes the code better understandable?  Or, curious, why do you think
that the translation from an OpenACC async-argument to an internal
asyncqueue ID should not be a separate function?


Because the index is (1) not used elsewhere; nor supposed to really, 
lookup_goacc_asyncqueue()
is intended to be the centralized place for looking up async queues.
and (2) the special async number case handling here is really short, creating 
another
conceptual index-redirecting in the code feels like over-engineering.


I do think the more descriptive comments are nice though.




And, this also restores the current trunk behavior, so that
"acc_async_noval" gets its own, separate "asyncqueue".


Is there a reason we need to restore that behavior right now?


Because otherwise that's a functional change ("regression") from the
current GCC trunk behavior, which I wouldn't expect in a re-work.


Okay, but do take note that the acc_get/set_default_async is part of the 
upstreaming too.
The behavior change is due to that new 2.5 functionality, not really because I 
arbitrarily changed things.

Thanks,
Chung-Lin



Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-17 Thread Thomas Schwinge
Hi Chung-Lin!

On Mon, 17 Dec 2018 19:03:12 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/14 10:56 PM, Thomas Schwinge wrote:
> > +  //TODO Are these safe to call, or might this cause deadlock if 
> > something's locked?
> > CUDA_CALL_ASSERT (cuEventCreate, , CU_EVENT_DISABLE_TIMING);
> > CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
> > CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
> > @@ -1413,6 +1416,7 @@ static void
> >   cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
> >   {
> > if (res != CUDA_SUCCESS)
> > +//TODO Is this safe to call, or might this cause deadlock if 
> > something's locked?
> >   GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
> 
> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() 
> is when we hold the
> struct gomp_device_descr's *device* lock, which is also acquired when we 
> execute atexit device shutdown handlers, hence the deadlock.
> 
> I don't think this is the case for the OpenACC entry points that grab at the 
> openacc.async.* hooks,

Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
some structural changes, to have plugin functions call the
non-terminating "GOMP_PLUGIN_error" and return some error, instead of
calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's
another TODO item for later, separately...  Or, if that's actually the
case, that this has been fixed in the way I described, then should these
functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"
call "GOMP_PLUGIN_error", and then return an error code?)

> though I can audit them again if deemed so.

My understanding had been that deadlock may happen if we're inside some
of these async/wait/serialize/synchronize functions, with "async" locked,
then run into an error, then libgomp prepares to abort, and at that time
shuts down the device, which will shut down the asyncqueues
("goacc_fini_asyncqueues"), which will again try to lock "async" -- which
it actually doesn't.  My misunderstanding, I guess?


> > But then, I wonder if we couldn't skip all that locking, if we moved the
> > "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?
> > 
> 
> > struct {
> > +//TODO Why do these live in the "device" data structure, and not in 
> > the "per-thread" data structure?
> > +//TODO Aren't they meant to be separate per thread?
> > +//TODO That is, as far as I remember right now, OpenACC explicitly 
> > states that an asyncqueue doesn't entail any synchronization between 
> > different host threads.
> > +//TODO Verify OpenACC.
> > +//TODO With that moved into "goacc_thread", we could get rid of all 
> > the locking needed here?
> >   /* Once created and put into the "active" list, asyncqueues are then 
> > never
> >  destructed and removed from the "active" list, other than if the 
> > TODO
> >  device is shut down.  */
> > 
> > At this point, I will again (as in that other email) state that my
> > understanding of OpenACC is that an async queue does not entail any
> > inter-thread synchronization, so it would seem reasonable that all async
> > queues are separate per thread.
> 
> OpenACC 2.6 specification, section 2.16.1 "async clause (page 68, line 2029):
> 
> "If there are two or more host threads executing and sharing the same 
> accelerator device,
> two asynchronous operations with the same async-value will be enqueued on the 
> same activity queue"

Right, but then, in the very next sentence, it goes on to state: "If the
threads are not synchronized with respect to each other, the operations
may be enqueued in either order and therefore may execute on the device
in either order".  So this, and given that:

> That said, I recall most (if not all) of the synchronization operations and 
> behavior are all
> defined to be with respect to operations of the local host thread only, so 
> the spec mentioning interaction with
> other host threads here may be moot, as there's no way meaningful way to 
> synchronize with
> the OpenACC activity of other host threads (though correct me if I forgot 
> some API)

..., I concluded something must be wrong in the OpenACC 2.6,
2.16.1. "async clause" text, and no such (host-side) inter-thread
synchronization can be expected from OpenACC "async"/"wait".  I've also
already got that on my list of things to clarify with the OpenACC
technical committee, later on.

> Also, CUDA streams do not seem to support local-thread-operation-only 
> synchronization.
> I remember this was an issue in the old implementation inside the nvptx 
> plugin as well, and we
> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" 
> threaded)

Right.

> Well, another issue that we might want to bring up to the OpenACC committee :)
> I agree that if async queues spaces were simply thread-local then things 
> would be much simpler.

OK, so you agree with that, good.

And, no problem 

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-17 Thread Thomas Schwinge
Hi Chung-Lin!

On Fri, 14 Dec 2018 23:00:57 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/14 10:53 PM, Thomas Schwinge wrote:
> > Please comment on the one TODO
> > which before your async re-work also was -- incorrectly? -- run
> > asynchronously?

> > @@ -563,6 +563,8 @@ GOACC_update (int device, size_t mapnum,
> >  the value of the allocated device memory in the
> >  previous pointer.  */
> >   *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;
> > + /* This is intentionally no calling acc_update_device_async,
> > +because TODO.  */
> >   acc_update_device (hostaddrs[i], sizeof (uintptr_t));
> >   
> >   /* Restore the host pointer.  */
| *(uintptr_t *) hostaddrs[i] = t;
> 
> I don't remember adding this piece of comment, it might have been Cesar I 
> guess?

That "TODO" comment, you mean?  It was me who just added that one, to
highlight this, to ask you to comment, whether it's feasible to turn that
"acc_update_device" into "acc_update_device_async", too, or if that has
intentionally not been done, given that before your async re-work that
also was -- incorrectly? -- run asynchronously.

> I'm not sure if there's any real reason not to use acc_update_device_async 
> here...

My worry was that the data object being copied here ("*hostaddrs[i]") is
changed immediatelly after the "acc_update_device" call (now inlined
above), so if that update get enqueued for asynchronous execution, it
might then actually copy the value after "Restore the host pointer".

Given the code before and after it, maybe "acc_update_device" is
generally the wrong function to call here?  (That's, again, a separate
change, please.)

I have not yet researched where that code is coming from, and what it's
supposed to be used for.  But as part of the async re-work we should at
least understand that one, too.

> Change and test to see?

Generally, when testing asynchronous behavior, I'd be wary of reading too
much into any such test results ("still PASSes" -- by chance?).  Unless,
of course, there's then a clear regression somewhere ("now FAILs").

..., which there isn't, because adding a "gomp_fatal" in that code path,
that doesn't trigger a single time when running the libgomp testsuite...


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-17 Thread Thomas Schwinge
Hi Chung-Lin!

On Fri, 14 Dec 2018 22:42:28 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/14 10:32 PM, Thomas Schwinge wrote:
> > Invoked as "acc_wait_async ([...], acc_async_sync)" (as used in a test
> > case that I'll soon submit/commit), we'll end up with "aq2 == NULL", and
> > will segfault in the nvptx "openacc.async.serialize_func".
> 
> What does "wait async(acc_async_sync)" supposed to mean?

In my understanding, that'll translate to just "wait" without an "async"
clause, thus synchronous with the local (host) thread.

> Instead of fixing
> it here, will it make more sense to have the serialize_func hook to 
> accommodate
> the NULL asyncqueue?

Sure, that may make sense, yes.  Right: if there's no asyncqueue to
serialize with, then serialize/synchronize with the local (host) thread.


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-17 Thread Thomas Schwinge
Hi Chung-Lin!

On Fri, 14 Dec 2018 22:52:44 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/14 10:17 PM, Thomas Schwinge wrote:
> > On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang 
> >  wrote:
> >> --- a/libgomp/oacc-async.c
> >> +++ b/libgomp/oacc-async.c
> > 
> >> +attribute_hidden struct goacc_asyncqueue *
> >> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
> >> +{
> >> +  /* The special value acc_async_noval (-1) maps to the thread-specific
> >> + default async stream.  */
> >> +  if (async == acc_async_noval)
> >> +async = thr->default_async;
> >> +
> >> +  if (async == acc_async_sync)
> >> +return NULL;
> >> +
> >> +  if (async < 0)
> >> +gomp_fatal ("bad async %d", async);
> > 
> > To make this "resolve" part more obvious, that is, the translation from
> > the "async" argument to an "asyncqueue" array index:
> > 
> >> +  if (!create
> >> +  && (async >= dev->openacc.async.nasyncqueue
> >> +|| !dev->openacc.async.asyncqueue[async]))
> >> +return NULL;
> >> +[...]
> > 
> > ..., I propose adding a "async2id" function for that, and then rename all
> > "asyncqueue[async]" to "asyncqueue[id]".
> 
> I don't think this is needed. This is the only place in the entire runtime 
> that
> does asyncqueue indexing, adding more conceptual layers of re-directed 
> indexing
> seems unneeded.

It makes the code better understandable?  Or, curious, why do you think
that the translation from an OpenACC async-argument to an internal
asyncqueue ID should not be a separate function?


> I do think the more descriptive comments are nice though.


> > And, this also restores the current trunk behavior, so that
> > "acc_async_noval" gets its own, separate "asyncqueue".
> 
> Is there a reason we need to restore that behavior right now?

Because otherwise that's a functional change ("regression") from the
current GCC trunk behavior, which I wouldn't expect in a re-work.


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-17 Thread Chung-Lin Tang

On 2018/12/14 10:56 PM, Thomas Schwinge wrote:

Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:

--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+return NULL;
+

Doesn't this last block also have to be included in the lock you're
taking below?


Good catch, I'll revise this.


-  gomp_mutex_lock (>openacc.async.lock);
if (id >= dev->openacc.async.nasyncqueue)
  {
int diff = id + 1 - dev->openacc.async.nasyncqueue;
+  // TODO gomp_realloc might call "gomp_fatal" with 
">openacc.async.lock" locked.  Might cause deadlock?
dev->openacc.async.asyncqueue
= gomp_realloc (dev->openacc.async.asyncqueue,
sizeof (goacc_aq) * (id + 1));
@@ -105,16 +109,23 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool 
create, int async)
  
if (!dev->openacc.async.asyncqueue[id])

  {
+  //TODO We have ">openacc.async.lock" locked here, and if "openacc.async.construct_func" 
calls "GOMP_PLUGIN_fatal" (via "CUDA_CALL_ASSERT", for example), that might cause deadlock?
+  //TODO Change the interface to emit an error in the plugin, but then "return 
NULL", and we catch that here, unlock, and bail out?
dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func 
();



+  //TODO Are these safe to call, or might this cause deadlock if something's 
locked?
CUDA_CALL_ASSERT (cuEventCreate, , CU_EVENT_DISABLE_TIMING);
CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
@@ -1413,6 +1416,7 @@ static void
  cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
  {
if (res != CUDA_SUCCESS)
+//TODO Is this safe to call, or might this cause deadlock if something's 
locked?
  GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));


The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is 
when we hold the
struct gomp_device_descr's *device* lock, which is also acquired when we 
execute atexit device shutdown handlers, hence the deadlock.

I don't think this is the case for the OpenACC entry points that grab at the 
openacc.async.* hooks,
though I can audit them again if deemed so.


But then, I wonder if we couldn't skip all that locking, if we moved the
"asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?




struct {
+//TODO Why do these live in the "device" data structure, and not in the 
"per-thread" data structure?
+//TODO Aren't they meant to be separate per thread?
+//TODO That is, as far as I remember right now, OpenACC explicitly states 
that an asyncqueue doesn't entail any synchronization between different host 
threads.
+//TODO Verify OpenACC.
+//TODO With that moved into "goacc_thread", we could get rid of all the 
locking needed here?
  /* Once created and put into the "active" list, asyncqueues are then never
 destructed and removed from the "active" list, other than if the TODO
 device is shut down.  */

At this point, I will again (as in that other email) state that my
understanding of OpenACC is that an async queue does not entail any
inter-thread synchronization, so it would seem reasonable that all async
queues are separate per thread.


OpenACC 2.6 specification, section 2.16.1 "async clause (page 68, line 2029):

"If there are two or more host threads executing and sharing the same 
accelerator device,
two asynchronous operations with the same async-value will be enqueued on the same 
activity queue"

That said, I recall most (if not all) of the synchronization operations and 
behavior are all
defined to be with respect to operations of the local host thread only, so the 
spec mentioning interaction with
other host threads here may be moot, as there's no way meaningful way to 
synchronize with
the OpenACC activity of other host threads (though correct me if I forgot some 
API)

Also, CUDA streams do not seem to support local-thread-operation-only 
synchronization.
I remember this was an issue in the old implementation inside the nvptx plugin 
as well, and we
had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" 
threaded)

Well, another issue that we might want to bring up to the OpenACC committee :)
I agree that if async queues spaces were simply thread-local then things would 
be much simpler.

Thanks,
Chung-Lin


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Chung-Lin Tang

On 2018/12/14 10:53 PM, Thomas Schwinge wrote:

Additionally the following, or why not?  Please comment on the one TODO
which before your async re-work also was -- incorrectly? -- run
asynchronously?




diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 5a441c9efe38..91875c57fc97 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -413,11 +413,11 @@ GOACC_enter_exit_data (int device, size_t mapnum,
{
case GOMP_MAP_ALLOC:
case GOMP_MAP_FORCE_ALLOC:
- acc_create (hostaddrs[i], sizes[i]);
+ acc_create_async (hostaddrs[i], sizes[i], async);
  break;
case GOMP_MAP_TO:
case GOMP_MAP_FORCE_TO:
- acc_copyin (hostaddrs[i], sizes[i]);
+ acc_copyin_async (hostaddrs[i], sizes[i], async);
  break;
default:


Yes! I think these were somehow missed by mistake. Thanks for catching!


  gomp_fatal (" GOACC_enter_exit_data UNHANDLED kind 
0x%.2x",
@@ -563,6 +563,8 @@ GOACC_update (int device, size_t mapnum,
 the value of the allocated device memory in the
 previous pointer.  */
  *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;
+ /* This is intentionally no calling acc_update_device_async,
+because TODO.  */
  acc_update_device (hostaddrs[i], sizeof (uintptr_t));
  
  	  /* Restore the host pointer.  */


I don't remember adding this piece of comment, it might have been Cesar I guess?
I'm not sure if there's any real reason not to use acc_update_device_async 
here...
Change and test to see?

Thanks,
Chung-Lin


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:
> --- a/libgomp/oacc-async.c
> +++ b/libgomp/oacc-async.c

> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
> +{
> +  /* The special value acc_async_noval (-1) maps to the thread-specific
> + default async stream.  */
> +  if (async == acc_async_noval)
> +async = thr->default_async;
> +
> +  if (async == acc_async_sync)
> +return NULL;
> +
> +  if (async < 0)
> +gomp_fatal ("bad async %d", async);
> +
> +  struct gomp_device_descr *dev = thr->dev;
> +
> +  if (!create
> +  && (async >= dev->openacc.async.nasyncqueue
> +   || !dev->openacc.async.asyncqueue[async]))
> +return NULL;
> +

Doesn't this last block also have to be included in the lock you're
taking below?

> +  gomp_mutex_lock (>openacc.async.lock);
> +  if (async >= dev->openacc.async.nasyncqueue)
> +{
> +  int diff = async + 1 - dev->openacc.async.nasyncqueue;
> +  dev->openacc.async.asyncqueue
> + = gomp_realloc (dev->openacc.async.asyncqueue,
> + sizeof (goacc_aq) * (async + 1));
> +  memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
> +   0, sizeof (goacc_aq) * diff);
> +  dev->openacc.async.nasyncqueue = async + 1;
> +}
> +
> +  if (!dev->openacc.async.asyncqueue[async])
> +{
> +  dev->openacc.async.asyncqueue[async] = 
> dev->openacc.async.construct_func ();
> +
> +  /* Link new async queue into active list.  */
> +  goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
> +  n->aq = dev->openacc.async.asyncqueue[async];
> +  n->next = dev->openacc.async.active;
> +  dev->openacc.async.active = n;
> +}
> +  gomp_mutex_unlock (>openacc.async.lock);
> +  return dev->openacc.async.asyncqueue[async];
> +}

And then, some more concerns, as encoded in the following patch (but
please also continue reading below):

commit d2d6aaeca840debbec14e421be705ef56d444ac7
Author: Thomas Schwinge 
Date:   Wed Dec 12 15:57:30 2018 +0100

into async re-work: locking concerns
---
 libgomp/oacc-async.c  | 18 +++---
 libgomp/plugin/plugin-nvptx.c |  6 ++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 89a405ebcdb1..68e4e65e8182 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -84,17 +84,21 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool 
create, int async)
   if (id < 0)
 return NULL;
 
+  struct goacc_asyncqueue *ret = NULL;
+
   struct gomp_device_descr *dev = thr->dev;
 
+  gomp_mutex_lock (>openacc.async.lock);
+
   if (!create
   && (id >= dev->openacc.async.nasyncqueue
  || !dev->openacc.async.asyncqueue[id]))
-return NULL;
+goto out;
 
-  gomp_mutex_lock (>openacc.async.lock);
   if (id >= dev->openacc.async.nasyncqueue)
 {
   int diff = id + 1 - dev->openacc.async.nasyncqueue;
+  // TODO gomp_realloc might call "gomp_fatal" with 
">openacc.async.lock" locked.  Might cause deadlock?
   dev->openacc.async.asyncqueue
= gomp_realloc (dev->openacc.async.asyncqueue,
sizeof (goacc_aq) * (id + 1));
@@ -105,16 +109,23 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool 
create, int async)
 
   if (!dev->openacc.async.asyncqueue[id])
 {
+  //TODO We have ">openacc.async.lock" locked here, and if 
"openacc.async.construct_func" calls "GOMP_PLUGIN_fatal" (via 
"CUDA_CALL_ASSERT", for example), that might cause deadlock?
+  //TODO Change the interface to emit an error in the plugin, but then 
"return NULL", and we catch that here, unlock, and bail out?
   dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();
 
   /* Link new async queue into active list.  */
+  // TODO gomp_malloc might call "gomp_fatal" with 
">openacc.async.lock" locked.  Might cause deadlock?
   goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
   n->aq = dev->openacc.async.asyncqueue[id];
   n->next = dev->openacc.async.active;
   dev->openacc.async.active = n;
 }
+  ret = dev->openacc.async.asyncqueue[id];
+
+ out:
   gomp_mutex_unlock (>openacc.async.lock);
-  return dev->openacc.async.asyncqueue[id];
+
+  return ret;
 }
 
 /* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
@@ -305,6 +316,7 @@ goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
   goacc_aq_list next;
   for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
{
+ //TODO Can/should/must we "synchronize" here (how?), so as to make 
sure that no other operation on this asyncqueue is going on while/after we've 
destructed it here?
  ret &= devicep->openacc.async.destruct_func (l->aq);
  next = l->next;
  free (l);
diff --git libgomp/plugin/plugin-nvptx.c 

Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:
> --- a/libgomp/oacc-parallel.c
> +++ b/libgomp/oacc-parallel.c
> @@ -377,8 +360,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
>   finalize = true;
>  }
>  
> -  acc_dev->openacc.async_set_async_func (async);
> -
>/* Determine if this is an "acc enter data".  */
>for (i = 0; i < mapnum; ++i)
>  {
> @@ -450,7 +431,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
> else
>   {
> gomp_acc_insert_pointer (pointer, [i],
> -[i], [i]);
> +[i], [i], async);
> /* Increment 'i' by two because OpenACC requires fortran
>arrays to be contiguous, so each PSET is associated with
>one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
> @@ -475,17 +456,17 @@ GOACC_enter_exit_data (int device, size_t mapnum,
>   if (acc_is_present (hostaddrs[i], sizes[i]))
> {
>   if (finalize)
> -   acc_delete_finalize (hostaddrs[i], sizes[i]);
> +   acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
>   else
> -   acc_delete (hostaddrs[i], sizes[i]);
> +   acc_delete_async (hostaddrs[i], sizes[i], async);
> }
>   break;
> case GOMP_MAP_FROM:
> case GOMP_MAP_FORCE_FROM:
>   if (finalize)
> -   acc_copyout_finalize (hostaddrs[i], sizes[i]);
> +   acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
>   else
> -   acc_copyout (hostaddrs[i], sizes[i]);
> +   acc_copyout_async (hostaddrs[i], sizes[i], async);
>   break;
> default:
>   gomp_fatal (" GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
> @@ -503,8 +484,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
>   i += pointer - 1;
> }
>}
> -
> -  acc_dev->openacc.async_set_async_func (acc_async_sync);
>  }

Additionally the following, or why not?  Please comment on the one TODO
which before your async re-work also was -- incorrectly? -- run
asynchronously?

commit 34c9ce65ad1f9865d0716d18c364d8c6928e694c
Author: Thomas Schwinge 
Date:   Fri Dec 14 14:34:17 2018 +0100

into async re-work: more async function usage
---
 libgomp/oacc-parallel.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 5a441c9efe38..91875c57fc97 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -413,11 +413,11 @@ GOACC_enter_exit_data (int device, size_t mapnum,
{
case GOMP_MAP_ALLOC:
case GOMP_MAP_FORCE_ALLOC:
- acc_create (hostaddrs[i], sizes[i]);
+ acc_create_async (hostaddrs[i], sizes[i], async);
  break;
case GOMP_MAP_TO:
case GOMP_MAP_FORCE_TO:
- acc_copyin (hostaddrs[i], sizes[i]);
+ acc_copyin_async (hostaddrs[i], sizes[i], async);
  break;
default:
  gomp_fatal (" GOACC_enter_exit_data UNHANDLED kind 
0x%.2x",
@@ -563,6 +563,8 @@ GOACC_update (int device, size_t mapnum,
 the value of the allocated device memory in the
 previous pointer.  */
  *(uintptr_t *) hostaddrs[i] = (uintptr_t)dptr;
+ /* This is intentionally no calling acc_update_device_async,
+because TODO.  */
  acc_update_device (hostaddrs[i], sizeof (uintptr_t));
 
  /* Restore the host pointer.  */


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Chung-Lin Tang

On 2018/12/14 10:17 PM, Thomas Schwinge wrote:

Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:

--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c



+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);


To make this "resolve" part more obvious, that is, the translation from
the "async" argument to an "asyncqueue" array index:


+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+return NULL;
+[...]


..., I propose adding a "async2id" function for that, and then rename all
"asyncqueue[async]" to "asyncqueue[id]".


I don't think this is needed. This is the only place in the entire runtime that
does asyncqueue indexing, adding more conceptual layers of re-directed indexing
seems unneeded.

I do think the more descriptive comments are nice though.


And, this also restores the current trunk behavior, so that
"acc_async_noval" gets its own, separate "asyncqueue".


Is there a reason we need to restore that behavior right now?

Thanks,
Chung-Lin



Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Chung-Lin Tang

On 2018/12/14 10:32 PM, Thomas Schwinge wrote:

Invoked as "acc_wait_async ([...], acc_async_sync)" (as used in a test
case that I'll soon submit/commit), we'll end up with "aq2 == NULL", and
will segfault in the nvptx "openacc.async.serialize_func".


What does "wait async(acc_async_sync)" supposed to mean? Instead of fixing
it here, will it make more sense to have the serialize_func hook to accommodate
the NULL asyncqueue?

Chung-Lin



Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:
>  void
>  acc_wait_async (int async1, int async2)
>  {
> +  struct goacc_thread *thr = get_goacc_thread ();
>  
> +  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
> +  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
> +  if (!aq1)
> +gomp_fatal ("invalid async 1");
> +  if (aq1 == aq2)
> +gomp_fatal ("identical parameters");
>  
> +  thr->dev->openacc.async.synchronize_func (aq1);
> +  thr->dev->openacc.async.serialize_func (aq1, aq2);
>  }

Invoked as "acc_wait_async ([...], acc_async_sync)" (as used in a test
case that I'll soon submit/commit), we'll end up with "aq2 == NULL", and
will segfault in the nvptx "openacc.async.serialize_func".

Good to fix as follows?

commit 448ff855bd954a72b5edb19fc1f3d481833fcb59
Author: Thomas Schwinge 
Date:   Thu Dec 13 17:43:42 2018 +0100

into async re-work: adjust for test case added in "[PR88484] OpenACC wait 
directive without wait argument but with async clause"
---
 libgomp/oacc-async.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 7e61b5dc0a05..a38e42781aa0 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -196,7 +196,8 @@ acc_wait_async (int async1, int async2)
 gomp_fatal ("identical parameters");
 
   thr->dev->openacc.async.synchronize_func (aq1);
-  thr->dev->openacc.async.serialize_func (aq1, aq2);
+  if (aq2)
+thr->dev->openacc.async.serialize_func (aq1, aq2);
 }
 
 void


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:
> --- a/libgomp/oacc-async.c
> +++ b/libgomp/oacc-async.c

> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
> +{
> +  /* The special value acc_async_noval (-1) maps to the thread-specific
> + default async stream.  */
> +  if (async == acc_async_noval)
> +async = thr->default_async;
> +
> +  if (async == acc_async_sync)
> +return NULL;
> +
> +  if (async < 0)
> +gomp_fatal ("bad async %d", async);

To make this "resolve" part more obvious, that is, the translation from
the "async" argument to an "asyncqueue" array index:

> +  if (!create
> +  && (async >= dev->openacc.async.nasyncqueue
> +   || !dev->openacc.async.asyncqueue[async]))
> +return NULL;
> +[...]

..., I propose adding a "async2id" function for that, and then rename all
"asyncqueue[async]" to "asyncqueue[id]".

And, this also restores the current trunk behavior, so that
"acc_async_noval" gets its own, separate "asyncqueue".

commit e0d10cd744906c031af536bbf523ed6607370bf7
Author: Thomas Schwinge 
Date:   Wed Dec 12 15:22:29 2018 +0100

into async re-work: libgomp/oacc-async.c:async2id
---
 libgomp/oacc-async.c | 58 +++-
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index c9b134ac3380..b091ba2460ac 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -54,53 +54,73 @@ get_goacc_thread_device (void)
   return thr->dev;
 }
 
-attribute_hidden struct goacc_asyncqueue *
-lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+/* Translate from an OpenACC async-argument to an internal asyncqueue ID, or -1
+   if no asyncqueue is to be used.  */
+
+static int
+async2id (int async)
 {
-  /* The special value acc_async_noval (-1) maps to the thread-specific
- default async stream.  */
-  if (async == acc_async_noval)
-async = 0; //TODO thr->default_async;
+  if (!async_valid_p (async))
+gomp_fatal ("invalid async-argument: %d", async);
 
   if (async == acc_async_sync)
+return -1;
+  else if (async == acc_async_noval)
+return 0;
+  else if (async >= 0)
+return 1 + async;
+  else
+__builtin_unreachable ();
+}
+
+/* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
+   might return NULL if no asyncqueue is to be used.  Otherwise, if CREATE,
+   create the asyncqueue if it doesn't exist yet.  */
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  int id = async2id (async);
+  if (id < 0)
 return NULL;
 
-  if (async < 0)
-gomp_fatal ("bad async %d", async);
-
   struct gomp_device_descr *dev = thr->dev;
 
   if (!create
-  && (async >= dev->openacc.async.nasyncqueue
- || !dev->openacc.async.asyncqueue[async]))
+  && (id >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[id]))
 return NULL;
 
   gomp_mutex_lock (>openacc.async.lock);
-  if (async >= dev->openacc.async.nasyncqueue)
+  if (id >= dev->openacc.async.nasyncqueue)
 {
-  int diff = async + 1 - dev->openacc.async.nasyncqueue;
+  int diff = id + 1 - dev->openacc.async.nasyncqueue;
   dev->openacc.async.asyncqueue
= gomp_realloc (dev->openacc.async.asyncqueue,
-   sizeof (goacc_aq) * (async + 1));
+   sizeof (goacc_aq) * (id + 1));
   memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
  0, sizeof (goacc_aq) * diff);
-  dev->openacc.async.nasyncqueue = async + 1;
+  dev->openacc.async.nasyncqueue = id + 1;
 }
 
-  if (!dev->openacc.async.asyncqueue[async])
+  if (!dev->openacc.async.asyncqueue[id])
 {
-  dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func 
();
+  dev->openacc.async.asyncqueue[id] = dev->openacc.async.construct_func ();
 
   /* Link new async queue into active list.  */
   goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
-  n->aq = dev->openacc.async.asyncqueue[async];
+  n->aq = dev->openacc.async.asyncqueue[id];
   n->next = dev->openacc.async.active;
   dev->openacc.async.active = n;
 }
   gomp_mutex_unlock (>openacc.async.lock);
-  return dev->openacc.async.asyncqueue[async];
+  return dev->openacc.async.asyncqueue[id];
 }
 
+/* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
+   might return NULL if no asyncqueue is to be used.  Otherwise, create the
+   asyncqueue if it doesn't exist yet.  */
+
 attribute_hidden struct goacc_asyncqueue *
 get_goacc_asyncqueue (int async)
 {


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-14 Thread Thomas Schwinge
Hi Chung-Lin!

On Fri, 7 Dec 2018 22:19:14 +0800, Chung-Lin Tang  
wrote:
> On 2018/12/7 07:32 PM, Thomas Schwinge wrote:
> > Does the following make sense?
> 
> I don't quite remember why I simply ensured asyncqueue creation here at the 
> time,
> maybe simply because it allowed simpler code at this level.

Well, I think it's just overhead we can avoid.  ;-)

> OTOH, the old logic is to GOMP_fatal upon such an unknown queue case, so 
> maybe that's
> the right thing to do (inside lookup_goacc_asyncqueue()), instead of silently 
> allowing it to pass.
> 
> WDYT?

I argued and posted patches (or will post if not yet done) to make this
defined, valid behavior,  "[OpenACC]
Correctly handle unseen async-arguments".  Please speak up soon if you
disagree.

Thus, I still propose that you include the following.

Please especially review the "libgomp/oacc-parallel.c:goacc_wait" change,
and confirm no corresponding "libgomp/oacc-parallel.c:GOACC_wait" change
to be done, because that code is structured differently.

commit c96c6607b77bdbf562f35209718d8b8c5705c603
Author: Thomas Schwinge 
Date:   Fri Dec 7 12:19:56 2018 +0100

into async re-work: don't create an asyncqueue just to then 
test/synchronize with it
---
 libgomp/oacc-async.c| 12 
 libgomp/oacc-parallel.c |  4 +++-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 553082fe3d4a..c9b134ac3380 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -119,8 +119,11 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
 gomp_fatal ("no device active");
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  return thr->dev->openacc.async.test_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (!aq)
+return 1;
+  else
+return thr->dev->openacc.async.test_func (aq);
 }
 
 int
@@ -148,8 +151,9 @@ acc_wait (int async)
 
   struct goacc_thread *thr = get_goacc_thread ();
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  thr->dev->openacc.async.synchronize_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (aq)
+thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 2815a10f0386..9519abeccc2c 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -493,7 +493,9 @@ goacc_wait (int async, int num_waits, va_list *ap)
 {
   int qid = va_arg (*ap, int);
   
-  goacc_aq aq = get_goacc_asyncqueue (qid);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);
+  if (!aq)
+   continue;
   if (acc_dev->openacc.async.test_func (aq))
continue;
   if (async == acc_async_sync)


Grüße
 Thomas


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-07 Thread Chung-Lin Tang
On 2018/12/7 07:32 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!
> 
> On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
> wrote:
>> These are the OpenACC specific changes, mostly the re-implementation of 
>> async-related acc_* runtime
>> library API functions to use the new backend plugin interfaces, in a 
>> non-target specific way.
> 
> (The patch was missing the "libgomp/oacc-int.h" changes, but I had no
> problem replicating these from openacc-gcc-8-branch.)
> 
> 
> Does the following make sense?

I don't quite remember why I simply ensured asyncqueue creation here at the 
time,
maybe simply because it allowed simpler code at this level.

OTOH, the old logic is to GOMP_fatal upon such an unknown queue case, so maybe 
that's
the right thing to do (inside lookup_goacc_asyncqueue()), instead of silently 
allowing it to pass.

WDYT?

Chung-Lin


> commit f86b403dbe6ed17afa8d157ec4089ff169a63680
> Author: Thomas Schwinge 
> Date:   Fri Dec 7 12:19:56 2018 +0100
> 
> Don't create an asyncqueue just to then test/synchronize with it
> ---
>  libgomp/oacc-async.c| 12 
>  libgomp/oacc-parallel.c |  4 +++-
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git libgomp/oacc-async.c libgomp/oacc-async.c
> index 553082fe3d4a..c9b134ac3380 100644
> --- libgomp/oacc-async.c
> +++ libgomp/oacc-async.c
> @@ -119,8 +119,11 @@ acc_async_test (int async)
>if (!thr || !thr->dev)
>  gomp_fatal ("no device active");
>  
> -  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> -  return thr->dev->openacc.async.test_func (aq);
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> +  if (!aq)
> +return 1;
> +  else
> +return thr->dev->openacc.async.test_func (aq);
>  }
>  
>  int
> @@ -148,8 +151,9 @@ acc_wait (int async)
>  
>struct goacc_thread *thr = get_goacc_thread ();
>  
> -  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> -  thr->dev->openacc.async.synchronize_func (aq);
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> +  if (aq)
> +thr->dev->openacc.async.synchronize_func (aq);
>  }
>  
>  /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
> diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
> index 7b6a6e515018..0be48f98036f 100644
> --- libgomp/oacc-parallel.c
> +++ libgomp/oacc-parallel.c
> @@ -491,7 +491,9 @@ goacc_wait (int async, int num_waits, va_list *ap)
>  {
>int qid = va_arg (*ap, int);
>
> -  goacc_aq aq = get_goacc_asyncqueue (qid);
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);
> +  if (!aq)
> + continue;
>if (acc_dev->openacc.async.test_func (aq))
>   continue;
>if (async == acc_async_sync)
> 
> 
> Grüße
>  Thomas
> 
> 



Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-12-07 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:10:47 +0800, Chung-Lin Tang  
wrote:
> These are the OpenACC specific changes, mostly the re-implementation of 
> async-related acc_* runtime
> library API functions to use the new backend plugin interfaces, in a 
> non-target specific way.

(The patch was missing the "libgomp/oacc-int.h" changes, but I had no
problem replicating these from openacc-gcc-8-branch.)


Does the following make sense?

commit f86b403dbe6ed17afa8d157ec4089ff169a63680
Author: Thomas Schwinge 
Date:   Fri Dec 7 12:19:56 2018 +0100

Don't create an asyncqueue just to then test/synchronize with it
---
 libgomp/oacc-async.c| 12 
 libgomp/oacc-parallel.c |  4 +++-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 553082fe3d4a..c9b134ac3380 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -119,8 +119,11 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
 gomp_fatal ("no device active");
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  return thr->dev->openacc.async.test_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (!aq)
+return 1;
+  else
+return thr->dev->openacc.async.test_func (aq);
 }
 
 int
@@ -148,8 +151,9 @@ acc_wait (int async)
 
   struct goacc_thread *thr = get_goacc_thread ();
 
-  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
-  thr->dev->openacc.async.synchronize_func (aq);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+  if (aq)
+thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 7b6a6e515018..0be48f98036f 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -491,7 +491,9 @@ goacc_wait (int async, int num_waits, va_list *ap)
 {
   int qid = va_arg (*ap, int);
   
-  goacc_aq aq = get_goacc_asyncqueue (qid);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, false, qid);
+  if (!aq)
+   continue;
   if (acc_dev->openacc.async.test_func (aq))
continue;
   if (async == acc_async_sync)


Grüße
 Thomas


[PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

2018-09-25 Thread Chung-Lin Tang

Hi Thomas,
These are the OpenACC specific changes, mostly the re-implementation of 
async-related acc_* runtime
library API functions to use the new backend plugin interfaces, in a non-target 
specific way.

Thanks,
Chung-Lin


* oacc-async.c (get_goacc_thread): New function.
(get_goacc_thread_device): New function.
(lookup_goacc_asyncqueue): New function.
(get_goacc_asyncqueue): New function.
(acc_async_test): Adjust code to use new async design.
(acc_async_test_all): Likewise.
(acc_wait): Likewise.
(acc_wait_async): Likewise.
(acc_wait_all): Likewise.
(acc_wait_all_async): Likewise.
(acc_get_default_async): New API function.
(acc_set_default_async): Likewise.
(goacc_async_unmap_tgt): New function.
(goacc_async_copyout_unmap_vars): Likewise.
(goacc_async_free): Likewise.
(goacc_init_asyncqueues): Likewise.
(goacc_fini_asyncqueues): Likewise.
* oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
design.
(acc_set_cuda_stream): Likewise.
* oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
(host_openacc_register_async_cleanup): Remove.
(host_openacc_async_exec): New function.
(host_openacc_async_test): Adjust parameters.
(host_openacc_async_test_all): Remove.
(host_openacc_async_wait): Remove.
(host_openacc_async_wait_async): Remove.
(host_openacc_async_wait_all): Remove.
(host_openacc_async_wait_all_async): Remove.
(host_openacc_async_set_async): Remove.
(host_openacc_async_synchronize): New function.
(host_openacc_async_serialize): New function.
(host_openacc_async_host2dev): New function.
(host_openacc_async_dev2host): New function.
(host_openacc_async_queue_callback): New function.
(host_openacc_async_construct): New function.
(host_openacc_async_destruct): New function.
(struct gomp_device_descr host_dispatch): Remove initialization of old
interface, add intialization of new async sub-struct.
* oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
(goacc_attach_host_thread_to_device): Remove old async code usage, add
initialization of per-thread default_async.
* oacc-int.h (struct goacc_thread): Add default_async field.
(goacc_init_asyncqueues): New declaration.
(goacc_fini_asyncqueues): Likewise.
(goacc_async_copyout_unmap_vars): Likewise.
(goacc_async_free): Likewise.
(get_goacc_asyncqueue): Likewise.
(lookup_goacc_asyncqueue): Likewise.

* oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
design.
(present_create_copy): Likewise.
(delete_copyout): Likewise.
(update_dev_host): Likewise.
(gomp_acc_insert_pointer): Add async parameter, adjust code to use new
async design.
(gomp_acc_remove_pointer): Adjust code to use new async design.
* oacc-parallel.c (GOACC_parallel_keyed): Likewise.
(GOACC_enter_exit_data): Likewise.
(goacc_wait): Likewise.
(GOACC_update): Likewise.
* oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index a4e1863..68aaf19 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -27,10 +27,87 @@
.  */
 
 #include 
+#include 
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream.  */
+  if (async == acc_async_noval)
+async = thr->default_async;
+
+  if (async == acc_async_sync)
+return NULL;
+
+  if (async < 0)
+gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  if (!create
+  && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+return NULL;
+
+  gomp_mutex_lock (>openacc.async.lock);
+  if (async >= dev->openacc.async.nasyncqueue)
+{
+  int diff = async + 1 - dev->openacc.async.nasyncqueue;
+  dev->openacc.async.asyncqueue
+   = gomp_realloc (dev->openacc.async.asyncqueue,
+   sizeof (goacc_aq) * (async + 1));
+  memset (dev->openacc.async.asyncqueue +