Re: nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes)

2023-01-13 Thread Thomas Schwinge
Hi!

On 2023-01-13T21:17:43+0800, Chung-Lin Tang  wrote:
> On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
>> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
>> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
>> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
>> (..., which deadlocks); that's generally problematic: per
>> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
>> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".
>
> I remember running into this myself when first creating this async support
> (IIRC in my case it was cuFree()-ing something) yet you've found another 
> mistake here! :)

;-)

>> Given that eventually we must reach a host/device synchronization point
>> (latest when the device is shut down at program termination), and the
>> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
>> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
>> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
>> attached.  OK to push?
>
> I think this patch is fine. Actual approval powers are yours or Tom's :)

ACK.  I'll let it sit for some more time before 'git push'.


>> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
>> error will be caught and reported at the next host/device synchronization
>> point?  But I've not verified that.)
>
> Actually, the CUDA driver API docs are a bit vague on what exactly this
> CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS'
> handling here was basically just generic handling.

I suppose this really is just for the callback's own use: for example, to
skip certain things in the presence of a pre-existing error?

> I am not really sure what is the true right thing to do here (is the
> error still retained by CUDA after the callback completes?)

Indeed the latter is what I do observe:

  GOMP_OFFLOAD_openacc_async_exec: prepare mappings
  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32
  nvptx_exec: kernel main$_omp_fn$0: finished

libgomp: cuMemcpyDtoHAsync_v2 error: an illegal memory access was encountered

libgomp:
libgomp: Copying of dev object [0x7f9a4500..0x7f9a4528) to host object [0x1d89350..0x1d89378) failed
cuda_callback_wrapper error: an illegal memory access was encountered

libgomp: cuStreamDestroy error: an illegal memory access was encountered

libgomp: cuMemFree_v2 error: an illegal memory access was encountered

libgomp: device finalization failed

Here, after the 'async' OpenACC 'parallel' a 'copyout' gets enqueued,
thus 'cuMemcpyDtoHAsync_v2', which is where we first get the device-side
fault reported (all as expected).  Then -- CUDA-internally
multi-threaded, I suppose (thus the mangled printing) -- we print the
'Copying [...] failed' error plus get 'cuda_callback_wrapper' invoked.
This receives the previous 'CUresult' as seen, and then the error is
still visible at device shut-down, as shown by the following reports.
(This makes sense, as the 'CUcontext' does not magically recover.)

Also, per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>,
"In the event of a device error, all subsequently executed callbacks will
receive an appropriate 'CUresult'".

But again: I'm perfectly fine with the repeated error reporting.


Grüße
 Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes)

2023-01-13 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,

On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
> (..., which deadlocks); that's generally problematic: per
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".

I remember running into this myself when first creating this async support
(IIRC in my case it was cuFree()-ing something) yet you've found another 
mistake here! :) 

> Given that eventually we must reach a host/device synchronization point
> (latest when the device is shut down at program termination), and the
> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
> attached.  OK to push?

I think this patch is fine. Actual approval powers are yours or Tom's :)

> 
> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
> error will be caught and reported at the next host/device synchronization
> point?  But I've not verified that.)

Actually, the CUDA driver API docs are a bit vague on what exactly this
CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS'
handling here was basically just generic handling. I am not really sure
what is the true right thing to do here (is the error still retained by
CUDA after the callback completes?)

Chung-Lin



nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes)

2023-01-12 Thread Thomas Schwinge
Hi Chung-Lin, Tom!

It's been a while:

On 2018-09-25T21:11:58+0800, Chung-Lin Tang  wrote:
> [...] NVPTX/CUDA-specific implementation
> of the new-style goacc_asyncqueues.

In an OpenACC 'async' setting, where the device kernel (expectedly)
crashes because of "an illegal memory access was encountered", I'm
running into a deadlock here:

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c

> +static void
> +cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
> +{
> +  if (res != CUDA_SUCCESS)
> +    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
> +  struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
> +  cb->fn (cb->ptr);
> +  free (ptr);
> +}
> +
> +void
> +GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
> +					   void (*callback_fn)(void *),
> +					   void *userptr)
> +{
> +  struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
> +  b->fn = callback_fn;
> +  b->ptr = userptr;
> +  b->aq = aq;
> +  CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
> +		    cuda_callback_wrapper, (void *) b, 0);
> +}

In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
(..., which deadlocks); that's generally problematic: per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>
"'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".

Given that eventually we must reach a host/device synchronization point
(latest when the device is shut down at program termination), and the
non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
"nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
attached.  OK to push?

(Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
error will be caught and reported at the next host/device synchronization
point?  But I've not verified that.)


Grüße
 Thomas


From b7ddcc0807967750e3c884326ed4c53c05cde81f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Thu, 12 Jan 2023 14:39:46 +0100
Subject: [PATCH] nvptx: Avoid deadlock in 'cuStreamAddCallback' callback,
 error case

When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
(..., which may deadlock); that's generally problematic: per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>
"'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".

Given that eventually we must reach a host/device synchronization point
(latest when the device is shut down at program termination), and the
non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error'.

	libgomp/
	* plugin/plugin-nvptx.c (cuda_callback_wrapper): Invoke
	'GOMP_PLUGIN_error' instead of 'GOMP_PLUGIN_fatal'.
---
 libgomp/plugin/plugin-nvptx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 395639537e83..cdb3d435bdc8 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1927,7 +1927,7 @@ static void
 cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
 {
   if (res != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+    GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (res));
   struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
   cb->fn (cb->ptr);
   free (ptr);
-- 
2.39.0



Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes (revised, v3)

2018-12-18 Thread Chung-Lin Tang

On 2018/12/11 9:50 PM, Chung-Lin Tang wrote:

On 2018/12/10 6:02 PM, Chung-Lin Tang wrote:

On 2018/12/7 04:57 AM, Thomas Schwinge wrote:

--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c



+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+  struct goacc_asyncqueue *aq
+= GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+  aq->cuda_stream = NULL;
+  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);


Curiously (this was the same in the code before): does this have to be
"CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
to block anything from running in parallel with "acc_async_sync" GPU
kernels, that use the "NULL" stream?  (Not asking you to change this now,
but I wonder if this is overly strict?)


IIUC, this non-blocking only pertains to the "Legacy Default Stream" in CUDA, 
which we're pretty much ignoring; we should be using the newer per-thread default stream 
model. We could review this issue later.


+  if (aq->cuda_stream == NULL)
+    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");


Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?


Hmm, yeah I think this is superfluous too...


+  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);


Why is the synchronization needed here?


I don't remember, could likely be something added during debug.
I'll remove this and test if things are okay.


I have removed the above seemingly unneeded lines and re-tested, appears okay.
Also, the formerly attached version seemed for some reason to have had many
conflicts with older code, all resolved in this v2 nvptx part.


GOMP_OFFLOAD_openacc_async_construct is updated to return NULL for failure,
there's also some adjustments in oacc-async.c, coming next.

Chung-Lin



diff -ru trunk-orig/libgomp/plugin/plugin-nvptx.c trunk-work/libgomp/plugin/plugin-nvptx.c
--- trunk-orig/libgomp/plugin/plugin-nvptx.c	2018-12-18 18:16:57.804871502 +0800
+++ trunk-work/libgomp/plugin/plugin-nvptx.c	2018-12-18 22:07:43.483068743 +0800
@@ -1364,16 +1364,12 @@
 struct goacc_asyncqueue *
 GOMP_OFFLOAD_openacc_async_construct (void)
 {
+  CUstream stream = NULL;
+  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, CU_STREAM_DEFAULT);
+
   struct goacc_asyncqueue *aq
 = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
-  aq->cuda_stream = NULL;
-  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
-  if (aq->cuda_stream == NULL)
-    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");
-
-  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
-
-
+  aq->cuda_stream = stream;
   return aq;
 }
 
Index: libgomp/plugin/cuda/cuda.h
===
--- libgomp/plugin/cuda/cuda.h  (revision 267226)
+++ libgomp/plugin/cuda/cuda.h  (working copy)
@@ -54,7 +54,11 @@ typedef enum {
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_NOT_FOUND = 500,
   CUDA_ERROR_NOT_READY = 600,
-  CUDA_ERROR_LAUNCH_FAILED = 719
+  CUDA_ERROR_LAUNCH_FAILED = 719,
+  CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720,
+  CUDA_ERROR_NOT_PERMITTED = 800,
+  CUDA_ERROR_NOT_SUPPORTED = 801,
+  CUDA_ERROR_UNKNOWN = 999
 } CUresult;
 
 typedef enum {
@@ -173,6 +177,8 @@ CUresult cuModuleLoadData (CUmodule *, const void
 CUresult cuModuleUnload (CUmodule);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
  CUoccupancyB2DSize, size_t, int);
+typedef void (*CUstreamCallback)(CUstream, CUresult, void *);
+CUresult cuStreamAddCallback(CUstream, CUstreamCallback, void *, unsigned int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
Index: libgomp/plugin/cuda-lib.def
===
--- libgomp/plugin/cuda-lib.def (revision 267226)
+++ libgomp/plugin/cuda-lib.def (working copy)
@@ -42,6 +42,7 @@ CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
 CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
+CUDA_ONE_CALL (cuStreamAddCallback)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
Index: libgomp/plugin/plugin-nvptx.c
===
--- libgomp/plugin/plugin-nvptx.c   (revision 267226)
+++ libgomp/plugin/plugin-nvptx.c   (working copy)
@@ -192,21 +192,18 @@ cuda_error (CUresult r)
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
-struct cuda_map
+/* NVPTX/CUDA specific definition of asynchronous queues.  */
+struct goacc_asyncqueue
 {
-  CUdeviceptr d;
-  size_t size;
-  bool active;
-  struct cuda_map *next;
+  CUstream cuda_stream;
 };
 
-struct ptx_stream
+struct nvptx_callback
 {
-  CUstream stream;
-  pthread_t host_thread;
-  bool multithreaded;
-  struct cuda_map 

Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes (revised, v2)

2018-12-11 Thread Chung-Lin Tang

On 2018/12/10 6:02 PM, Chung-Lin Tang wrote:

On 2018/12/7 04:57 AM, Thomas Schwinge wrote:

--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c



+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+  struct goacc_asyncqueue *aq
+    = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+  aq->cuda_stream = NULL;
+  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);


Curiously (this was the same in the code before): does this have to be
"CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
to block anything from running in parallel with "acc_async_sync" GPU
kernels, that use the "NULL" stream?  (Not asking you to change this now,
but I wonder if this is overly strict?)


IIUC, this non-blocking only pertains to the "Legacy Default Stream" in CUDA, 
which we're pretty much ignoring; we should be using the newer per-thread default stream 
model. We could review this issue later.


+  if (aq->cuda_stream == NULL)
+    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");


Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?


Hmm, yeah I think this is superfluous too...


+  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);


Why is the synchronization needed here?


I don't remember, could likely be something added during debug.
I'll remove this and test if things are okay.


I have removed the above seemingly unneeded lines and re-tested, appears okay.
Also, the formerly attached version seemed for some reason to have had many
conflicts with older code, all resolved in this v2 nvptx part.

Thanks,
Chung-Lin
Index: libgomp/plugin/cuda/cuda.h
===
--- libgomp/plugin/cuda/cuda.h  (revision 266973)
+++ libgomp/plugin/cuda/cuda.h  (working copy)
@@ -54,7 +54,11 @@ typedef enum {
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_NOT_FOUND = 500,
   CUDA_ERROR_NOT_READY = 600,
-  CUDA_ERROR_LAUNCH_FAILED = 719
+  CUDA_ERROR_LAUNCH_FAILED = 719,
+  CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720,
+  CUDA_ERROR_NOT_PERMITTED = 800,
+  CUDA_ERROR_NOT_SUPPORTED = 801,
+  CUDA_ERROR_UNKNOWN = 999
 } CUresult;
 
 typedef enum {
@@ -173,6 +177,8 @@ CUresult cuModuleLoadData (CUmodule *, const void
 CUresult cuModuleUnload (CUmodule);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
  CUoccupancyB2DSize, size_t, int);
+typedef void (*CUstreamCallback)(CUstream, CUresult, void *);
+CUresult cuStreamAddCallback(CUstream, CUstreamCallback, void *, unsigned int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
Index: libgomp/plugin/cuda-lib.def
===
--- libgomp/plugin/cuda-lib.def (revision 266973)
+++ libgomp/plugin/cuda-lib.def (working copy)
@@ -42,6 +42,7 @@ CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
 CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
+CUDA_ONE_CALL (cuStreamAddCallback)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
Index: libgomp/plugin/plugin-nvptx.c
===
--- libgomp/plugin/plugin-nvptx.c   (revision 266973)
+++ libgomp/plugin/plugin-nvptx.c   (working copy)
@@ -192,21 +192,18 @@ cuda_error (CUresult r)
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
-struct cuda_map
+/* NVPTX/CUDA specific definition of asynchronous queues.  */
+struct goacc_asyncqueue
 {
-  CUdeviceptr d;
-  size_t size;
-  bool active;
-  struct cuda_map *next;
+  CUstream cuda_stream;
 };
 
-struct ptx_stream
+struct nvptx_callback
 {
-  CUstream stream;
-  pthread_t host_thread;
-  bool multithreaded;
-  struct cuda_map *map;
-  struct ptx_stream *next;
+  void (*fn) (void *);
+  void *ptr;
+  struct goacc_asyncqueue *aq;
+  struct nvptx_callback *next;
 };
 
 /* Thread-specific data for PTX.  */
@@ -213,120 +210,13 @@ static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTE
 
 struct nvptx_thread
 {
-  struct ptx_stream *current_stream;
+  /* We currently have this embedded inside the plugin because libgomp manages
+ devices through integer target_ids.  This might be better if using an
+ opaque target-specific pointer directly from gomp_device_descr.  */
   struct ptx_device *ptx_dev;
 };
 
-static struct cuda_map *
-cuda_map_create (size_t size)
-{
-  struct cuda_map *map = GOMP_PLUGIN_malloc (sizeof (struct cuda_map));
 
-  assert (map);
-
-  map->next = NULL;
-  map->size = size;
-  map->active = false;
-
-  CUDA_CALL_ERET (NULL, cuMemAlloc, &map->d, size);
-  assert (map->d);
-
-  return map;
-}
-
-static void
-cuda_map_destroy (struct cuda_map *map)
-{
-  CUDA_CALL_ASSERT (cuMemFree, map->d);
-  free (map);
-}
-
-/* The 

Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

2018-12-10 Thread Chung-Lin Tang
On 2018/12/7 04:57 AM, Thomas Schwinge wrote:

--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c



+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+  struct goacc_asyncqueue *aq
+= GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+  aq->cuda_stream = NULL;
+  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);


Curiously (this was the same in the code before): does this have to be
"CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
to block anything from running in parallel with "acc_async_sync" GPU
kernels, that use the "NULL" stream?  (Not asking you to change this now,
but I wonder if this is overly strict?)


IIUC, this non-blocking only pertains to the "Legacy Default Stream" in 
CUDA, which we're pretty much ignoring; we should be using the newer 
per-thread default stream model. We could review this issue later.



+  if (aq->cuda_stream == NULL)
+    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");


Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?


Hmm, yeah I think this is superfluous too...


+  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);


Why is the synchronization needed here?


I don't remember, could likely be something added during debug.
I'll remove this and test if things are okay.

Thanks,
Chung-Lin


Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

2018-12-06 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:11:58 +0800, Chung-Lin Tang  wrote:
> Hi Tom,
> this patch removes large portions of plugin/plugin-nvptx.c, since a lot
> of it is now in oacc-async.c now. The new code is essentially a
> NVPTX/CUDA-specific implementation of the new-style goacc_asyncqueues.

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c

> +struct goacc_asyncqueue *
> +GOMP_OFFLOAD_openacc_async_construct (void)
> +{
> +  struct goacc_asyncqueue *aq
> += GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
> +  aq->cuda_stream = NULL;
> +  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);

Curiously (this was the same in the code before): does this have to be
"CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
to block anything from running in parallel with "acc_async_sync" GPU
kernels, that use the "NULL" stream?  (Not asking you to change this now,
but I wonder if this is overly strict?)

> +  if (aq->cuda_stream == NULL)
> +    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");

Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?

> +  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);

Why is the synchronization needed here?

> +  return aq;
> +}


Grüße
 Thomas


Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

2018-10-05 Thread Tom de Vries
On 9/25/18 3:11 PM, Chung-Lin Tang wrote:
> Hi Tom,
> this patch removes large portions of plugin/plugin-nvptx.c, since a lot
> of it is
> now in oacc-async.c now.

Yay!

> The new code is essentially a
> NVPTX/CUDA-specific implementation
> of the new-style goacc_asyncqueues.
> 
> Also, some needed functions in cuda-lib.def are added. The cuda.h file
> has also been updated to build independently without a CUDA installation.
> 

I see these formatting issues:
...
$ check_GNU_style.sh async-06.nvptx.patch

There should be exactly one space between function name and parenthesis.
35:+CUresult cuStreamAddCallback(CUstream, CUstreamCallback, void *,
unsigned int);

Trailing operator.
1320:+  struct nvptx_thread *nvthd =
...

Otherwise, OK.

Thanks,
- Tom


> Thanks,
> Chung-Lin
> 
> * plugin/plugin-nvptx.c (struct cuda_map): Remove.
> (struct ptx_stream): Remove.
> (struct nvptx_thread): Remove current_stream field.
> (cuda_map_create): Remove.
> (cuda_map_destroy): Remove.
> (map_init): Remove.
> (map_fini): Remove.
> (map_pop): Remove.
> (map_push): Remove.
> (struct goacc_asyncqueue): Define.
> (struct nvptx_callback): Define.
> (struct ptx_free_block): Define.
> (struct ptx_device): Remove null_stream, active_streams, async_streams,
> stream_lock, and next fields.
> (enum ptx_event_type): Remove.
> (struct ptx_event): Remove.
> (ptx_event_lock): Remove.
> (ptx_events): Remove.
> (init_streams_for_device): Remove.
> (fini_streams_for_device): Remove.
> (select_stream_for_async): Remove.
> (nvptx_init): Remove ptx_events and ptx_event_lock references.
> (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
> case.
> (nvptx_open_device): Add free_blocks initialization, remove
> init_streams_for_device call.
> (nvptx_close_device): Remove fini_streams_for_device call, add
> free_blocks destruct code.
> (event_gc): Remove.
> (event_add): Remove.
> (nvptx_exec): Adjust parameters and code.
> (nvptx_free): Likewise.
> (nvptx_host2dev): Remove.
> (nvptx_dev2host): Remove.
> (nvptx_set_async): Remove.
> (nvptx_async_test): Remove.
> (nvptx_async_test_all): Remove.
> (nvptx_wait): Remove.
> (nvptx_wait_async): Remove.
> (nvptx_wait_all): Remove.
> (nvptx_wait_all_async): Remove.
> (nvptx_get_cuda_stream): Remove.
> (nvptx_set_cuda_stream): Remove.
> (GOMP_OFFLOAD_alloc): Adjust code.
> (GOMP_OFFLOAD_free): Likewise.
> (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
> (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
> (GOMP_OFFLOAD_openacc_async_test_all): Remove.
> (GOMP_OFFLOAD_openacc_async_wait): Remove.
> (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
> (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
> (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
> (GOMP_OFFLOAD_openacc_async_set_async): Remove.
> (cuda_free_argmem): New function.
> (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
> (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
> (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
> (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
> (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
> (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
> (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
> (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
> (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
> (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
> (cuda_callback_wrapper): New function.
> (cuda_memcpy_sanity_check): New function.
> (GOMP_OFFLOAD_host2dev): Remove and re-implement.
> (GOMP_OFFLOAD_dev2host): Remove and re-implement.
> (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
> (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.


[PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

2018-09-25 Thread Chung-Lin Tang

Hi Tom,
this patch removes large portions of plugin/plugin-nvptx.c, since a lot of it
is now in oacc-async.c now. The new code is essentially a NVPTX/CUDA-specific
implementation of the new-style goacc_asyncqueues.

Also, some needed functions in cuda-lib.def are added. The cuda.h file has
also been updated to build independently without a CUDA installation.

Thanks,
Chung-Lin

* plugin/plugin-nvptx.c (struct cuda_map): Remove.
(struct ptx_stream): Remove.
(struct nvptx_thread): Remove current_stream field.
(cuda_map_create): Remove.
(cuda_map_destroy): Remove.
(map_init): Remove.
(map_fini): Remove.
(map_pop): Remove.
(map_push): Remove.
(struct goacc_asyncqueue): Define.
(struct nvptx_callback): Define.
(struct ptx_free_block): Define.
(struct ptx_device): Remove null_stream, active_streams, async_streams,
stream_lock, and next fields.
(enum ptx_event_type): Remove.
(struct ptx_event): Remove.
(ptx_event_lock): Remove.
(ptx_events): Remove.
(init_streams_for_device): Remove.
(fini_streams_for_device): Remove.
(select_stream_for_async): Remove.
(nvptx_init): Remove ptx_events and ptx_event_lock references.
(nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
case.
(nvptx_open_device): Add free_blocks initialization, remove
init_streams_for_device call.
(nvptx_close_device): Remove fini_streams_for_device call, add
free_blocks destruct code.
(event_gc): Remove.
(event_add): Remove.
(nvptx_exec): Adjust parameters and code.
(nvptx_free): Likewise.
(nvptx_host2dev): Remove.
(nvptx_dev2host): Remove.
(nvptx_set_async): Remove.
(nvptx_async_test): Remove.
(nvptx_async_test_all): Remove.
(nvptx_wait): Remove.
(nvptx_wait_async): Remove.
(nvptx_wait_all): Remove.
(nvptx_wait_all_async): Remove.
(nvptx_get_cuda_stream): Remove.
(nvptx_set_cuda_stream): Remove.
(GOMP_OFFLOAD_alloc): Adjust code.
(GOMP_OFFLOAD_free): Likewise.
(GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
(GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
(GOMP_OFFLOAD_openacc_async_test_all): Remove.
(GOMP_OFFLOAD_openacc_async_wait): Remove.
(GOMP_OFFLOAD_openacc_async_wait_async): Remove.
(GOMP_OFFLOAD_openacc_async_wait_all): Remove.
(GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
(GOMP_OFFLOAD_openacc_async_set_async): Remove.
(cuda_free_argmem): New function.
(GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
(GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
(GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
(GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
(GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
(GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
(cuda_callback_wrapper): New function.
(cuda_memcpy_sanity_check): New function.
(GOMP_OFFLOAD_host2dev): Remove and re-implement.
(GOMP_OFFLOAD_dev2host): Remove and re-implement.
(GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index b2a4c21..a16badc 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -42,6 +42,7 @@ CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
 CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
+CUDA_ONE_CALL (cuStreamAddCallback)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index b4c1b29..326db54 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -54,7 +54,11 @@ typedef enum {
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_NOT_FOUND = 500,
   CUDA_ERROR_NOT_READY = 600,
-  CUDA_ERROR_LAUNCH_FAILED = 719
+  CUDA_ERROR_LAUNCH_FAILED = 719,
+  CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720,
+  CUDA_ERROR_NOT_PERMITTED = 800,
+  CUDA_ERROR_NOT_SUPPORTED = 801,
+  CUDA_ERROR_UNKNOWN = 999
 } CUresult;
 
 typedef enum {
@@ -173,6 +177,8 @@ CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,