Re: Too strict synchronization with the local (host) thread?

2018-12-14 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 11 Dec 2018 21:30:31 +0800, Chung-Lin Tang wrote:
> On 2018/12/7 11:56 PM, Thomas Schwinge wrote:
> >> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> >> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> >> @@ -114,6 +114,7 @@ main (int argc, char **argv)
> >>  
> >>    for (i = 0; i < N; i++)
> >>      {
> >> +      stream = (CUstream) acc_get_cuda_stream (i & 1);
> >>        r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);
> > What's the motivation for this change?
> 
> To place work on both streams 0 and 1.

That's describing what it does, not the motivation behind it.  ;-)


> > ..., and this change are needed because we're now more strictly
> > synchronizing with the local (host) thread.
> > 
> > Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
> > present:
> > 
> >  [...]
> >   for (i = 0; i < N; i++)
> >     {
> >       r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
> >       if (r != CUDA_SUCCESS)
> >         {
> >           fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
> >           abort ();
> >         }
> >     }
> > 
> > This launches N kernels on N separate async queues/CUDA streams, [0..N).
> > 
> >   acc_wait_all_async (N);
> > 
> > Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
> > *not* synchronize with the local (host) thread, but instead just set up
> > the additional async queue/CUDA stream N to "depend" on [0..N).
> > 
> >   for (i = 0; i <= N; i++)
> >     {
> >       if (acc_async_test (i) != 0)
> >         abort ();
> >     }
> > 
> > Thus, for all [0..N), "acc_async_test (i)" should then still return 0
> > (still running).
> > 
> >   acc_wait (N);
> > 
> > Here, the "acc_wait (N)" would synchronize the local (host) thread with
> > async queue/CUDA stream N and thus recursively with [0..N).
> > 
> >   for (i = 0; i <= N; i++)
> >     {
> >       if (acc_async_test (i) != 1)
> >         abort ();
> >     }
> >  [...]
> > 
> > So, then all these async queues/CUDA streams here indeed have
> > "acc_async_test (i)" return 1, that is, they are idle.
> > 
> > 
> > Now, the more strict synchronization with the local (host) thread is not
> > wrong in terms of correctness, but I suppose it will impact the performance
> > of otherwise asynchronous operations, which now get synchronized too much?
> > 
> > Or, of course, I'm misunderstanding something...
> 
> IIRC, we encountered many issues where people misunderstood the meaning of
> "wait+async", using it as if the local host synchronization happened, whereas
> in our original implementation it does not.

..., and that's the right thing, in my opinion.  (Do you disagree?)

> Also, some areas of the OpenACC spec were vague on whether the local host
> synchronization should or should not happen; basically, the wording treated it
> as if it were only an implementation detail that didn't matter, and didn't
> acknowledge that this would be something visible to the user.

I suppose that in code which correctly uses a different mechanism for
inter-thread synchronization, it shouldn't be visible?  (Well, with the
additional synchronization, it would be visible in terms of performance
degradation.)

For example, OpenACC 2.6, 3.2.11. "acc_wait" explicitly states that "If
two or more threads share the same accelerator, the 'acc_wait' routine
will return only if all matching asynchronous operations initiated by
this thread have completed; there is no guarantee that all matching
asynchronous operations initiated by other threads have completed".

I agree that this could be made more explicit throughout the specification,
and also the reading of OpenACC 2.6, 2.16.1. "async clause" is a bit
confusing regarding multiple host threads, but as I understand it, the idea
still is that such wait operations do not synchronize at the host thread
level.  (Let's please assume that, and then work with the OpenACC
technical committee to get that clarified in the documentation.)
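
For illustration, here is a minimal sketch of that per-thread reading (mine,
not from the spec; the "worker" scaffolding and queue numbers are made up):

#include <openacc.h>
#include <pthread.h>

/* Each host thread enqueues work on its own async queue and then waits on
   it.  Per OpenACC 2.6, 3.2.11, each thread's "acc_wait" only covers the
   asynchronous operations that this thread itself enqueued.  */
static void *
worker (void *arg)
{
  int queue = *(int *) arg;
  int x = 0;

#pragma acc parallel async (queue) copy (x)
  x = 1;

  /* Guarantees completion of *this* thread's operations on 'queue' only;
     it says nothing about what the other thread enqueued.  */
  acc_wait (queue);
  return NULL;
}

int
main (void)
{
  pthread_t t0, t1;
  int q0 = 0, q1 = 1;
  pthread_create (&t0, NULL, worker, &q0);
  pthread_create (&t1, NULL, worker, &q1);
  pthread_join (t0, NULL);
  pthread_join (t1, NULL);
  return 0;
}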

> In the end, IIRC, I decided that adding a local host synchronization would be
> easier for all of us,

Well...

> and took the opportunity of the re-org to make this change.

Well...  Again, a re-org/re-work should not make such functional
changes...

> That said, I didn't notice those tests you listed above were meant to test 
> such delicate behavior.
> 
> > (For avoidance of doubt, I would accept the "async re-work" as is, but we
> > should eventually clarify this, and restore the behavior we -- apparently
> > -- had before, where we didn't synchronize so much?  (So, technically,
> > the "async re-work" would constitute a regression for this kind of
> > usage?))
> 
> It's not hard to restore the old behavior -- just a few lines to delete --
> although, as described above, this change was deliberate.
> 
> This might be another issue to raise with the committee. I think I tried on
> this exact issue a long time ago, but never got answers.

Re: Too strict synchronization with the local (host) thread?

2018-12-11 Thread Chung-Lin Tang

On 2018/12/7 11:56 PM, Thomas Schwinge wrote:

--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
@@ -114,6 +114,7 @@ main (int argc, char **argv)
 
   for (i = 0; i < N; i++)
     {
+      stream = (CUstream) acc_get_cuda_stream (i & 1);
       r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);

What's the motivation for this change?


To place work on both streams 0 and 1.


..., and this change are needed because we're now more strictly
synchronizing with the local (host) thread.

Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
present:

 [...]
  for (i = 0; i < N; i++)
    {
      r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
      if (r != CUDA_SUCCESS)
        {
          fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
          abort ();
        }
    }

This launches N kernels on N separate async queues/CUDA streams, [0..N).

   acc_wait_all_async (N);

Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
*not* synchronize with the local (host) thread, but instead just set up
the additional async queue/CUDA stream N to "depend" on [0..N).

  for (i = 0; i <= N; i++)
    {
      if (acc_async_test (i) != 0)
        abort ();
    }

Thus, for all [0..N), "acc_async_test (i)" should then still return 0
(still running).

   acc_wait (N);

Here, the "acc_wait (N)" would synchronize the local (host) thread with
async queue/CUDA stream N and thus recursively with [0..N).

  for (i = 0; i <= N; i++)
    {
      if (acc_async_test (i) != 1)
        abort ();
    }
 [...]

So, then all these async queues/CUDA streams here indeed have
"acc_async_test (i)" return 1, that is, they are idle.


Now, the more strict synchronization with the local (host) thread is not
wrong in terms of correctness, but I suppose it will impact the performance
of otherwise asynchronous operations, which now get synchronized too much?

Or, of course, I'm misunderstanding something...


IIRC, we encountered many issues where people misunderstood the meaning of
"wait+async", using it as if the local host synchronization happened, whereas
in our original implementation it does not.

Also, some areas of the OpenACC spec were vague on whether the local host
synchronization should or should not happen; basically, the wording treated it
as if it were only an implementation detail that didn't matter, and didn't
acknowledge that this would be something visible to the user.
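
The user-visible difference boils down to cases like the following minimal
sketch (mine, for illustration only): under the "no host sync" reading the
final test can still report queue 0 as busy, while with an added host
synchronization it cannot.

#include <openacc.h>
#include <stdio.h>

int
main (void)
{
  int x = 0;

  /* Enqueue work on async queue 0.  */
#pragma acc parallel async (0) copy (x)
  x = 1;

  /* Make queue 1 depend on queue 0.  Under the "no host sync" reading this
     returns immediately; with an implicit host synchronization it also
     drains queue 0 before returning.  */
  acc_wait_async (0, 1);

  /* Without the host sync, this may still report queue 0 as busy (zero);
     with it, it always reports idle (nonzero).  */
  printf ("queue 0 idle: %d\n", acc_async_test (0));

  acc_wait (1);
  return 0;
}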

In the end, IIRC, I decided that adding a local host synchronization would be
easier for all of us,
and took the opportunity of the re-org to make this change.

That said, I didn't notice those tests you listed above were meant to test such 
delicate behavior.


(For avoidance of doubt, I would accept the "async re-work" as is, but we
should eventually clarify this, and restore the behavior we -- apparently
-- had before, where we didn't synchronize so much?  (So, technically,
the "async re-work" would constitute a regression for this kind of
usage?))


It's not hard to restore the old behavior -- just a few lines to delete --
although, as described above, this change was deliberate.

This might be another issue to raise with the committee. I think I tried on
this exact issue a long time ago, but never got answers.

Thanks,
Chung-Lin


Too strict synchronization with the local (host) thread? (was: [PATCH 5/6, OpenACC, libgomp] Async re-work, C/C++ testsuite changes)

2018-12-07 Thread Thomas Schwinge
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:11:42 +0800, Chung-Lin Tang wrote:
> These are the testsuite/libgomp.oacc-c-c++-common/* changes.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> @@ -114,6 +114,7 @@ main (int argc, char **argv)
>  
>    for (i = 0; i < N; i++)
>      {
> +      stream = (CUstream) acc_get_cuda_stream (i & 1);
>        r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);

What's the motivation for this change?

And then:

> @@ -122,11 +123,11 @@ main (int argc, char **argv)
>         }
>      }
>  
> -  acc_wait_async (0, 1);
> -
>    if (acc_async_test (0) != 0)
>      abort ();
>  
> +  acc_wait_async (0, 1);
> +
>    if (acc_async_test (1) != 0)
>      abort ();

I somehow feel that this change...

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> @@ -133,7 +133,7 @@ main (int argc, char **argv)
>  
>    for (i = 0; i <= N; i++)
>      {
> -      if (acc_async_test (i) != 0)
> +      if (acc_async_test (i) == 0)
>          abort ();
>      }

..., and this change are needed because we're now more strictly
synchronizing with the local (host) thread.

Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
present:

[...]
  for (i = 0; i < N; i++)
    {
      r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
      if (r != CUDA_SUCCESS)
        {
          fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
          abort ();
        }
    }

This launches N kernels on N separate async queues/CUDA streams, [0..N).

  acc_wait_all_async (N);

Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
*not* synchronize with the local (host) thread, but instead just set up
the additional async queue/CUDA stream N to "depend" on [0..N).
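
For reference, a minimal sketch (mine, not libgomp's actual implementation)
of how such a device-side dependency can be expressed with the CUDA driver
API, without blocking the host:

/* Make 'waiter' depend on streams[0..n) entirely on the device; the host
   thread returns immediately.  Error checking omitted for brevity.  */
static void
wait_all_async_on_device (CUstream *streams, int n, CUstream waiter)
{
  int i;
  for (i = 0; i < n; i++)
    {
      CUevent e;
      cuEventCreate (&e, CU_EVENT_DISABLE_TIMING);
      /* Mark the current tail of stream i...  */
      cuEventRecord (e, streams[i]);
      /* ... and have 'waiter' wait for it, on the device.  */
      cuStreamWaitEvent (waiter, e, 0);
      /* Safe here: the event's resources are released automatically once
         the device has completed it.  */
      cuEventDestroy (e);
    }
  /* Note: no cuStreamSynchronize -- the host thread is not blocked.  */
}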

  for (i = 0; i <= N; i++)
    {
      if (acc_async_test (i) != 0)
        abort ();
    }

Thus, for all [0..N), "acc_async_test (i)" should then still return 0
(still running).
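
On the CUDA driver API level, "acc_async_test" corresponds to a
non-blocking stream query; a sketch, assuming "streams[i]" is the CUstream
backing queue i:

  CUresult r = cuStreamQuery (streams[i]);
  /* CUDA_ERROR_NOT_READY: work still pending, that is, still running;
     CUDA_SUCCESS: the stream has drained, that is, idle.  */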

  acc_wait (N);

Here, the "acc_wait (N)" would synchronize the local (host) thread with
async queue/CUDA stream N and thus recursively with [0..N).
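
Again as a sketch, that host-blocking step corresponds to something like:

  /* Block the calling host thread until stream N has drained; streams
     [0..N) complete first, because stream N depends on all of them.  */
  cuStreamSynchronize (streams[N]);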

  for (i = 0; i <= N; i++)
    {
      if (acc_async_test (i) != 1)
        abort ();
    }
[...]

So, then all these async queues/CUDA streams here indeed have
"acc_async_test (i)" return 1, that is, they are idle.


Now, the more strict synchronization with the local (host) thread is not
wrong in terms of correctness, but I suppose it will impact the performance
of otherwise asynchronous operations, which now get synchronized too much?

Or, of course, I'm misunderstanding something...

(For avoidance of doubt, I would accept the "async re-work" as is, but we
should eventually clarify this, and restore the behavior we -- apparently
-- had before, where we didn't synchronize so much?  (So, technically,
the "async re-work" would constitute a regression for this kind of
usage?))


Regards
 Thomas