Hi!

Ping.

On Mon, 11 Jan 2016 11:39:46 +0100, I wrote:
> Ping.
> 
> On Wed, 23 Dec 2015 12:05:32 +0100, I wrote:
> > Ping.
> > 
> > On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> > > On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iver...@gmail.com> wrote:
> > > > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > > > +/* This function finalizes all initialized devices.  */
> > > > > > +
> > > > > > +static void
> > > > > > +gomp_target_fini (void)
> > > > > > +{
> > > > > > +  [...]
> > > > > 
> > > > > The question is what will this do if there are async target tasks 
> > > > > still
> > > > > running on some of the devices at this point (forgotten #pragma omp 
> > > > > taskwait
> > > > > or similar if target nowait regions are started outside of parallel 
> > > > > region,
> > > > > or exit inside of parallel, etc.  But perhaps it can be handled 
> > > > > incrementally.
> > > > > Also there is the question that the 
> > > > > So I think the patch is ok with the above mentioned changes.
> > > > 
> > > > Here is what I've committed to trunk.
> > > 
> > > > --- a/libgomp/libgomp.h
> > > > +++ b/libgomp/libgomp.h
> > > > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> > > >    } cuda;
> > > >  } acc_dispatch_t;
> > > >  
> > > > +/* Various state of the accelerator device.  */
> > > > +enum gomp_device_state
> > > > +{
> > > > +  GOMP_DEVICE_UNINITIALIZED,
> > > > +  GOMP_DEVICE_INITIALIZED,
> > > > +  GOMP_DEVICE_FINALIZED
> > > > +};
> > > > +
> > > >  /* This structure describes accelerator device.
> > > >     It contains name of the corresponding libgomp plugin, function 
> > > > handlers for
> > > >     interaction with the device, ID-number of the device, and 
> > > > information about
> > > > @@ -933,8 +941,10 @@ struct gomp_device_descr
> > > >    /* Mutex for the mutable data.  */
> > > >    gomp_mutex_t lock;
> > > >  
> > > > -  /* Set to true when device is initialized.  */
> > > > -  bool is_initialized;
> > > > +  /* Current state of the device.  OpenACC allows to move from 
> > > > INITIALIZED state
> > > > +     back to UNINITIALIZED state.  OpenMP allows only to move from 
> > > > INITIALIZED
> > > > +     to FINALIZED state (at program shutdown).  */
> > > > +  enum gomp_device_state state;
> > > 
> > > (ACK, but I assume we'll want to make sure that an OpenACC device is
> > > never re-initialized if we're in/after the libgomp finalization phase.)
> > > 
> > > 
> > > The issue mentioned above: "exit inside of parallel" is actually a
> > > problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> > > libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> > > test cases now run into annoying "WARNING: program timed out".  Here is
> > > what's happening, as I understand it: in
> > > libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> > > returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> > > 
> > > > --- a/libgomp/target.c
> > > > +++ b/libgomp/target.c
> > > 
> > > > +/* This function finalizes all initialized devices.  */
> > > > +
> > > > +static void
> > > > +gomp_target_fini (void)
> > > > +{
> > > > +  int i;
> > > > +  for (i = 0; i < num_devices; i++)
> > > > +    {
> > > > +      struct gomp_device_descr *devicep = &devices[i];
> > > > +      gomp_mutex_lock (&devicep->lock);
> > > > +      if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > > > +       {
> > > > +         devicep->fini_device_func (devicep->target_id);
> > > > +         devicep->state = GOMP_DEVICE_FINALIZED;
> > > > +       }
> > > > +      gomp_mutex_unlock (&devicep->lock);
> > > > +    }
> > > > +}
> > > 
> > > > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> > > >        if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> > > >         goacc_register (&devices[i]);
> > > >      }
> > > > +
> > > > +  if (atexit (gomp_target_fini) != 0)
> > > > +    gomp_fatal ("atexit failed");
> > > >  }
> > > 
> > > Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > > atexit handler, gomp_target_fini, which, with the device lock held, will
> > > call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > > clean up.
> > > 
> > > Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > > context is now in an inconsistent state, see
> > > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>:
> > > 
> > >     CUDA_ERROR_LAUNCH_FAILED = 719
> > >         An exception occurred on the device while executing a
> > >         kernel. Common causes include dereferencing an invalid device
> > >         pointer and accessing out of bounds shared memory. The context
> > >         cannot be used, so it must be destroyed (and a new one should be
> > >         created). All existing device memory allocations from this
> > >         context are invalid and must be reconstructed if the program is
> > >         to continue using CUDA.
> > > 
> > > Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > > also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > > GOMP_PLUGIN_fatal, which again will trigger the same or another
> > > (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> > > trying to lock the device again, which is still locked.
> > > 
> > > (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
> > > errors, should we destroy the context right away, or toggle a boolean
> > > flag to mark it as unusable, and use that as an indication to avoid the
> > > follow-on failures of cuMemFreeHost just described above, for example?)
> > > 
> > > <http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html>
> > > tells us:
> > > 
> > >     Since the behavior is undefined if the exit() function is called more
> > >     than once, portable applications calling atexit() must ensure that the
> > >     exit() function is not called at normal process termination when all
> > >     functions registered by the atexit() function are called.
> > > 
> > > ... which we're violating here, at least in the nvptx plugin.  I have not
> > > analyzed the intermic one.
> > > 
> > > As it happens, Chung-Lin has been working in that area:
> > > <http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>,
> > > which he recently re-posted:
> > > <http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>,
> > > <http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>,
> > > <http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>.
> > > I have not analyzed whether his changes would completely resolve the
> > > problem just described, but at least conceptually they seem to be a step
> > > into the right direction?  (Jakub?)
> > > 
> > > Now, to resolve the immediate problem, what is the right thing for us to
> > > do?  Is the following simple change OK, or is there a reason to still run
> > > atexit handlers if terminating under error conditions?
> > > 
> > > commit b1733e8f9df6ae7d6828e2194df1b314772701c5
> > > Author: Thomas Schwinge <tho...@codesourcery.com>
> > > Date:   Wed Dec 16 13:10:39 2015 +0100
> > > 
> > >     Avoid deadlocks in libgomp due to competing atexit handlers
> > >     
> > >           libgomp/
> > >           * error.c (gomp_vfatal): Call _exit instead of exit.
> > > ---
> > >  libgomp/error.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git libgomp/error.c libgomp/error.c
> > > index 094c24a..1ef7491 100644
> > > --- libgomp/error.c
> > > +++ libgomp/error.c
> > > @@ -34,6 +34,7 @@
> > >  #include <stdarg.h>
> > >  #include <stdio.h>
> > >  #include <stdlib.h>
> > > +#include <unistd.h>
> > >  
> > >  
> > >  #undef gomp_vdebug
> > > @@ -77,7 +78,7 @@ void
> > >  gomp_vfatal (const char *fmt, va_list list)
> > >  {
> > >    gomp_verror (fmt, list);
> > > -  exit (EXIT_FAILURE);
> > > +  _exit (EXIT_FAILURE);
> > >  }
> > >  
> > >  void


Grüße
 Thomas

Attachment: signature.asc
Description: PGP signature

Reply via email to