Nice catch. Keeping the first check only works because the signaling field
prevent us from releasing the condition too early. I added some comments
around the code (131fe42d).

  George.

On Wed, Sep 21, 2016 at 5:33 AM, Nathan Hjelm <hje...@me.com> wrote:

> Yeah, that looks like a bug to me. We need to keep the check before the
> lock but otherwise this is fine and should be fixed in 2.0.2.
>
> -Nathan
>
> > On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL <pascal.dev...@atos.net>
> wrote:
> >
> > I encountered a deadlock in sync_wait_mt().
> >
> > After investigations, it appears that a first thread executing
> wait_sync_update() decrements sync->count just after a second thread in
> sync_wait_mt() made the test :
> >
> >     if(sync->count <= 0)
> >         return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> >
> > After that, there is a narrow window in which the first thread may call
> pthread_cond_signal() before the second thread calls pthread_cond_wait().
> >
> > If I protect this test by the sync->lock, this window is closed and the
> problem does not reproduce.
> >
> > To easy reproduce the problem, just add a call to usleep(100) before the
> call to pthread_mutex(&sync->lock);
> >
> > So my proposed patch is:
> >
> > diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c
> > index c9b9137..2f90965 100644
> > --- a/opal/threads/wait_sync.c
> > +++ b/opal/threads/wait_sync.c
> > @@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL;
> >
> > int sync_wait_mt(ompi_wait_sync_t *sync)
> > {
> > -    if(sync->count <= 0)
> > -        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> > -
> >      /* lock so nobody can signal us during the list updating */
> >      pthread_mutex_lock(&sync->lock);
> >
> > +    if(sync->count <= 0) {
> > +        pthread_mutex_unlock(&sync->lock);
> > +        return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> > +    }
> > +
> >      /* Insert sync on the list of pending synchronization constructs */
> >      OPAL_THREAD_LOCK(&wait_sync_lock);
> >      if( NULL == wait_sync_list ) {
> >
> > For performance reasons, it is also possible to leave the first test
> call. So if the request is terminated, we do not spend time to take and
> free the lock.
> >
> >
> >
> > _______________________________________________
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Reply via email to