Nice catch. Keeping the first check only works because the signaling field prevent us from releasing the condition too early. I added some comments around the code (131fe42d).
George. On Wed, Sep 21, 2016 at 5:33 AM, Nathan Hjelm <hje...@me.com> wrote: > Yeah, that looks like a bug to me. We need to keep the check before the > lock but otherwise this is fine and should be fixed in 2.0.2. > > -Nathan > > > On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL <pascal.dev...@atos.net> > wrote: > > > > I encountered a deadlock in sync_wait_mt(). > > > > After investigations, it appears that a first thread executing > wait_sync_update() decrements sync->count just after a second thread in > sync_wait_mt() made the test : > > > > if(sync->count <= 0) > > return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR; > > > > After that, there is a narrow window in which the first thread may call > pthread_cond_signal() before the second thread calls pthread_cond_wait(). > > > > If I protect this test by the sync->lock, this window is closed and the > problem does not reproduce. > > > > To easy reproduce the problem, just add a call to usleep(100) before the > call to pthread_mutex(&sync->lock); > > > > So my proposed patch is: > > > > diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c > > index c9b9137..2f90965 100644 > > --- a/opal/threads/wait_sync.c > > +++ b/opal/threads/wait_sync.c > > @@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL; > > > > int sync_wait_mt(ompi_wait_sync_t *sync) > > { > > - if(sync->count <= 0) > > - return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR; > > - > > /* lock so nobody can signal us during the list updating */ > > pthread_mutex_lock(&sync->lock); > > > > + if(sync->count <= 0) { > > + pthread_mutex_unlock(&sync->lock); > > + return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR; > > + } > > + > > /* Insert sync on the list of pending synchronization constructs */ > > OPAL_THREAD_LOCK(&wait_sync_lock); > > if( NULL == wait_sync_list ) { > > > > For performance reasons, it is also possible to leave the first test > call. So if the request is terminated, we do not spend time to take and > free the lock. > > > > > > > > _______________________________________________ > > devel mailing list > > devel@lists.open-mpi.org > > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel > > _______________________________________________ > devel mailing list > devel@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel >
_______________________________________________ devel mailing list devel@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/devel