I have heard multiple references to pthread_cancel being known to have bad side effects. Can somebody educate me on this topic, please?
Thanks,
George.

On Tue, May 13, 2014 at 10:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
> It could be a bug in the software stack, though I wouldn't count on it.
> Unfortunately, pthread_cancel is known to have bad side effects, and so we
> avoid its use.
>
> The key here is that the thread must detect that the file descriptor has
> closed and exit, or use some other method for detecting that it should
> terminate. We do this in multiple other places in the code, without using
> pthread_cancel and without hanging. So it is certainly doable.
>
> I don't know the specifics of why Nathan's code is having trouble exiting,
> but I suspect that a simple solution - not involving pthread_cancel - can be
> readily developed.
>
>
> On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>>
>> i would like to comment on r31738 :
>>
>>> There is no reason to cancel the listening thread. It should die
>>> automatically when the file descriptor is closed.
>> i could not agree more
>>> It is sufficient to just wait for the thread to exit with pthread join.
>> unfortunatly, at least in my test environment (an outdated MPSS 2.1) it
>> is *not* :-(
>>
>> this is what i described in #4615
>> https://svn.open-mpi.org/trac/ompi/ticket/4615
>> in which i attached scif_hang.c that evidences that (at least in my
>> environment)
>> scif_poll(...) does *not* return after scif_close(...) is closed, and
>> hence the scif pthread never ends.
>>
>> this is likely a bug in MPSS and it might have been fixed in earlier
>> release.
>>
>> Nathan, could you try scif_hang in your environment and report the MPSS
>> version you are running ?
>>
>>
>> bottom line, and once again, in my test environment, pthread_join (...)
>> without pthread_cancel(...)
>> might cause a hang when the btl/scif module is released.
>>
>>
>> assuming the bug is in old MPSS and has been fixed in recent releases,
>> what is the OpenMPI policy ?
>> a) test the MPSS version and call pthread_cancel() or do *not* call
>> pthread_join if buggy MPSS is detected ?
>> b) display an error/warning if a buggy MPSS is detected ?
>> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
>> MPSS, it is in MPI_Finalize() so impact is limited */
>> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
>> problem after all ?
>> e) something else ?
>>
>> Gilles
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/05/14786.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14787.php
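
For reference, here is roughly how I picture the "some other method for detecting that it should terminate" that Ralph mentions in the quoted text. This is only a sketch under my own assumptions, not the btl/scif code: it uses plain POSIX poll() and a wake-up pipe rather than scif_poll(), and struct listener, listener_start() and listener_stop() are hypothetical names I made up for illustration. The thread waits on both its normal descriptor and the wake-up pipe, so the owner can ask it to exit and then call pthread_join without pthread_cancel.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one listener thread (not an Open MPI type). */
struct listener {
    int listen_fd;       /* descriptor the thread normally waits on */
    int wake_fd[2];      /* wake-up pipe: [0] is polled, [1] is written by the owner */
    pthread_t thread;
    int exited_cleanly;  /* set by the thread, read by the owner after pthread_join */
};

static void *listener_main(void *arg)
{
    struct listener *l = arg;
    struct pollfd fds[2] = {
        { .fd = l->listen_fd,  .events = POLLIN },
        { .fd = l->wake_fd[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal, retry */
            break;                      /* unexpected poll failure */
        }
        if (fds[1].revents & (POLLIN | POLLHUP)) {
            l->exited_cleanly = 1;      /* owner asked us to stop */
            break;
        }
        if (fds[0].revents & (POLLHUP | POLLERR | POLLNVAL))
            break;                      /* peer hung up or descriptor error */
        if (fds[0].revents & POLLIN) {
            char buf[64];
            if (read(l->listen_fd, buf, sizeof buf) <= 0)
                break;                  /* end of file or read error */
            /* real code would handle the incoming data here */
        }
    }
    return NULL;
}

/* Create the wake-up pipe and start the thread. Returns 0 on success. */
int listener_start(struct listener *l, int listen_fd)
{
    assert(l != NULL);
    l->listen_fd = listen_fd;
    l->exited_cleanly = 0;
    if (pipe(l->wake_fd) != 0)
        return -1;
    if (pthread_create(&l->thread, NULL, listener_main, l) != 0) {
        close(l->wake_fd[0]);
        close(l->wake_fd[1]);
        return -1;
    }
    return 0;
}

/* Ask the thread to exit, then join it. No pthread_cancel involved. */
int listener_stop(struct listener *l)
{
    char byte = 1;
    ssize_t n;

    assert(l != NULL);
    do {
        n = write(l->wake_fd[1], &byte, 1);
    } while (n < 0 && errno == EINTR);
    if (n != 1)
        return -1;
    if (pthread_join(l->thread, NULL) != 0)
        return -1;
    close(l->wake_fd[0]);
    close(l->wake_fd[1]);
    return 0;
}
```

The point of the wake-up pipe is that the join no longer depends on the blocking call noticing that its own descriptor was closed, which is exactly what Gilles reports not happening with scif_poll() on his MPSS 2.1 setup. Whether something equivalent can be wired into scif_poll() is a separate question that I am not assuming an answer to here.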