+1 - I've seen it before, and you'll find warnings about this problem on many software sites. It is easy for the main program to segfault by touching the wrong thing after a cancel, unless every library involved handles cancellation correctly.
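
To make the hazard concrete, here is a minimal sketch of the "extra work" a thread needs in order to survive cancellation without leaking. It is my own illustration, not Open MPI code: the names reader, free_buffer and cancel_demo are hypothetical, and it assumes the default deferred cancellation type, where a pending cancel is acted on only at a cancellation point such as read().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical names throughout; this is a sketch, not Open MPI code. */
static int pipe_fds[2];
static int cleanup_ran;   /* written by the handler, read after pthread_join() */

/* Cleanup handler: runs if the thread is cancelled while the handler is pushed. */
static void free_buffer(void *arg)
{
    assert(arg != NULL);
    free(arg);
    cleanup_ran = 1;
}

static void *reader(void *unused)
{
    (void)unused;
    char *buf = malloc(4096);
    if (buf == NULL)
        return NULL;

    /* Without this push/pop pair, a cancel delivered inside read()
     * would end the thread and leak buf. */
    pthread_cleanup_push(free_buffer, buf);
    for (;;) {
        /* read() is a cancellation point: the thread may never return from it. */
        if (read(pipe_fds[0], buf, 4096) <= 0)
            break;
    }
    pthread_cleanup_pop(1);   /* normal exit path: run the handler as well */
    return NULL;
}

/* Returns 0 if the thread was cancelled and its handler freed the buffer. */
int cancel_demo(void)
{
    pthread_t tid;
    void *result = NULL;

    cleanup_ran = 0;
    if (pipe(pipe_fds) != 0)
        return -1;
    if (pthread_create(&tid, NULL, reader, NULL) != 0)
        return -1;

    pthread_cancel(tid);      /* nothing is ever written, so read() blocks until this */
    if (pthread_join(tid, &result) != 0)
        return -1;

    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return (result == PTHREAD_CANCELED && cleanup_ran) ? 0 : 1;
}
```

Calling cancel_demo() from main() (built with -pthread) should return 0. The catch is that every allocation, lock and descriptor held across any cancellation point needs the same treatment, in our code and in every library we call, and we cannot verify that for code we did not write.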
On May 13, 2014, at 7:56 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> George,
>
> Just my USD0.02:
>
> With pthreads many system calls (mostly those that might block) become
> "cancellation points" where the implementation checks if the calling thread
> has been cancelled.
> This means that a thread making any of those calls may simply never return
> (calling pthread_exit() internally), unless extra work has been done to
> prevent this default behavior.
> This makes it very hard to write code that properly cleans up its resources,
> including (but not limited to) file descriptors and malloc()ed memory.
> Even if Open MPI is written very carefully, one cannot assume that all the
> libraries it calls (and their dependencies, etc.) are written to properly
> deal with cancellation.
>
> -Paul
>
>
> On Tue, May 13, 2014 at 7:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> I heard multiple references to pthread_cancel being known to have bad
> side effects. Can somebody educate me on this topic please?
>
> Thanks,
> George.
>
>
>
> On Tue, May 13, 2014 at 10:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > It could be a bug in the software stack, though I wouldn't count on it.
> > Unfortunately, pthread_cancel is known to have bad side effects, and so we
> > avoid its use.
> >
> > The key here is that the thread must detect that the file descriptor has
> > closed and exit, or use some other method for detecting that it should
> > terminate. We do this in multiple other places in the code, without using
> > pthread_cancel and without hanging. So it is certainly doable.
> >
> > I don't know the specifics of why Nathan's code is having trouble exiting,
> > but I suspect that a simple solution - not involving pthread_cancel - can
> > be readily developed.
> >
> >
> > On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
> > <gilles.gouaillar...@iferc.org> wrote:
> >
> >> Folks,
> >>
> >> i would like to comment on r31738 :
> >>
> >>> There is no reason to cancel the listening thread. It should die
> >>> automatically when the file descriptor is closed.
> >> i could not agree more
> >>> It is sufficient to just wait for the thread to exit with pthread_join.
> >> unfortunately, at least in my test environment (an outdated MPSS 2.1) it
> >> is *not* :-(
> >>
> >> this is what i described in #4615
> >> https://svn.open-mpi.org/trac/ompi/ticket/4615
> >> in which i attached scif_hang.c, which shows that (at least in my
> >> environment)
> >> scif_poll(...) does *not* return after scif_close(...) is called, and
> >> hence the scif pthread never ends.
> >>
> >> this is likely a bug in MPSS and it might have been fixed in a later
> >> release.
> >>
> >> Nathan, could you try scif_hang in your environment and report the MPSS
> >> version you are running ?
> >>
> >>
> >> bottom line, and once again, in my test environment, pthread_join(...)
> >> without pthread_cancel(...)
> >> might cause a hang when the btl/scif module is released.
> >>
> >>
> >> assuming the bug is in old MPSS and has been fixed in recent releases,
> >> what is the OpenMPI policy ?
> >> a) test the MPSS version and call pthread_cancel() or do *not* call
> >> pthread_join if buggy MPSS is detected ?
> >> b) display an error/warning if a buggy MPSS is detected ?
> >> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
> >> MPSS, it is in MPI_Finalize() so impact is limited */
> >> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
> >> problem after all ?
> >> e) something else ?
> >>
> >> Gilles
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2014/05/14786.php
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
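
For reference, one common shape for the "some other method for detecting that it should terminate" mentioned in the quoted thread is a wakeup pipe: the thread polls its working descriptor together with the read end of a pipe, and the main thread writes one byte before joining. The sketch below is an illustration only. It uses plain POSIX poll() and pipes, the names (struct listener, listen_loop, listener_start, listener_stop, shutdown_demo) are hypothetical, it is not the btl/scif code, and it makes no assumption about whether scif_poll() can be woken the same way.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical names throughout; plain POSIX, not the btl/scif code. */
struct listener {
    int work_fd;     /* descriptor the thread normally waits on */
    int wake_rd;     /* read end of the shutdown pipe */
    int wake_wr;     /* write end, used only by listener_stop() */
    pthread_t tid;
};

static void *listen_loop(void *arg)
{
    struct listener *l = arg;
    struct pollfd fds[2] = {
        { .fd = l->work_fd, .events = POLLIN },
        { .fd = l->wake_rd, .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0)
            break;   /* sketch: give up on any error (real code would retry on EINTR) */
        if (fds[1].revents)
            break;   /* shutdown requested through the wakeup pipe */
        if (fds[0].revents & (POLLHUP | POLLERR | POLLNVAL))
            break;   /* working descriptor went away */
        if (fds[0].revents & POLLIN) {
            char buf[256];
            if (read(l->work_fd, buf, sizeof buf) <= 0)
                break;
            /* ... handle the incoming data here ... */
        }
    }
    return NULL;     /* normal return: locals and held resources unwind as usual */
}

int listener_start(struct listener *l, int work_fd)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    l->work_fd = work_fd;
    l->wake_rd = p[0];
    l->wake_wr = p[1];
    if (pthread_create(&l->tid, NULL, listen_loop, l) != 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    return 0;
}

int listener_stop(struct listener *l)
{
    char byte = 0;
    assert(l != NULL);
    if (write(l->wake_wr, &byte, 1) != 1)
        return -1;   /* could not signal the thread, so joining might hang */
    int rc = pthread_join(l->tid, NULL);
    close(l->wake_rd);
    close(l->wake_wr);
    return rc == 0 ? 0 : -1;
}

/* Starts a listener on an idle pipe and shuts it down again; returns 0 on success. */
int shutdown_demo(void)
{
    int work[2];
    struct listener l;
    if (pipe(work) != 0)
        return -1;
    int rc = listener_start(&l, work[0]);
    if (rc == 0)
        rc = listener_stop(&l);
    close(work[0]);
    close(work[1]);
    return rc;
}
```

Here pthread_join() does not depend on close() waking the poller, and the thread leaves through an ordinary return, so no cancellation cleanup is needed.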