It sounds more like suboptimal usage of the pthread cancellation helpers than a real issue with pthread_cancel itself. I do agree the usage is not necessarily straightforward even for a veteran coder, but the related issues belong to the realm of implementation, not the conceptual level.
George.

On Tue, May 13, 2014 at 10:56 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> George,
>
> Just my USD0.02:
>
> With pthreads, many system calls (mostly those that might block) become
> "cancellation points" where the implementation checks if the calling
> thread has been cancelled. This means that a thread making any of those
> calls may simply never return (calling pthread_exit() internally), unless
> extra work has been done to prevent this default behavior. This makes it
> very hard to write code that properly cleans up its resources, including
> (but not limited to) file descriptors and malloc()ed memory. Even if
> Open MPI is written very carefully, one cannot assume that all the
> libraries it calls (and their dependencies, etc.) are written to properly
> deal with cancellation.
>
> -Paul
>
> On Tue, May 13, 2014 at 7:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> I heard multiple references to pthread_cancel being known to have bad
>> side effects. Can somebody educate me on this topic please?
>>
>> Thanks,
>> George.
>>
>> On Tue, May 13, 2014 at 10:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> > It could be a bug in the software stack, though I wouldn't count on it.
>> > Unfortunately, pthread_cancel is known to have bad side effects, and so
>> > we avoid its use.
>> >
>> > The key here is that the thread must detect that the file descriptor
>> > has closed and exit, or use some other method for detecting that it
>> > should terminate. We do this in multiple other places in the code,
>> > without using pthread_cancel and without hanging. So it is certainly
>> > doable.
>> >
>> > I don't know the specifics of why Nathan's code is having trouble
>> > exiting, but I suspect that a simple solution - not involving
>> > pthread_cancel - can be readily developed.
>> >
>> > On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
>> > <gilles.gouaillar...@iferc.org> wrote:
>> >
>> >> Folks,
>> >>
>> >> i would like to comment on r31738:
>> >>
>> >>> There is no reason to cancel the listening thread. It should die
>> >>> automatically when the file descriptor is closed.
>> >>
>> >> i could not agree more
>> >>
>> >>> It is sufficient to just wait for the thread to exit with pthread
>> >>> join.
>> >>
>> >> unfortunately, at least in my test environment (an outdated MPSS 2.1),
>> >> it is *not* :-(
>> >>
>> >> this is what i described in #4615
>> >> https://svn.open-mpi.org/trac/ompi/ticket/4615
>> >> in which i attached scif_hang.c that evidences that (at least in my
>> >> environment) scif_poll(...) does *not* return after scif_close(...) is
>> >> called, and hence the scif pthread never ends.
>> >>
>> >> this is likely a bug in MPSS and it might have been fixed in a later
>> >> release.
>> >>
>> >> Nathan, could you try scif_hang in your environment and report the
>> >> MPSS version you are running?
>> >>
>> >> bottom line, and once again, in my test environment, pthread_join(...)
>> >> without pthread_cancel(...) might cause a hang when the btl/scif
>> >> module is released.
>> >>
>> >> assuming the bug is in old MPSS and has been fixed in recent releases,
>> >> what is the Open MPI policy?
>> >> a) test the MPSS version and call pthread_cancel(), or do *not* call
>> >> pthread_join(), if a buggy MPSS is detected?
>> >> b) display an error/warning if a buggy MPSS is detected?
>> >> c) do not call pthread_join() at all? /* SIGSEGV might occur with
>> >> older MPSS, but it is in MPI_Finalize() so the impact is limited */
>> >> d) do nothing, let the btl/scif module hang; this is *not* an
>> >> Open MPI problem after all?
>> >> e) something else?
>> >>
>> >> Gilles
>> >> _______________________________________________
>> >> devel mailing list
>> >> de...@open-mpi.org
>> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >> Link to this post:
>> >> http://www.open-mpi.org/community/lists/devel/2014/05/14786.php
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900