+1 - seen it before, and you'll find warnings across many software sites about 
this problem. Easy to have the main program segfault by touching the wrong 
thing after a cancel unless all the stars are properly aligned in the various 
libraries.
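
To make the hazard concrete, here is a minimal sketch, assuming plain POSIX threads and nothing specific to Open MPI. The names blocked_reader, free_buffer, cancel_blocked_reader and cleanup_ran are hypothetical and exist only for illustration. A thread blocks in read(), which is a cancellation point, and is then cancelled. Control never comes back from read(), so the malloc()ed buffer is released only because a cleanup handler was registered beforehand. Any library code on the call path that skips this step would leak or leave its state half-updated.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical flag, set by the cleanup handler so callers can see it ran. */
static volatile int cleanup_ran = 0;

/* Cleanup handler: invoked if the thread is cancelled while it is registered. */
static void free_buffer(void *buf)
{
    free(buf);
    cleanup_ran = 1;
}

/* Worker: blocks in read() on a pipe that never receives any data. */
static void *blocked_reader(void *arg)
{
    int fd = *(int *)arg;
    char *buf = malloc(64);
    if (buf == NULL)
        return NULL;

    pthread_cleanup_push(free_buffer, buf);
    /* read() is a cancellation point: when the thread is cancelled here,
     * the code after this call is never executed. */
    ssize_t n = read(fd, buf, 64);
    (void)n;
    pthread_cleanup_pop(1);   /* normal (non-cancelled) path: free as well */
    return NULL;
}

/* Cancels a thread blocked in read().
 * Returns 1 if the thread ended by cancellation, 0 if not, -1 on setup error. */
static int cancel_blocked_reader(void)
{
    int fds[2];
    pthread_t tid;
    void *result = NULL;

    if (pipe(fds) != 0)
        return -1;
    if (pthread_create(&tid, NULL, blocked_reader, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pthread_cancel(tid);                 /* deferred: acted on at read() */
    int rc = pthread_join(tid, &result);
    assert(rc == 0);
    (void)rc;

    close(fds[0]);
    close(fds[1]);
    return result == PTHREAD_CANCELED;
}
```

Removing the pthread_cleanup_push()/pthread_cleanup_pop() pair from blocked_reader would leave the buffer allocated forever after the cancel, which is the kind of cleanup problem this thread is about.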



On May 13, 2014, at 7:56 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> George,
> 
> Just my USD0.02:
> 
> With pthreads many system calls (mostly those that might block) become 
> "cancellation points" where the implementation checks if the calling thread 
> has been cancelled.
> This means that a thread making any of those calls may simply never return 
> (calling pthread_exit() internally), unless extra work has been done to 
> prevent this default behavior.
> This makes it very hard to write code that properly cleans up its resources, 
> including (but not limited to) file descriptors and malloc()ed memory.
> Even if Open MPI is written very carefully, one cannot assume that all the 
> libraries it calls (and their dependencies, etc.) are written to properly 
> deal with cancellation.
> 
> -Paul
> 
> 
> On Tue, May 13, 2014 at 7:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> I heard multiple references to pthread_cancel being known to have bad
> side effects. Can somebody educate me on this topic please?
> 
>   Thanks,
>     George.
> 
> 
> 
> On Tue, May 13, 2014 at 10:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > It could be a bug in the software stack, though I wouldn't count on it. 
> > Unfortunately, pthread_cancel is known to have bad side effects, and so we 
> > avoid its use.
> >
> > The key here is that the thread must detect that the file descriptor has 
> > closed and exit, or use some other method for detecting that it should 
> > terminate. We do this in multiple other places in the code, without using 
> > pthread_cancel and without hanging. So it is certainly doable.
> >
> > I don't know the specifics of why Nathan's code is having trouble exiting, 
> > but I suspect that a simple solution - not involving pthread_cancel - can 
> > be readily developed.
> >
> >
> > On May 13, 2014, at 7:18 PM, Gilles Gouaillardet 
> > <gilles.gouaillar...@iferc.org> wrote:
> >
> >> Folks,
> >>
> >> i would like to comment on r31738 :
> >>
> >>> There is no reason to cancel the listening thread. It should die
> >>> automatically when the file descriptor is closed.
> >> i could not agree more
> >>> It is sufficient to just wait for the thread to exit with pthread join.
> >> unfortunately, at least in my test environment (an outdated MPSS 2.1) it
> >> is *not* :-(
> >>
> >> this is what i described in #4615
> >> https://svn.open-mpi.org/trac/ompi/ticket/4615
> >> in which i attached scif_hang.c that evidences that (at least in my
> >> environment)
> >> scif_poll(...) does *not* return after scif_close(...) is called, and
> >> hence the scif pthread never ends.
> >>
> >> this is likely a bug in MPSS and it might have been fixed in a later
> >> release.
> >>
> >> Nathan, could you try scif_hang in your environment and report the MPSS
> >> version you are running ?
> >>
> >>
> >> bottom line, and once again, in my test environment, pthread_join (...)
> >> without pthread_cancel(...)
> >> might cause a hang when the btl/scif module is released.
> >>
> >>
> >> assuming the bug is in old MPSS and has been fixed in recent releases,
> >> what is the OpenMPI policy ?
> >> a) test the MPSS version and call pthread_cancel() or do *not* call
> >> pthread_join if buggy MPSS is detected ?
> >> b) display an error/warning if a buggy MPSS is detected ?
> >> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
> >> MPSS, it is in MPI_Finalize() so impact is limited */
> >> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
> >> problem after all ?
> >> e) something else ?
> >>
> >> Gilles
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/devel/2014/05/14786.php
> >
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/05/14787.php
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14788.php
> 
> 
> 
> -- 
> Paul H. Hargrove                          phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14790.php
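
The alternative suggested in the quoted messages (the thread itself detects that it should terminate, and the owner only joins it) could be sketched roughly as follows. This is an illustration under stated assumptions, not Open MPI's btl/scif code: it uses ordinary POSIX pipe() and poll() as stand-ins, it does not use the SCIF API, and it does not establish whether the same pattern works with scif_poll() on the MPSS version discussed here. The struct listener type and the listener_main, listener_start and listener_stop functions are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical listener state; not Open MPI's actual structures. */
struct listener {
    int data_fd;        /* descriptor being serviced (stand-in endpoint)      */
    int wake_fd[2];     /* shutdown pipe: [0] polled by thread, [1] by owner  */
    int events_seen;    /* number of bytes consumed from data_fd              */
    pthread_t thread;
};

static void *listener_main(void *arg)
{
    struct listener *l = arg;
    struct pollfd pfds[2] = {
        { .fd = l->data_fd,    .events = POLLIN },
        { .fd = l->wake_fd[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(pfds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            break;                  /* unexpected error: give up      */
        }
        if (pfds[1].revents != 0)
            break;                  /* owner asked us to stop */
        if (pfds[0].revents != 0) {
            char c;
            if (read(l->data_fd, &c, 1) <= 0)
                break;              /* EOF or error: descriptor is gone */
            l->events_seen++;
        }
    }
    return NULL;                    /* normal return, so cleanup runs as usual */
}

/* Starts the listener thread. Returns 0 on success, -1 on error. */
static int listener_start(struct listener *l, int data_fd)
{
    l->data_fd = data_fd;
    l->events_seen = 0;
    if (pipe(l->wake_fd) != 0)
        return -1;
    if (pthread_create(&l->thread, NULL, listener_main, l) != 0) {
        close(l->wake_fd[0]);
        close(l->wake_fd[1]);
        return -1;
    }
    return 0;
}

/* Wakes the thread through the pipe and joins it; no pthread_cancel(). */
static int listener_stop(struct listener *l)
{
    char c = 'x';
    if (write(l->wake_fd[1], &c, 1) != 1)
        return -1;
    if (pthread_join(l->thread, NULL) != 0)
        return -1;
    close(l->wake_fd[0]);
    close(l->wake_fd[1]);
    return 0;
}
```

Because the thread leaves its loop through an ordinary return, the join cannot hang on a blocked poll as long as the wake-up descriptor is part of the polled set, and no library is interrupted mid-call.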
