Ok, thanks.  In the meantime, please roll back to the v3.0.0 tag and you should 
be good.  Sorry for the hassle.  :-(


On Jun 25, 2014, at 12:19 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:

> Hi,
> Sorry for the incomplete description. I will trace the problem more closely 
> later next week and provide details.
> 
> M
> 
> 
> On Mon, Jun 23, 2014 at 10:13 PM, Jeff Squyres (jsquyres) 
> <jsquy...@cisco.com> wrote:
> Ok, just got in to Chicago from my flight and am back online.
> 
> Mike: you are still not providing very much information.  :-\
> 
> Your first mails make it seem like MTT is continuing to run, but leaving 
> "launchers" (presumably mpirun processes) still running with no children, 
> which would be very weird for mpirun to do.  In that case, this could be 
> both an MTT and an ORTE bug.
> 
> But your last mail seems to imply that MTT is hanging indefinitely.
> 
> Can you please provide a clear, precise description of what is happening?
> 
> FWIW: Yes, we are killing the parent first now, to give mpirun a chance to 
> clean up / tell remote orteds to die / kill child processes / etc.  Killing 
> the children first doesn't test the common case of how people kill MPI 
> processes (i.e., they kill mpirun), and it also doesn't allow mpirun to tell 
> remote processes to die.
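
For illustration only, the parent-first ordering described above can be sketched roughly as follows. This is not MTT's actual code; the function name `terminate_launcher`, the grace period, and the returned strings are all hypothetical. It only uses Python's standard `subprocess` and `signal` modules: send SIGTERM to the launcher so it has a chance to clean up, wait a while, and escalate to SIGKILL only if it is still alive.

```python
import signal
import subprocess
import sys


def terminate_launcher(proc, grace_seconds=5.0):
    """Stop a launcher process, parent first (hypothetical helper).

    Sends SIGTERM so the launcher can clean up its own children, then
    escalates to SIGKILL if it has not exited within grace_seconds.
    Returns the name of the signal that was sent last.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be caught or ignored
        proc.wait()   # reap the launcher so it does not linger as a zombie
        return "SIGKILL"


if __name__ == "__main__":
    # Stand-in for mpirun: a process that just sleeps.
    launcher = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(terminate_launcher(launcher))
```

Note that the final `wait()` matters in both branches: whoever started the launcher is also the one who has to reap it.
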
> 
> Do you run with --verbose output?  MTT should output messages like "*** 
> Killing mpirun with SIGTERM", and the like.  Do you see timeout messages at 
> all?  I.e., is MTT not entering the timeout code at all?
> 
> ...etc.
> 
> 
> 
> On Jun 23, 2014, at 12:16 PM, Dave Goodell (dgoodell) <dgood...@cisco.com> 
> wrote:
> 
> > On Jun 23, 2014, at 8:48 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> >
> >> BTW, I think that now, when the parent process is killed before the 
> >> child, the OS marks the child as "<defunct>", which sticks around for good.
> >
> > The grandparent should inherit the child.  If the grandparent then does not 
> > wait(2) on the child, then the child will remain a zombie / defunct.  So in 
> > our specific case, this behavior will depend on what the parent process of 
> > mpirun is and whether it is waiting on child processes appropriately.
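
As a rough sketch of what "waiting on child processes appropriately" can look like, the snippet below reaps every already-exited child without blocking. The helper names `reap_exited_children` and `spawn_short_lived_child` are made up for this example and are not part of MTT or Open MPI. One caveat, as an aside: on Linux an orphaned process is normally reparented to init (PID 1) or to the nearest process that registered itself as a child subreaper, rather than literally to the grandparent, so which process must do the reaping depends on the system.

```python
import os
import time


def reap_exited_children():
    """Collect every child that has already exited, without blocking.

    Returns a dict mapping pid -> exit code (or the negated signal
    number if the child was killed by a signal).
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        elif os.WIFSIGNALED(status):
            reaped[pid] = -os.WTERMSIG(status)
    return reaped


def spawn_short_lived_child(exit_code):
    """Fork a child that exits immediately; it stays defunct until reaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid


if __name__ == "__main__":
    child = spawn_short_lived_child(3)
    time.sleep(0.5)  # the child is a zombie ("<defunct>") during this window
    print(reap_exited_children())
```

Until the `waitpid` call runs, the exited child shows up as `<defunct>` in `ps`; after it, the entry disappears.
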
> >
> > -Dave
> >
> > _______________________________________________
> > mtt-devel mailing list
> > mtt-de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0633.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0634.php
> 
> _______________________________________________
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0637.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
