Ok, just got in to Chicago from my flight and am back online.

Mike: you are still not providing very much information.  :-\

Your first mails make it seem like MTT is continuing to run, but leaving 
"launchers" (presumably mpirun processes) still running with no children.  
That would be very weird for mpirun to do if it has no children left.  In 
this case, it could be both an MTT bug and an ORTE bug.

But your last mail seems to imply that MTT is hanging indefinitely.

Can you please provide a clear, precise description of what is happening?

FWIW: Yes, we are killing the parent first now, to give mpirun a chance to 
clean up / tell remote orteds to die / kill child processes / etc.  Killing 
the children first does not test the common case of how people kill MPI 
processes (i.e., they kill mpirun), and it also does not allow mpirun to tell 
remote processes to die.
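
To make the ordering concrete, here is a rough Python sketch of "kill the 
parent first".  This is not MTT's actual code; the function name 
kill_parent_first and the grace_seconds parameter are made up for 
illustration.  The idea is just: SIGTERM the launcher, give it a grace period 
to clean up its own children, and only escalate to SIGKILL if it ignores us.

```python
import signal
import subprocess
import sys


def kill_parent_first(proc, grace_seconds=5.0):
    """Terminate a launcher process politely, escalating only if needed.

    proc is a subprocess.Popen for the launcher (e.g., mpirun).
    Returns the name of the last signal that was sent.
    """
    # Ask the launcher to shut down; this is its chance to tell its
    # own children (local and remote) to die.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # The launcher did not exit in time; force it and reap it so
        # that it does not linger as a zombie of this process.
        proc.kill()
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    # Stand-in for mpirun: a child that just sleeps (hypothetical).
    launcher = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(kill_parent_first(launcher, grace_seconds=2.0))
```

Note that the final proc.wait() only reaps the launcher itself.  Anything the 
launcher leaves behind is not handled by this sketch.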

Do you run with --verbose output?  MTT should output messages like "*** Killing 
mpirun with SIGTERM", and the like.  Do you see timeout messages at all?  I.e., 
is MTT not entering the timeout code at all?

...etc.



On Jun 23, 2014, at 12:16 PM, Dave Goodell (dgoodell) <dgood...@cisco.com> 
wrote:

> On Jun 23, 2014, at 8:48 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> 
>> btw, i think now, when parent process is killed before child, OS makes child 
>> as "<defunct>" which stick around for good.
> 
> The grandparent should inherit the child.  If the grandparent then does not 
> wait(2) on the child, then the child will remain a zombie / defunct.  So in 
> our specific case, this behavior will depend on what the parent process of 
> mpirun is and whether it is waiting on child processes appropriately.
> 
> -Dave
> 
> _______________________________________________
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0633.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
