On Jun 23, 2014, at 7:47 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:

> after patch, it killed child processes but kept mpirun ... itself.

What does that mean -- are you saying that mpirun is still running?  Was mpirun 
sent a signal at all?  What kind of messages are being displayed?  ...etc.

The commits fix important bugs for me and others.  Clearly, there's still 
something not right.  And of course I'm willing to track it down.  But I can't 
help you if you just say "it doesn't work."

> before that patch - all processes were killed (and you are right, "mpirun 
> died right at the end of the timeout" was reported)

...which led to many months of misleading ORTE debugging, BTW.  :-\  That's why 
this commit was introduced into MTT -- in the quest to finally fix both the 
mysterious ORTE hangs and the erroneous timeouts/"mpirun died right at the end" 
messages.

> but at least it left the cluster in the clean state w/o leftovers.
> now many "orphan" launchers  are alive from previous invocations.

Does "launchers" = mpirun?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
