On Jun 23, 2014, at 7:47 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> after patch, it killed child processes but kept mpirun ... itself.

What does that mean -- are you saying that mpirun is still running? Was mpirun sent a signal at all? What kind of messages are being displayed? ...etc.

The commits fix important bugs for me and others. Clearly, there's still something not right. And of course I'm willing to track it down. But I can't help you if you just say "it doesn't work."

> before that patch - all processes were killed (and you are right, "mpirun
> died right at the end of the timeout" was reported)

...which led to many months of misleading ORTE debugging, BTW. :-\

That's why this commit was introduced into MTT -- in the quest of finally fixing both the mysterious ORTE hangs and the erroneous timeouts/"mpirun died right at the end" messages.

> but at least it left the cluster in the clean state w/o leftovers.
> now many "orphan" launchers are alive from previous invocations.

Does "launchers" = mpirun?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/