BTW, I think that now, when the parent process is killed before the child, the OS marks the child as "<defunct>", and it sticks around for good.
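For reference, here is a small sketch of how a "<defunct>" entry can arise. As far as I understand it, a child shows up as defunct (state "Z") once it has exited but its parent has not yet called wait() on it, and the entry goes away when the parent reaps it. This is only an illustration and not MTT or mpirun code: the function names are made up for the example, and it assumes Linux with /proc mounted.

```python
import os
import time


def child_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_until_reaped():
    """Fork a child that exits at once; report its state before and after reaping.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers

    # Parent: poll until the kernel reports the exited child as a zombie.
    deadline = time.time() + 5
    state = child_state(pid)
    while state != "Z" and time.time() < deadline:
        time.sleep(0.01)
        state = child_state(pid)

    os.waitpid(pid, 0)  # reap the child; this removes the defunct entry
    still_listed = os.path.exists(f"/proc/{pid}")
    return state, still_listed


if __name__ == "__main__":
    state, still_listed = zombie_until_reaped()
    print("state before wait:", state)
    print("still listed after wait:", still_listed)
```

If that is what we are seeing on the nodes, the leftovers would be children whose parent is still alive but no longer reaping them, which might be worth checking against the ps output.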

On Mon, Jun 23, 2014 at 4:11 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:

> it seems that mpirun got no signal (no evidence in the log). mtt was
> spinning and mpirun was a only process who left on the node.
> It was unclear why mtt did not kill mpirun.
> will try to extract perl stacktrace from mtt on tomorrow`s nightly run.
>
>
> On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Jun 23, 2014, at 7:47 AM, Mike Dubman <mi...@dev.mellanox.co.il>
>> wrote:
>>
>> > after patch, it killed child processes but kept mpirun ... itself.
>>
>> What does that mean -- are you saying that mpirun is still running? Was
>> mpirun sent a signal at all? What kind of messages are being displayed?
>> ...etc.
>>
>> The commits fix important bugs for me and others. Clearly, there's still
>> something not right. And of course I'm willing to track it down. But I
>> can't help you if you just say "it doesn't work."
>>
>> > before that patch - all processes were killed (and you are right,
>> "mpirun died right at the end of the timeout" was reported)
>>
>> ...which led to many months of misleading ORTE debugging, BTW. :-\
>> That's why this commit was introduced into MTT -- in the quest of finally
>> fixing both the mysterious ORTE hangs and the erroneous timeouts/"mpirun
>> died right at the end" messages.
>>
>> > but at least it left the cluster in the clean state w/o leftovers.
>> > now many "orphan" launchers are alive from previous invocations.
>>
>> Does "launchers" = mpirun?
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> mtt-devel mailing list
>> mtt-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0629.php
>>