It seems that mpirun got no signal (there is no evidence of one in the log). mtt was
spinning, and mpirun was the only process left on the node.
It was unclear why mtt did not kill mpirun.
I will try to extract a Perl stack trace from mtt during tomorrow's nightly run.


On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> On Jun 23, 2014, at 7:47 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>
> > after patch, it killed child processes but kept mpirun ... itself.
>
> What does that mean -- are you saying that mpirun is still running?  Was
> mpirun sent a signal at all?  What kind of messages are being displayed?
>  ...etc.
>
> The commits fix important bugs for me and others.  Clearly, there's still
> something not right.  And of course I'm willing to track it down.  But I
> can't help you if you just say "it doesn't work."
>
> > before that patch - all processes were killed (and you are right,
> > "mpirun died right at the end of the timeout" was reported)
>
> ...which led to many months of misleading ORTE debugging, BTW.  :-\
>  That's why this commit was introduced into MTT -- in the quest of finally
> fixing both the mysterious ORTE hangs and the erroneous timeouts/"mpirun
> died right at the end" messages.
>
> > but at least it left the cluster in the clean state w/o leftovers.
> > now many "orphan" launchers  are alive from previous invocations.
>
> Does "launchers" = mpirun?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Link to this post:
> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0629.php
>
