Scenario 2 is definitely one of those we have been experiencing (we are
making some changes to orte, and this led some orteds to crash). I will
try to find an easy way to reproduce the other one, where aborted MPI
processes are left behind (but no orted).
Thanks,
Aurelien
On 5 March 08, at 08:43, Ralph H Castain wrote:
Awesome. I haven't been seeing this behavior, but I won't swear that it
is anywhere near fully tested.
A couple of possibilities come to mind:
1. Are you building threaded? If so, then all bets are off. The new
release of orte depends heavily on libevent. As George pointed out on
the Tuesday telecon, libevent is definitely not thread safe. So, if you
are building threaded, you can just about guarantee a problem will
occur, especially if something crashes.
2. Are the orteds crashing? If so, and you are using the tree routed
module (which is the default), then application procs will be blocked
from finalizing since they will not be able to complete the barrier in
MPI_Finalize. That barrier relies on the RML to communicate between
each process and the rank=0 process. In the tree routed module, all RML
communication is done through the local daemon. If that daemon dies
during the job, then comm is broken. There currently is no recovery
mechanism, nor does the OOB sense that the daemon socket is gone and
abort the proc. We probably need to develop at least a method for doing
the latter so that things don't just hang.
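As a rough illustration of the missing detection step described in item
2, here is a minimal sketch in C of how a process could notice that the
peer on a connected stream socket has gone away. It assumes a plain
POSIX stream socket; the name daemon_link_is_down is hypothetical and
is not an ORTE or OOB function, and the real OOB code may be structured
quite differently.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (NOT part of ORTE/OOB): returns 1 if the peer on a
 * connected stream socket has closed or the socket is in error, and 0 if
 * there is no evidence of a problem. It never blocks and never consumes
 * data from the socket. */
int daemon_link_is_down(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    /* Zero timeout: just sample the current state, retrying on EINTR. */
    do {
        rc = poll(&pfd, 1, 0);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        return 1;               /* poll itself failed: treat link as down */
    }
    if (rc == 0) {
        return 0;               /* nothing pending: link looks alive */
    }
    if (pfd.revents & (POLLERR | POLLNVAL)) {
        return 1;               /* socket error or invalid descriptor */
    }
    if (pfd.revents & (POLLIN | POLLHUP)) {
        char byte;
        /* Peek so that any real message stays queued for the reader. */
        ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
        if (n == 0) {
            return 1;           /* EOF: the peer closed the connection */
        }
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK &&
            errno != EINTR) {
            return 1;           /* hard error such as ECONNRESET */
        }
    }
    return 0;                   /* unread data is queued, or a transient condition */
}
```

A proc could run a check like this from its event loop and abort when it
returns 1, instead of waiting forever in the barrier. Note that if unread
data is still queued, the sketch reports the link as alive until that
data has been drained.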
That is all I can think of immediately. If you can tell me more about
the scenario, I can try to look at it.
Thanks
Ralph
On 3/4/08 9:37 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
I noticed that the new release of orte is not as good as it used to be
at cleaning up the mess left by crashed/aborted MPI processes. Recently
we have been experiencing a lot of zombie or livelocked processes
running on the cluster nodes and disturbing subsequent experiments. I
haven't really had time to investigate the issue; maybe Ralph can open
a ticket if he is able to reproduce this.
Aurelien
--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel