Re: [OMPI devel] Orte cleanup

Ralph H Castain Wed, 5 Mar 2008 14:08:09 -0500

Wow, this took 4.5 hours to get through our Lab's email filter! You must
have been very bad recently. ;-))  Probably because you are being mean to my
poor little orteds...


We still don't have a reliable way for mpirun to detect that orteds have
crashed. I am working on some methods right now that look like they will
work in TM (and perhaps SLURM) environments, and am working on ensuring that
mpirun can reliably tell all the other orteds to die when this is detected.
There already is a mechanism in the system that was a first cut at ensuring
all orteds get the "die" message, but I'm not convinced it is truly robust
yet.

Brian contacted me via iPhone email to indicate that he might be willing to
restore the functionality whereby an app process would detect it had no RML
comm routes available any more and "abort". Hopefully, when he gets
somewhere where a longer message is possible, he can confirm (or deny!) that
understanding.
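
To make the idea concrete, here is a minimal sketch of that kind of check,
assuming the app talks to its local daemon over an ordinary stream socket.
It is not ORTE, RML, or OOB code: `daemon_link_lost` and `demo` are
hypothetical names, and `socket.socketpair()` merely stands in for the
app-to-daemon connection. A real implementation would live in the OOB's
event loop and would call abort rather than return a flag.

```python
import select
import socket


def daemon_link_lost(sock: socket.socket) -> bool:
    """Return True if the peer end of `sock` has closed the connection.

    Hypothetical helper for illustration only; not an ORTE/RML/OOB API.
    """
    # Zero timeout: poll the socket without ever blocking the caller.
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # nothing pending, so no evidence the link is gone
    try:
        # Peek so that any real message data is left in the buffer.
        data = sock.recv(1, socket.MSG_PEEK)
    except (ConnectionResetError, BrokenPipeError):
        return True
    # A readable socket that yields no bytes has reached end-of-file.
    return data == b""


def demo() -> tuple:
    # socketpair() is an assumption standing in for the daemon connection.
    app_end, daemon_end = socket.socketpair()
    before = daemon_link_lost(app_end)
    daemon_end.close()  # simulate the local daemon dying
    after = daemon_link_lost(app_end)
    app_end.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

In this sketch the app side only learns that the link is gone when it looks
at the socket, so the check would have to run from whatever loop already
services the connection.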

Let me know if I can be of help.
Ralph



On 3/5/08 7:39 AM, "Aurélien Bouteiller" <[email protected]> wrote:

> Scenario 2 is definitely one of those we have been experiencing (we are
> making some changes to orte, and this led some orteds to crash). I will
> try to find a way to easily reproduce the other one, where aborted MPI
> processes are left behind (but no orted).
> 
> Thanks,
> Aurelien
> 
> 
> On 5 March 08 at 08:43, Ralph H Castain wrote:
> 
>> Awesome. I haven't been seeing this behavior, but I won't swear that it
>> is anywhere near fully tested.
>> 
>> A couple of possibilities come to mind:
>> 
>> 1. Are you building threaded? If so, then all bets are off. The new
>> release of orte depends heavily on libevent. As George pointed out on
>> the Tues telecon, libevent is definitely not thread safe. So, if you are
>> building threaded, you can just about guarantee a problem will occur,
>> especially if something crashes.
>> 
>> 2. Are the orteds crashing? If so, and you are using the tree routed
>> module (which is the default), then application procs will be blocked
>> from finalizing since they will not be able to complete the barrier in
>> MPI_Finalize. That barrier relies on the RML to communicate between each
>> process and the rank=0 process. In the tree routed module, all RML
>> communication is done through the local daemon - if that daemon dies
>> during the job, then comm is broken. There currently is no recovery
>> mechanism, nor does the OOB sense that the daemon socket is gone and
>> abort the proc. We probably need to develop at least a method for doing
>> the latter so that things don't just hang.
>> 
>> That is all I can think of immediately. If you can tell me more about
>> the scenario, I can try to look at it.
>> 
>> Thanks
>> Ralph
>> 
>> 
>> 
>> On 3/4/08 9:37 PM, "Aurélien Bouteiller" <[email protected]> wrote:
>> 
>>> I noticed that the new release of orte is not as good as it used to be
>>> at cleaning up the mess left by crashed/aborted MPI processes. Recently
>>> we have been experiencing a lot of zombie or livelocked processes
>>> running on the cluster nodes and disturbing subsequent experiments. I
>>> haven't really had time to investigate the issue; maybe Ralph can open a
>>> ticket if he is able to reproduce this.
>>> 
>>> Aurelien
>>> --
>>> * Dr. Aurélien Bouteiller
>>> * Sr. Research Associate at Innovative Computing Laboratory
>>> * University of Tennessee
>>> * 1122 Volunteer Boulevard, suite 350
>>> * Knoxville, TN 37996
>>> * 865 974 6321
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> [email protected]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
