Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

Josh Hursey Thu, 25 Feb 2010 12:46:24 -0500

On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:

> 
> On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> 
>> Hmm... I'm really worried about this. I understand your choice since it is 
>> really a good solution for fail/stop/restart behaviour, but looking from the 
>> fail/recovery side, can you envision some alternative for the orted's 
>> reconfiguration on the fly?
> 
> Leonardo,
> 
> I don't see why the current code prohibits such behavior. However, I don't see 
> right now in this branch how the remaining daemons (and MPI processes) 
> reconstruct the communication topology, but this is just a technicality.


If you use the 'cm' routed component then the reconstruction of the ORTE level 
communication works for all but the loss of the HNP. Neither Ralph nor I have 
looked at supporting other routed components at this time. I know your group at 
UTK has done some work in this area so we wanted to tackle additional support 
for more scalable routed components as a second step, hopefully with 
collaboration from your group.

As for the MPI layer, I can't say much at this point on how that works. This 
RFC only handles recovery of the ORTE layer; MPI layer recovery is a second 
step and involves much longer discussions. I have a solution for a certain type 
of MPI application, and it sounds like you have something that can be applied 
more generally.

> 
> Anyway, this is the code that UT will bring in. All our work focuses on 
> keeping the existing environment up and running instead of restarting 
> everything. The orted will auto-heal (i.e., reshape the underlying topology, 
> recreate the connections, and so on), and the fault is propagated to the MPI 
> layer, which will decide what to do next.

Per my previous suggestion, would it be useful to chat on the phone early next 
week about our various strategies?

-- Josh


> 
>  george.
> 
> 
> 
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
