Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

George Bosilca Thu, 25 Feb 2010 11:32:10 -0500

On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:

> Hum... I'm really afraid about this. I understand your choice since it is 
> really a good solution for fail/stop/restart behaviour, but looking from the 
> fail/recovery side, can you envision some alternative for the orted's 
> reconfiguration on the fly?


Leonardo,

I don't see why the current code prohibits such behavior. I don't yet see in 
this branch how the remaining daemons (and MPI processes) reconstruct the 
communication topology, but this is just a technicality.

Anyway, this is the code that UT will bring in. All our work focuses on 
keeping the existing environment up and running instead of restarting 
everything. The orted will auto-heal (i.e., reshape the underlying topology, 
recreate the connections, and so on), and the fault is propagated to the MPI 
layer, which will decide what to do next.

  george.
