Just to add to Josh's comment: I am now working on recovering from HNP failure as well. I should have that in a month or so.
On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>
> On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> >
> >> Hum... I'm really afraid about this. I understand your choice since it
> >> is really a good solution for fail/stop/restart behaviour, but looking
> >> from the fail/recovery side, can you envision some alternative for the
> >> orted's reconfiguration on the fly?
> >
> > Leonardo,
> >
> > I don't see why the current code prohibits such behavior. However, I don't
> > see right now in this branch how the remaining daemons (and MPI processes)
> > reconstruct the communication topology, but this is just a technicality.
>
> If you use the 'cm' routed component, then the reconstruction of the ORTE-level
> communication works for all failures but the loss of the HNP. Neither Ralph nor
> I have looked at supporting other routed components at this time. I know
> your group at UTK has done some work in this area, so we wanted to tackle
> additional support for more scalable routed components as a second step,
> hopefully with collaboration from your group.
>
> As far as the MPI layer goes, I can't say much at this point on how that works.
> This RFC only handles recovery of the ORTE layer; MPI-layer recovery is a
> second step and involves much longer discussions. I have a solution for a
> certain type of MPI application, and it sounds like you have something that
> can be applied more generally.
>
> > Anyway, this is the code that UT will bring in. All our work focuses on
> > keeping the existing environment up and running instead of restarting
> > everything. The orted will auto-heal (i.e. reshape the underlying topology,
> > recreate the connections, and so on), and the fault is propagated to the MPI
> > layer, which will decide what to do next.
>
> Per my previous suggestion, would it be useful to chat on the phone early
> next week about our various strategies?
>
> -- Josh
>
> > george.
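For anyone following along: the 'cm' routed component Josh refers to is selected through Open MPI's MCA parameter mechanism at launch time. A minimal illustrative invocation might look like the following (the application name and process count are placeholders, and the thread does not specify any additional recovery-related flags, so treat this only as a sketch of how a routed component is chosen):

```shell
# Launch an MPI job while selecting the 'cm' routed component
# via an MCA parameter. "./my_mpi_app" and "-np 4" are
# hypothetical placeholders, not values from this thread.
mpirun --mca routed cm -np 4 ./my_mpi_app
```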
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel