Thanks - that helps!
On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland <wbl...@eecs.utk.edu> wrote:

> Definitely we are targeting ORTED failures here. If an ORTED fails then
> any other ORTEDs connected to it will notice and report the failure. Of
> course if the failure is an application then the ORTED on that node will be
> the only one to detect it.
>
> Also, if an ORTED is lost, all of the applications running underneath it
> are also lost because we have no way to communicate with them anymore.
>
> On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:
>
> Quick question: could you please clarify this statement:
>
> ...because more than one ORTED could (and often will) detect the failure.
>
>
> I don't understand how this can be true, except for detecting an ORTED
> failure. Only one orted can detect an MPI process failure, unless you have
> now involved orted's in MPI communications (and I don't believe you did). If
> the HNP directs another orted to restart that proc, and then that
> incarnation fails, then the epoch number -should- increment again, shouldn't
> it?
>
> So are you concerned (re having the HNP mark a proc down multiple times)
> about orted failure detection? In that case, I agree that you can have
> multiple failure detections - we dealt with it differently in orcm, but I
> have no issue with doing it another way. Just helps to know what problem you
> are trying to solve.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
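
For readers following the thread, the duplicate-detection concern discussed above (several ORTEDs reporting the same ORTED failure, with the HNP wanting to mark the process down only once per incarnation) can be illustrated with a small sketch. This is not Open MPI code: `FailureTracker`, `report_failure`, `restart`, and the name `"orted-3"` are hypothetical, and the sketch assumes the epoch is bumped when a process is restarted, which the thread does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class FailureTracker:
    """Hypothetical HNP-side bookkeeping; not Open MPI's actual data structure."""

    # Current epoch per process name; a process not yet seen is at epoch 0.
    epochs: dict = field(default_factory=dict)
    # Processes currently marked down.
    down: set = field(default_factory=set)

    def report_failure(self, proc, epoch):
        """Return True only for the first report of a failure in the current epoch."""
        current = self.epochs.get(proc, 0)
        if epoch != current or proc in self.down:
            # Stale report for an old incarnation, or a duplicate from another daemon.
            return False
        self.down.add(proc)
        return True

    def restart(self, proc):
        """Assumed behaviour: a restart starts a new epoch and clears the down mark."""
        self.epochs[proc] = self.epochs.get(proc, 0) + 1
        self.down.discard(proc)
        return self.epochs[proc]


tracker = FailureTracker()
first = tracker.report_failure("orted-3", 0)    # first daemon to notice the failure
second = tracker.report_failure("orted-3", 0)   # a neighbour reports the same failure
new_epoch = tracker.restart("orted-3")          # new incarnation, new epoch
third = tracker.report_failure("orted-3", new_epoch)  # the new incarnation fails
```

Under these assumptions only `first` and `third` cause the process to be marked down; the duplicate report is ignored, and a failure of the restarted incarnation is still counted because it carries the new epoch.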