We are definitely targeting ORTED failures here. If an ORTED fails, then any other ORTEDs connected to it will notice and report the failure. Of course, if the failure is in an application process, then the ORTED on that node will be the only one to detect it.
Also, if an ORTED is lost, all of the applications running underneath it are also lost because we have no way to communicate with them anymore.

On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:
> Quick question: could you please clarify this statement:
>
> > ...because more than one ORTED could (and often will) detect the failure.
>
> I don't understand how this can be true, except for detecting an ORTED
> failure. Only one orted can detect an MPI process failure, unless you have
> now involved orted's in MPI communications (and I don't believe you did). If
> the HNP directs another orted to restart that proc, and then that incarnation
> fails, then the epoch number -should- increment again, shouldn't it?
>
> So are you concerned (re having the HNP mark a proc down multiple times)
> about orted failure detection? In that case, I agree that you can have
> multiple failure detections - we dealt with it differently in orcm, but I
> have no issue with doing it another way. Just helps to know what problem you
> are trying to solve.
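
To make the duplicate-detection concern in this thread concrete, here is a minimal sketch in C of how a head-node process could ignore repeated reports of the same failure by comparing epochs. This is not Open MPI code: `proc_state_t`, `report_failure`, `restart_proc`, and the fixed-size `procs` table are hypothetical names assumed purely for illustration, and the real bookkeeping may look quite different.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 8   /* assumed table size, for illustration only */

/* Hypothetical per-process record kept by the head node. */
typedef struct {
    unsigned epoch;   /* incarnation counter, bumped on each restart */
    bool     alive;   /* false once the current incarnation is marked down */
} proc_state_t;

proc_state_t procs[MAX_PROCS];

/* Reset every process to its first incarnation. */
void init_procs(void)
{
    for (int i = 0; i < MAX_PROCS; i++) {
        procs[i].epoch = 0;
        procs[i].alive = true;
    }
}

/*
 * Handle one failure report. Several daemons may report the same failure,
 * so a report is accepted only if it names the current epoch and that
 * incarnation has not already been marked down.
 * Returns true if the report was acted on, false if it was a duplicate,
 * stale, or out of range.
 */
bool report_failure(int rank, unsigned reported_epoch)
{
    if (rank < 0 || rank >= MAX_PROCS) {
        return false;
    }
    proc_state_t *p = &procs[rank];
    if (!p->alive || reported_epoch != p->epoch) {
        return false;
    }
    p->alive = false;
    return true;
}

/* Start a new incarnation: this is where the epoch increments. */
void restart_proc(int rank)
{
    assert(rank >= 0 && rank < MAX_PROCS);
    procs[rank].epoch++;
    procs[rank].alive = true;
}
```

With this scheme, a second report for the same rank and epoch is simply dropped, while a failure of the restarted incarnation carries the new epoch and is accepted again.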