Re: [OMPI devel] RFC: Resilient ORTE

Ralph Castain Tue, 7 Jun 2011 18:35:34 -0400

Thanks - that helps!


On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland <[email protected]> wrote:

> Definitely we are targeting ORTED failures here. If an ORTED fails, then
> any other ORTEDs connected to it will notice and report the failure. Of
> course, if the failure is in an application, then the ORTED on that node
> will be the only one to detect it.
>
> Also, if an ORTED is lost, all of the applications running underneath it
> are also lost because we have no way to communicate with them anymore.
>
> On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:
>
> Quick question: could you please clarify this statement:
>
> ...because more than one ORTED could (and often will) detect the failure.
>
>
> I don't understand how this can be true, except for detecting an ORTED
> failure. Only one orted can detect an MPI process failure, unless you have
> now involved orteds in MPI communications (and I don't believe you did). If
> the HNP directs another orted to restart that proc, and then that
> incarnation fails, then the epoch number -should- increment again, shouldn't
> it?
>
> So are you concerned (re having the HNP mark a proc down multiple times)
> about orted failure detection? In that case, I agree that you can have
> multiple failure detections - we dealt with it differently in orcm, but I
> have no issue with doing it another way. Just helps to know what problem you
> are trying to solve.
>
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
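
To make the duplicate-detection concern in the quoted discussion concrete, here is a minimal sketch of how an HNP-like component could use the epoch number to ignore repeated reports of the same failure. It is not ORTE code: the `FailureTracker` class, the `report_failure` method, and the name "orted-3" are all hypothetical, and it assumes each report carries the epoch the reporter last knew for the failed process.

```python
from dataclasses import dataclass, field


@dataclass
class FailureTracker:
    """Toy stand-in for the HNP's view of process epochs (hypothetical)."""

    # Maps a process name to its current epoch (incarnation count).
    epochs: dict = field(default_factory=dict)

    def report_failure(self, proc: str, epoch: int) -> bool:
        """Record a failure report; return True only if it is new.

        Assumption: a report is new when its epoch matches the epoch
        currently recorded for the process. Any other epoch is treated
        as a duplicate or stale report and ignored.
        """
        current = self.epochs.get(proc, 0)
        if epoch != current:
            return False
        # First report for this incarnation: mark it down by bumping
        # the epoch, so later reports about the same incarnation are
        # recognized as duplicates.
        self.epochs[proc] = current + 1
        return True


tracker = FailureTracker()
# Two daemons notice the same failure of incarnation 0.
first = tracker.report_failure("orted-3", 0)
second = tracker.report_failure("orted-3", 0)
# The restarted incarnation (epoch 1) fails later on.
after_restart = tracker.report_failure("orted-3", 1)
```

In this sketch the second report is dropped because the epoch has already moved on, while a failure of the restarted incarnation is counted and increments the epoch again.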
