Quick question: could you please clarify this statement:

> ...because more than one ORTED could (and often will) detect the failure.

I don't understand how this can be true, except when detecting an ORTED failure. Only one orted can detect an MPI process failure, unless you have now involved orteds in MPI communications (and I don't believe you did). If the HNP directs another orted to restart that proc, and then that incarnation fails, then the epoch number -should- increment again, shouldn't it?

So is your concern (about having the HNP mark a proc down multiple times) really about orted failure detection? In that case, I agree that you can have multiple failure detections. We dealt with it differently in orcm, but I have no issue with doing it another way. It just helps to know what problem you are trying to solve.
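
To make the question concrete, here is a minimal sketch of how I picture epoch-based deduplication of failure reports. It is not ORTE code: the `Hnp` class, its method names, and the choice to bump the epoch at restart time (rather than when the failure is recorded) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProcState:
    epoch: int = 0      # incarnation counter for this proc
    alive: bool = True  # False once a failure has been recorded


class Hnp:
    """Hypothetical stand-in for the HNP's view of its procs (not the ORTE API)."""

    def __init__(self):
        self.procs = {}

    def launch(self, name):
        self.procs[name] = ProcState()

    def report_failure(self, name, epoch):
        """Record a failure report; return True only if it was acted on.

        Reports for an older incarnation, or duplicates for one that is
        already marked down, are ignored.
        """
        proc = self.procs[name]
        if epoch != proc.epoch or not proc.alive:
            return False
        proc.alive = False
        return True

    def restart(self, name):
        """Start a new incarnation; assumption: the epoch increments here."""
        proc = self.procs[name]
        proc.epoch += 1
        proc.alive = True
        return proc.epoch
```

With this shape, a second report for the same proc and epoch (say, from another orted that noticed the same failure) is dropped, while a failure of the restarted incarnation arrives with the new epoch and is acted on.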