> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch being too high or too low,
> unless you are envisioning an overall system where procs try to maintain some
> global notion of the value - which sounds like a race condition begging to
> cause problems.
When we say epoch we mean a value that is stored locally. When a failure is detected, the detector notifies the HNP, which notifies everyone else. Thus everyone will _eventually_ receive the notification that the process has failed. It may take a while for you to receive the notification, but in the meantime you will behave normally. When you do receive the notification that the failure occurred, you update your local copy of the epoch. This is similar to the definition of the "perfect" failure detector that Josh references: it doesn't matter if you don't find out about the failure immediately, as long as you find out about it eventually. If you aren't actually in the same jobid as the failed process you might never find out about the failure, because it does not apply to you.

> Are you then thinking that MPI processes are going to detect failure instead
> of local orteds?? Right now, no MPI process would ever report failure of a
> peer - the orted detects failure using the sigchild and reports it. What
> mechanism would the MPI procs use, and how would that be more reliable than
> sigchild??

Definitely not. ORTEDs are the processes that detect and report the failures. They can detect the failure of other ORTEDs or of applications - basically anything to which they have a connection.

> So right now the HNP can -never- receive more than one failure report at a
> time for a process. The only issue we've been working is that there are
> several pathways for reporting that error - e.g., if the orted detects the
> process fails and reports it, and then the orted itself fails, we can get
> multiple failure events back at the HNP before we respond to the first one.
>
> Not the same issue as having MPI procs reporting failures...

This is where the epoch becomes necessary. When reporting a failure, you tell the HNP which process failed by name, including the epoch. Thus the HNP will not mark a process as having failed twice (which would increment the epoch twice and notify everyone about the failure twice). The HNP might receive multiple notifications because more than one ORTED could (and often will) detect the failure. It is easier to have the HNP decide what is a failure and what is a duplicate than to have the ORTEDs reach some consensus about the fact that a process has failed - much less overhead this way. (There is a rough sketch of this at the end of this message.)

> > I'm not sure what ORCM does in the respect, but I don't know of anything in
> > ORTE that would track this data other than the process state and that
> > doesn't keep track of anything beyond one failure (which admittedly isn't
> > an issue until we implement process recovery).
>
> We aren't having any problems with process recovery and process state -
> without tracking epochs. We only track "incarnations" so that we can pass it
> down to the apps, which use that info to guide their restart.
>
> Could you clarify why you are having a problem in this regard? Might help to
> better understand your proposed changes.

I think we're talking about the same thing here. The only difference is that I'm not looking at the ORCM code, so I don't have the "incarnations".
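To make the above concrete, here is a rough sketch of the epoch bookkeeping I'm describing. This is not the actual ORTE code - every name in it (proc_name_t, hnp_failure_report(), notify_all_daemons(), the per-vpid tables) is made up purely for illustration, and real code would hang the epoch off the existing job/proc structures rather than a toy array:

```c
/* Sketch only -- not ORTE code.  All names invented for illustration. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t epoch;              /* incarnation counter, stored locally */
} proc_name_t;

#define MAX_PROCS 16             /* toy bookkeeping: one epoch per vpid */

/* ---- HNP side: accept one failure report per incarnation ------------- */

static uint32_t hnp_epochs[MAX_PROCS];

static void notify_all_daemons(const proc_name_t *failed)
{
    printf("HNP: broadcasting failure of %u.%u (epoch %u)\n",
           failed->jobid, failed->vpid, failed->epoch);
}

static void hnp_failure_report(const proc_name_t *reported)
{
    if (reported->epoch < hnp_epochs[reported->vpid]) {
        /* Some daemon already reported this incarnation and the epoch
         * was bumped, so this is a duplicate: drop it. */
        printf("HNP: duplicate report for %u.%u, ignoring\n",
               reported->jobid, reported->vpid);
        return;
    }

    /* First report for this incarnation: bump the epoch exactly once
     * and notify everyone.  Late duplicates will carry the old epoch
     * and fall into the branch above. */
    hnp_epochs[reported->vpid] = reported->epoch + 1;
    notify_all_daemons(reported);
}

/* ---- Every other process: update its local copy when told ------------ */

static void on_failure_notice(uint32_t *my_epochs, const proc_name_t *failed)
{
    /* The notice may arrive long after the failure; until it does we
     * behave normally.  Procs in other jobids may never receive it at
     * all, which is fine -- the failure does not concern them. */
    if (failed->epoch >= my_epochs[failed->vpid]) {
        my_epochs[failed->vpid] = failed->epoch + 1;
    }
}

int main(void)
{
    uint32_t peer_epochs[MAX_PROCS] = { 0 };

    /* Two daemons both detect that proc 1.3 (epoch 0) died and report it. */
    proc_name_t dead = { .jobid = 1, .vpid = 3, .epoch = 0 };
    hnp_failure_report(&dead);              /* first report: bump + broadcast */
    hnp_failure_report(&dead);              /* second report: duplicate       */

    on_failure_notice(peer_epochs, &dead);  /* a peer updates its local copy  */
    printf("peer now records epoch %u for proc 1.3\n", peer_epochs[3]);
    return 0;
}
```

The point of the sketch is just the comparison on the reported epoch: the HNP only bumps and broadcasts once per incarnation, and everyone else updates its local copy whenever the notice eventually shows up.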