Okay, finally have time to sit down and review this. It looks pretty much 
identical to what was done in ORCM - we just kept "epoch" separate from the 
process name and used multicast to notify all procs that someone failed. I do 
have a few questions/comments about your proposed patch:

1. I note that in some places you just set peer_name.epoch = proc_name.epoch, 
and in others you make the assignment by calling a new API 
orte_ess.proc_get_epoch(&proc_name). Ditto for proc_set_epoch. What are the 
rules for when each method should be used? Which leads to...
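
For concreteness, here is my paraphrase of the two styles (sketch only - the 
names come from the patch, but the surrounding code is mine, and I'm guessing 
that proc_get_epoch() returns the epoch for the name you pass in):

    #include "orte/mca/ess/ess.h"

    /* sketch, not patch code: the two ways the epoch field gets filled in */
    static void fill_peer_epoch(orte_process_name_t *peer,
                                orte_process_name_t *proc)
    {
        /* style 1: direct copy from a name we already hold */
        peer->epoch = proc->epoch;

        /* style 2: ask the ESS for the epoch it currently has on record */
        peer->epoch = orte_ess.proc_get_epoch(proc);
    }

If both are legal, it would help to document when the cached value in the name 
can be trusted and when the ESS lookup is required.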

2. I'm puzzled as to why you are storing process state and epoch number in the 
modex as well as in the process name and orte_proc_t struct. This creates a bit 
of a race condition as the two will be out-of-sync for some (probably small) 
period of time, and looks like unnecessary duplication. Is there some reason 
for doing this? We are trying to eliminate duplicate storage because of the 
data confusion and memory issues, hence my question.

3. As a follow-on to #2, I am bothered that we now have the ESS storing proc 
state. That isn't the functional purpose of the ESS - that's a PLM function. Is 
there some reason for doing this in the ESS? Why aren't we just looking at the 
orte_proc_t for that proc and using its state field? I guess I can understand 
if you want to get that via an API (instead of having code to look up the 
proc_t in multiple places), but then let's put it in the PLM, please. I note 
that it is only used in the binomial routing code, so why not just put a static 
function in there to get the state of a proc rather than creating another API?
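
Something along these lines is what I have in mind (untested sketch; I'm 
assuming the proc_t is reachable through the usual job-data lookup on the 
daemons):

    /* untested sketch: a local helper inside the binomial routed component
     * instead of a new ESS API */
    static orte_proc_state_t get_proc_state(orte_process_name_t *proc)
    {
        orte_job_t *jdata;
        orte_proc_t *pdata;

        if (NULL == (jdata = orte_get_job_data_object(proc->jobid))) {
            return ORTE_PROC_STATE_UNDEF;
        }
        pdata = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs,
                                                          proc->vpid);
        return (NULL == pdata) ? ORTE_PROC_STATE_UNDEF : pdata->state;
    }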

4. ess_base_open.c: the default orte_ess module appears to be missing an entry 
for proc_set_epoch.

5. I really don't think that notification of proc failure belongs in the 
orted_comm - messages notifying of proc failure should be received in the 
errmgr. This allows people who want to handle things differently (e.g., ORCM) 
to create their own errmgr component(s) for daemons and HNP that send the 
messages over their desired messaging system, decide how they want to respond, 
etc. Putting it in orted_comm forces everyone to use only this one method, 
which conflicts with allowing others the freedom to explore alternative 
methods, and frankly, I don't see any strong reason that outweighs that 
limitation.

6. I don't think this errmgr_fault_callback registration is going to work, per 
my response to Josh's RFC. I'll leave the discussion in that thread.


On Jun 6, 2011, at 1:00 PM, George Bosilca wrote:

> WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> (daemon) and application-level processes. This patch extends the 
> orte_process_name_t structure with a field to store the process epoch (the 
> number of times it has died so far), and adds an application failure 
> notification callback function to be registered in the runtime. 
> 
> WHY: Necessary to correctly implement the error handling in the MPI 2.2 
> standard. In addition, such a resilient runtime is a cornerstone for any 
> level of fault tolerance support we want to provide in the future (such as 
> the MPI-3 Run-Through Stabilization or FT-MPI).
> 
> WHEN:
> 
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
> 
> ------
> 
> MORE DETAILS:
> 
> Currently the infrastructure required to enable any kind of fault tolerance 
> development in Open MPI (with the exception of checkpoint/restart) is 
> missing. However, before developing any fault tolerance support at the 
> application (MPI) level, we need to have a resilient runtime. The changes in 
> this patch address this lack of support and would allow anyone to implement a 
> fault tolerance protocol at the MPI layer without having to worry about 
> ORTE stabilization.
> 
> This patch will allow the runtime to drop any dead daemons, and re-route all 
> communications around the holes in order to __ALWAYS__ deliver a message as 
> long as the destination process is alive. The application is informed (via a 
> callback) about the loss of processes with the same jobid. In this patch 
> we do not address the MPI_ERROR_RETURN type of failures; we focus on the 
> MPI_ERROR_ABORT ones. Moreover, we empower the application level with the 
> decision instead of making it down in the runtime.
> 
> NEW STUFF:
> 
> Epoch - A counter that tracks the number of times a process has been detected 
> to have terminated, either from a failure or an expected termination. After 
> the termination is detected, the HNP coordinates all other processes' 
> knowledge of the new epoch. Each ORTED will know the epoch of the other 
> processes in the job, but it will not actually store anything until the 
> epochs change. 
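
For anyone skimming, my reading of the name extension is the following (jobid 
and vpid are the existing fields; the typedef name for the epoch is my guess):

    struct orte_process_name_t {
        orte_jobid_t jobid;   /* job the process belongs to */
        orte_vpid_t  vpid;    /* rank within that job */
        orte_epoch_t epoch;   /* times the process has been declared terminated */
    };
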
> 
> Run-Through Stabilization - When an ORTED (or HNP) detects that another 
> process has terminated, it repairs the routing layer and informs the HNP. The 
> HNP tells all other processes about the failure so they can also repair their 
> routing layers and update their internal bookkeeping. The processes do not 
> abort after the termination is detected.
> 
> Callback Function - When the HNP tells all the ORTEDs about the failures, 
> they tell the ORTE layers within the applications. The application level ORTE 
> layers have a callback function that they use to inform the OMPI layer about 
> the error. Currently the OMPI errhandler code fills in this callback function 
> so it is informed when there is an error and it aborts (to maintain the 
> current default behavior of MPI). This callback function can also be used in 
> an ORTE only application to perform application based fault tolerance (ABFT) 
> and allow the application to continue.
> 
> NECESSARY FOR IMPLEMENTATION:
> 
> Epoch - The orte_process_name_t struct now has a field for epoch. This means 
> that whenever sending a message, the most current version of the epoch needs 
> to be in this field. This is a simple look up using the function in 
> orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c 
> code, there is a check to make sure that it isn’t trying to send messages to 
> a process that has already terminated (don’t send to a process with an epoch 
> less than the current epoch). Make sure that if you are sending a message, 
> you have the most up-to-date data here.
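
If I'm reading this right, the send-side rule boils down to something like the 
following (my sketch; I'm assuming orte_util_lookup_epoch() takes the name and 
returns the epoch currently on record):

    #include "orte/util/nidmap.h"

    /* sketch: refresh the epoch before sending, and refuse to send to an
     * incarnation of the peer that has already terminated */
    static bool ok_to_send(orte_process_name_t *peer)
    {
        orte_epoch_t current = orte_util_lookup_epoch(peer);

        if (peer->epoch < current) {
            return false;    /* that incarnation is already gone */
        }
        peer->epoch = current;   /* carry the most up-to-date epoch */
        return true;
    }
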
> 
> Routing - So far, only the binomial routing layer has been updated to use the 
> new resilience features. To modify other routing layers to be able to 
> continue running after a process failure, they need to be able to detect 
> which processes are not currently running and route around them. The errmgr 
> gives the routing layer two chances to do this. First it calls delete_route 
> for each process that fails, then it calls update_routing_tree after it has 
> appropriately marked each process. Before either of these things happens, the 
> epoch and process state have already been updated, so the routing layer can 
> use this data to determine which processes are alive and which are dead. A 
> convenience function has been added to orte/util/nidmap.h called 
> orte_util_proc_is_running() which allows the ORTEDs to determine the status 
> of a process. Keep in mind that a process is not running if it hasn’t started 
> up yet so it is wise to check the epoch (to make sure that it isn’t 
> ORTE_EPOCH_MIN) as well to make sure that you’re actually detecting an error 
> and not just noticing that an ORTED hasn’t finished starting.
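
In other words (my sketch; signatures guessed from the description above):

    /* sketch: combine the two checks so a daemon that simply hasn't started
     * yet is not mistaken for a failed one */
    static bool proc_has_failed(orte_process_name_t *proc)
    {
        if (ORTE_EPOCH_MIN == orte_util_lookup_epoch(proc)) {
            return false;    /* never started, so nothing has failed */
        }
        return !orte_util_proc_is_running(proc);
    }
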
> 
> Callback - If you want to implement some sort of fault tolerance on top of 
> this code, use the callback function in the errmgr framework. There is a new 
> function in the errmgr code called set_fault_callback that takes a function 
> pointer. The ompi_init code sets this to a default value just after it calls 
> orte_init (to make sure that there is an errmgr to call into). If you later 
> set this to a new function, you will get the callback to notify you of 
> process failures. Remember that you’ll need to handle any sort of MPI level 
> fault tolerance at this point because you’ve taken away the callback for the 
> OMPI layer.
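
The registration described here would look roughly like the following (my 
sketch; I'm assuming the call is exposed through the orte_errmgr module like 
the other errmgr entry points, and the argument the callback receives - 
presumably the list of failed process names - is a guess, so check the patch's 
errmgr.h for the real prototype):

    #include "orte/mca/errmgr/errmgr.h"

    /* sketch: once this replaces the default OMPI callback, any MPI-level
     * recovery becomes this function's responsibility */
    static void my_fault_callback(opal_pointer_array_t *failed_procs)
    {
        /* application-based fault tolerance goes here */
    }

    static void enable_abft(void)
    {
        orte_errmgr.set_fault_callback(my_fault_callback);
    }
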
> 
> 
> 
> <orte-resil.patch.bz2>
> 

