Re: [OMPI devel] RFC: Resilient ORTE

Wesley Bland Tue, 7 Jun 2011 10:46:30 -0400

This could certainly work alongside ORCM or any other fault 
detection/prediction/recovery mechanism. Most of the code is just dedicated to 
keeping the epoch up to date and tracking the status of the processes. The 
underlying idea was to provide a way for the application to decide what its 
fault policy would be rather than trying to dictate one in the runtime. If any 
other layer wanted to register a callback function with this code, it could do 
anything it wanted to on top of it.


Wesley

On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:

> I'm on travel this week, but will look this over when I return. From the 
> description, it sounds nearly identical to what we did in ORCM, so I expect 
> there won't be many issues. You do get some race conditions that the new 
> state machine code should help resolve.
> 
> Only difference I can quickly see is that we chose not to modify the process 
> name structure, keeping the "epoch" (we called it "incarnation") as a 
> separate value. Since we aren't terribly concerned about backward 
> compatibility, I don't consider this a significant issue - but something the 
> community should recognize. 
> 
> My main concern will be to ensure that the new code contains enough 
> flexibility to allow integration with other layers such as ORCM without 
> creating potential conflict over "double protection" - i.e., if the layer 
> above ORTE wants to provide a certain level of fault protection, then ORTE 
> needs to get out of the way. 
> 
> 
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> >  WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> > (daemons) and application-level processes. This patch extends the 
> > orte_process_name_t structure with a field to store the process epoch (the 
> > number of times it has died so far), and adds an application failure 
> > notification callback function to be registered in the runtime.
> > 
> >  WHY: Necessary to correctly implement the error handling in the MPI 2.2 
> > standard. In addition, such a resilient runtime is a cornerstone for any 
> > level of fault tolerance support we want to provide in the future (such as 
> > the MPI-3 Run-Through Stabilization or FT-MPI).
> > 
> >  WHEN:
> > 
> >  WHERE: Patch attached to this email, based on trunk r24747.
> >  TIMEOUT: 2 weeks from now, on Monday 20 June.
> > 
> >  ------
> > 
> >  MORE DETAILS:
> > 
> >  Currently the infrastructure required to enable any kind of fault 
> > tolerance development in Open MPI (with the exception of 
> > checkpoint/restart) is missing. However, before developing any 
> > fault-tolerance support at the application (MPI) level, we need to have a 
> > resilient runtime. The changes in this patch address this lack of support 
> > and would allow anyone to implement a fault tolerance protocol at the MPI 
> > layer without having to worry about the ORTE stabilization.
> > 
> >  This patch will allow the runtime to drop any dead daemons, and re-route 
> > all communications around the holes in order to __ALWAYS__ deliver a 
> > message as long as the destination process is alive. The application is 
> > informed (via a callback) about the loss of the processes with the same 
> > jobid. In this patch we do not address the MPI_ERRORS_RETURN type of 
> > failures; we focused on the MPI_ERRORS_ARE_FATAL ones. Moreover, we leave 
> > the decision to the application level instead of making it in the 
> > runtime.
> > 
> >  NEW STUFF:
> > 
> >  Epoch - A counter that tracks the number of times a process has been 
> > detected to have terminated, either from a failure or an expected 
> > termination. After the termination is detected, the HNP coordinates all 
> > other processes’ knowledge of the new epoch. Each ORTED will know the epoch 
> > of the other processes in the job, but it will not actually store anything 
> > until the epochs change.
> > 
> >  Run-Through Stabilization - When an ORTED (or HNP) detects that another 
> > process has terminated, it repairs the routing layer and informs the HNP. 
> > The HNP tells all other processes about the failure so they can also repair 
> > their routing layers and update their internal bookkeeping. The processes do 
> > not abort after the termination is detected.
> > 
> >  Callback Function - When the HNP tells all the ORTEDs about the failures, 
> > they tell the ORTE layers within the applications. The application level 
> > ORTE layers have a callback function that they use to inform the OMPI layer 
> > about the error. Currently the OMPI errhandler code fills in this callback 
> > function so it is informed when there is an error and it aborts (to 
> > maintain the current default behavior of MPI). This callback function can 
> > also be used in an ORTE-only application to perform application-based fault 
> > tolerance (ABFT) and allow the application to continue.
> > 
> >  NECESSARY FOR IMPLEMENTATION:
> > 
> >  Epoch - The orte_process_name_t struct now has a field for epoch. This 
> > means that whenever sending a message, the most current version of the 
> > epoch needs to be in this field. This is a simple lookup using the 
> > function in orte/util/nidmap.c: orte_util_lookup_epoch(). In the 
> > orte/orted/orted_comm.c code, there is a check to make sure that it isn’t 
> > trying to send messages to a process that has already terminated (don’t 
> > send to a process with an epoch less than the current epoch). Make sure 
> > that if you are sending a message, you have the most up to date data here.
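
The epoch check described above can be sketched roughly as follows. This is only an illustration, not code from the patch: demo_name_t, demo_epoch_table_t and the demo_* functions are hypothetical stand-ins (for orte_process_name_t and for whatever orte_util_lookup_epoch() and the check in orted_comm.c actually do), and the vpid-indexed table is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for orte_process_name_t with the new epoch field. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t epoch;   /* number of times this process has terminated so far */
} demo_name_t;

/* Hypothetical per-job table holding the current epoch of each vpid. */
typedef struct {
    const uint32_t *current_epoch;   /* indexed by vpid */
    size_t          num_procs;
} demo_epoch_table_t;

/* Stand-in for the lookup the RFC attributes to orte_util_lookup_epoch(). */
static uint32_t demo_lookup_epoch(const demo_epoch_table_t *t, uint32_t vpid)
{
    assert(t != NULL && vpid < t->num_procs);
    return t->current_epoch[vpid];
}

/* Refresh the epoch in a name just before using it as a destination. */
static void demo_refresh_epoch(const demo_epoch_table_t *t, demo_name_t *name)
{
    name->epoch = demo_lookup_epoch(t, name->vpid);
}

/* A destination is stale if its epoch is older than the current epoch,
 * i.e. it names an incarnation that has already terminated. */
static bool demo_may_send(const demo_epoch_table_t *t, const demo_name_t *dest)
{
    return dest->epoch >= demo_lookup_epoch(t, dest->vpid);
}
```

A sender would call demo_refresh_epoch() on the destination name right before sending, and the receiving side would drop or reject anything for which demo_may_send() returns false.
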
> > 
> >  Routing - So far, only the binomial routing layer has been updated to use 
> > the new resilience features. To modify other routing layers to be able to 
> > continue running after a process failure, they need to be able to detect 
> > which processes are not currently running and route around them. The errmgr 
> > gives the routing layer two chances to do this. First it calls delete_route 
> > for each process that fails, then it calls update_routing_tree after it has 
> > appropriately marked each process. Before either of these things happens, the 
> > epoch and process state have already been updated, so the routing layer can 
> > use this data to determine which processes are alive and which are dead. A 
> > convenience function has been added to orte/util/nidmap.h called 
> > orte_util_proc_is_running() which allows the ORTEDs to determine the status 
> > of a process. Keep in mind that a process is not running if it hasn’t 
> > started up yet so it is wise to check the epoch (to make sure that it isn’t 
> > ORTE_EPOCH_MIN) as well to make sure that you’re actually detecting an 
> > error and not just noticing that an ORTED hasn’t finished starting.
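
To make the "not running" versus "not started yet" distinction concrete, here is a small sketch of how a routing layer might classify a peer. Everything here is hypothetical: demo_proc_t stands in for whatever state orte_util_proc_is_running() consults, and DEMO_EPOCH_MIN is set to 0 purely as an assumption about ORTE_EPOCH_MIN.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_EPOCH_MIN 0u   /* assumed value; stands in for ORTE_EPOCH_MIN */

typedef enum {
    DEMO_PROC_NOT_STARTED,  /* not running, but has never terminated */
    DEMO_PROC_ALIVE,
    DEMO_PROC_FAILED        /* not running and has terminated at least once */
} demo_status_t;

/* Hypothetical view of what a daemon knows about one peer. */
typedef struct {
    bool     running;       /* what a "is it running" query would report */
    uint32_t epoch;         /* current epoch of the peer */
} demo_proc_t;

/* Combine the running flag with the epoch, as the RFC recommends. */
static demo_status_t demo_classify(const demo_proc_t *p)
{
    if (p->running) {
        return DEMO_PROC_ALIVE;
    }
    if (p->epoch == DEMO_EPOCH_MIN) {
        return DEMO_PROC_NOT_STARTED;   /* still starting up, not an error */
    }
    return DEMO_PROC_FAILED;
}

/* Only genuinely failed peers should be routed around. */
static bool demo_should_route_around(const demo_proc_t *p)
{
    return demo_classify(p) == DEMO_PROC_FAILED;
}
```

A routing component could apply demo_should_route_around() to each candidate next hop when its tree is rebuilt, so that a daemon that is merely slow to start is not treated as a hole.
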
> > 
> >  Callback - If you want to implement some sort of fault tolerance on top of 
> > this code, use the callback function in the errmgr framework. There is a 
> > new function in the errmgr code called set_fault_callback that takes a 
> > function pointer. The ompi_init code sets this to a default value just 
> > after it calls orte_init (to make sure that there is an errmgr to call 
> > into). If you later set this to a new function, you will get the callback 
> > to notify you of process failures. Remember that you’ll need to handle any 
> > sort of MPI level fault tolerance at this point because you’ve taken away 
> > the callback for the OMPI layer.
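
The registration pattern might look something like the sketch below. The RFC does not give the signature of set_fault_callback, so the callback type, the demo_* names, and the choice to return the previously installed callback are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t epoch;
} demo_name_t;

/* Hypothetical callback type; the real signature is not stated in the RFC. */
typedef void (*demo_fault_cb_t)(const demo_name_t *failed, size_t count);

static demo_fault_cb_t demo_fault_cb = NULL;

/* Install a new callback and return the previous one, so that a caller
 * replacing the default handler can still chain to it if desired. */
static demo_fault_cb_t demo_set_fault_callback(demo_fault_cb_t cb)
{
    demo_fault_cb_t prev = demo_fault_cb;
    demo_fault_cb = cb;
    return prev;
}

/* What the runtime side would do when told about failed processes. */
static void demo_report_failures(const demo_name_t *failed, size_t count)
{
    if (demo_fault_cb != NULL) {
        demo_fault_cb(failed, count);
    }
}

/* Example application handler: record the failures instead of aborting. */
static size_t demo_failures_seen = 0;

static void demo_count_failures(const demo_name_t *failed, size_t count)
{
    (void)failed;   /* a real handler would inspect the failed names */
    demo_failures_seen += count;
}
```

Returning the old callback is one way to deal with the caveat above: a layer that replaces the OMPI default can keep the returned pointer and call it for failures it does not want to handle itself.
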
> > 
> > _______________________________________________
> >  devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
