I looked through the patch a bit more today and had a few notes/questions.

- orte_errmgr.post_startup() starts the persistent RML message handling. There does not seem to be a shutdown counterpart (to deregister the RML receive at orte_finalize time). Was this intentional, or just missed?

- orte_errmgr.set_fault_callback: it would be nice if it returned the previous callback, so you could layer more than one 'thing' on top of ORTE and have them chain in a sigaction-like manner (a rough sketch of the idea follows below).

- orte_process_info.max_procs: this seems to be used only in the binomial routed component, but I was a bit unclear about its purpose. Can you describe what it does and how it is used?

- orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION message here. Why not push all of that logic into the errmgr components? It is not a big deal, just curious.
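[Editor's sketch] To make the chaining idea in the second item concrete, here is a minimal sketch assuming set_fault_callback were changed to return the previously registered callback (it does not in the posted patch). The stand-in types and the one-argument callback signature are illustrative assumptions, not the patch's actual API.

    #include <stddef.h>   /* NULL */

    /* Stand-in declarations; the real types live in the ORTE headers and the
     * real entry point is orte_errmgr.set_fault_callback.  The return value
     * below is the hypothetical part: the posted patch returns nothing. */
    typedef struct orte_process_name_t orte_process_name_t;
    typedef void (*fault_cb_fn_t)(orte_process_name_t *failed_proc);
    extern fault_cb_fn_t set_fault_callback(fault_cb_fn_t cb);

    static fault_cb_fn_t previous_cb = NULL;

    /* this layer's handler: act on the failure, then chain to whoever was
     * registered before us, the way a saved sigaction handler is re-invoked */
    static void my_layer_fault_cb(orte_process_name_t *failed_proc)
    {
        /* ... this layer's recovery / bookkeeping ... */

        if (NULL != previous_cb) {
            previous_cb(failed_proc);
        }
    }

    static void my_layer_register(void)
    {
        /* save the callback we are replacing so we can forward to it */
        previous_cb = set_fault_callback(my_layer_fault_cb);
    }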
I'll probably send more notes after some more digging and testing of the code. But the patch is looking good. Good work!

-- Josh

On Tue, Jun 7, 2011 at 10:51 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> I briefly looked over the patch. Excluding the epochs (which we don't need now, but will soon) it looks similar to what I have set up on my MPI run-through stabilization branch - so it should support that work nicely. I'll try to test it this week and send back any other comments.
>
> Good work.
>
> Thanks,
> Josh
>
> On Tue, Jun 7, 2011 at 10:46 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
>> This could certainly work alongside ORCM or any other fault detection/prediction/recovery mechanism. Most of the code is just dedicated to keeping the epoch up to date and tracking the status of the processes. The underlying idea was to provide a way for the application to decide what its fault policy would be, rather than trying to dictate one in the runtime. If any other layer wanted to register a callback function with this code, it could do anything it wanted on top of it.
>>
>> Wesley
>>
>> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>>
>> I'm on travel this week, but will look this over when I return. From the description, it sounds nearly identical to what we did in ORCM, so I expect there won't be many issues. You do get some race conditions that the new state machine code should help resolve.
>>
>> The only difference I can quickly see is that we chose not to modify the process name structure, keeping the "epoch" (we called it "incarnation") as a separate value. Since we aren't terribly concerned about backward compatibility, I don't consider this a significant issue - but it is something the community should recognize.
>>
>> My main concern will be to ensure that the new code contains enough flexibility to allow integration with other layers such as ORCM without creating potential conflict over "double protection" - i.e., if the layer above ORTE wants to provide a certain level of fault protection, then ORTE needs to get out of the way.
>>
>> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>
>> WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemon) and application-level processes. This patch extends the orte_process_name_t structure with a field to store the process epoch (the number of times it has died so far), and adds an application failure notification callback function that can be registered with the runtime.
>>
>> WHY: Necessary to correctly implement the error handling in the MPI 2.2 standard. In addition, such a resilient runtime is a cornerstone for any level of fault tolerance support we want to provide in the future (such as the MPI-3 Run-Through Stabilization or FT-MPI).
>>
>> WHEN:
>>
>> WHERE: Patch attached to this email, based on trunk r24747.
>>
>> TIMEOUT: 2 weeks from now, on Monday 20 June.
>>
>> ------
>>
>> MORE DETAILS:
>>
>> Currently the infrastructure required to enable any kind of fault tolerance development in Open MPI (with the exception of checkpoint/restart) is missing. However, before developing any fault tolerance support at the application (MPI) level, we need to have a resilient runtime. The changes in this patch address this lack of support and would allow anyone to implement a fault tolerance protocol at the MPI layer without having to worry about ORTE stabilization.
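[Editor's sketch] To make the WHAT item above concrete, a stand-alone sketch of the two additions it describes. The field layout, typedef widths, and the callback type name are assumptions based on the description, not copied from the patch.

    #include <stdint.h>

    /* Stand-in typedefs; the real definitions live in the ORTE headers. */
    typedef uint32_t orte_jobid_t;
    typedef uint32_t orte_vpid_t;
    typedef uint32_t orte_epoch_t;    /* assumed width of the new epoch counter */

    /* orte_process_name_t as described: the existing (jobid, vpid) pair plus
     * the epoch, i.e. how many times this process has terminated so far. */
    typedef struct {
        orte_jobid_t jobid;
        orte_vpid_t  vpid;
        orte_epoch_t epoch;
    } orte_process_name_t;

    /* The second addition: an application failure-notification callback type
     * that a layer above ORTE can register with the runtime. */
    typedef void (*orte_errmgr_fault_callback_t)(orte_process_name_t *failed_proc);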
>> This patch will allow the runtime to drop any dead daemons and re-route all communications around the holes, in order to __ALWAYS__ deliver a message as long as the destination process is alive. The application is informed (via a callback) about the loss of processes with the same jobid. In this patch we do not address the MPI_ERROR_RETURN type of failures; we focused on the MPI_ERROR_ABORT ones. Moreover, we empower the application level with the decision, instead of making it down in the runtime.
>>
>> NEW STUFF:
>>
>> Epoch - A counter that tracks the number of times a process has been detected to have terminated, either from a failure or an expected termination. After the termination is detected, the HNP coordinates all other processes' knowledge of the new epoch. Each ORTED will know the epoch of the other processes in the job, but it will not actually store anything until the epochs change.
>>
>> Run-Through Stabilization - When an ORTED (or the HNP) detects that another process has terminated, it repairs its routing layer and informs the HNP. The HNP tells all other processes about the failure so they can also repair their routing layers and update their internal bookkeeping. The processes do not abort after the termination is detected.
>>
>> Callback Function - When the HNP tells all the ORTEDs about the failures, they tell the ORTE layers within the applications. The application-level ORTE layers have a callback function that they use to inform the OMPI layer about the error. Currently the OMPI errhandler code fills in this callback function, so it is informed when there is an error and it aborts (to maintain the current default behavior of MPI). This callback function can also be used in an ORTE-only application to perform application-based fault tolerance (ABFT) and allow the application to continue.
>>
>> NECESSARY FOR IMPLEMENTATION:
>>
>> Epoch - The orte_process_name_t struct now has a field for the epoch. This means that whenever a message is sent, the most current version of the epoch needs to be in this field. This is a simple lookup using the function in orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c code, there is a check to make sure that it isn't trying to send messages to a process that has already terminated (don't send to a process with an epoch less than the current epoch). Make sure that if you are sending a message, you have the most up-to-date data here.
>>
>> Routing - So far, only the binomial routing layer has been updated to use the new resilience features. To modify other routing layers to continue running after a process failure, they need to be able to detect which processes are not currently running and route around them. The errmgr gives the routing layer two chances to do this: first it calls delete_route for each process that fails, then it calls update_routing_tree after it has appropriately marked each process. Before either of these happens, the epoch and process state have already been updated, so the routing layer can use this data to determine which processes are alive and which are dead. A convenience function has been added to orte/util/nidmap.h, orte_util_proc_is_running(), which allows the ORTEDs to determine the status of a process.
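[Editor's sketch] A hedged illustration of the two helpers named above in use before sending a message. orte_util_lookup_epoch() and orte_util_proc_is_running() come from the description; their exact signatures, the include paths other than orte/util/nidmap.h, and the surrounding RML call are assumptions.

    #include "opal/dss/dss_types.h"    /* opal_buffer_t (assumed path) */
    #include "orte/constants.h"        /* ORTE_ERR_UNREACH (assumed path) */
    #include "orte/mca/rml/rml.h"      /* orte_rml.send_buffer(), orte_rml_tag_t */
    #include "orte/util/nidmap.h"      /* orte_util_lookup_epoch(), orte_util_proc_is_running() */

    /* Assumed-signature sketch: refresh the epoch before sending, and skip
     * peers that are known to be dead. */
    static int send_if_alive(orte_process_name_t *peer,
                             opal_buffer_t *buf, orte_rml_tag_t tag)
    {
        /* always stamp the name with the most current epoch we know of */
        peer->epoch = orte_util_lookup_epoch(peer);

        /* ORTE_EPOCH_MIN means the peer has not started yet, not that it failed */
        if (ORTE_EPOCH_MIN != peer->epoch && !orte_util_proc_is_running(peer)) {
            return ORTE_ERR_UNREACH;   /* let the caller re-route around the hole */
        }

        return orte_rml.send_buffer(peer, buf, tag, 0);
    }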
>> Keep in mind that a process is also "not running" if it hasn't started up yet, so it is wise to check the epoch as well (make sure it isn't ORTE_EPOCH_MIN) to confirm that you are actually detecting an error and not just noticing that an ORTED hasn't finished starting.
>>
>> Callback - If you want to implement some sort of fault tolerance on top of this code, use the callback function in the errmgr framework. There is a new function in the errmgr code called set_fault_callback that takes a function pointer. The ompi_init code sets this to a default value just after it calls orte_init (to make sure that there is an errmgr to call into). If you later set this to a new function, you will get the callback notifying you of process failures. Remember that you'll need to handle any MPI-level fault tolerance at that point, because you've taken away the callback for the OMPI layer.

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
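[Editor's sketch] Closing illustration of the Callback paragraph quoted above: a minimal ORTE-only program installing its own fault callback. Only orte_errmgr.set_fault_callback is named in the description; the header paths, the one-argument callback signature, and the ORTE_PROC_NON_MPI flag are assumptions.

    #include "opal/util/output.h"         /* opal_output() */
    #include "orte/runtime/runtime.h"     /* orte_init(), orte_finalize() (assumed path) */
    #include "orte/mca/errmgr/errmgr.h"   /* orte_errmgr module */
    #include "orte/util/name_fns.h"       /* ORTE_NAME_PRINT (assumed path) */

    /* assumed one-argument callback signature */
    static void app_fault_cb(orte_process_name_t *failed_proc)
    {
        /* application-based fault tolerance (ABFT): record the failure and
         * keep running instead of aborting, since we now own the fault policy */
        opal_output(0, "peer %s reported as failed (epoch %d)",
                    ORTE_NAME_PRINT(failed_proc), (int)failed_proc->epoch);
    }

    int main(int argc, char *argv[])
    {
        orte_init(&argc, &argv, ORTE_PROC_NON_MPI);

        /* replace the default installed at startup; from this point on, the
         * application (not the OMPI errhandler path) decides what a failure means */
        orte_errmgr.set_fault_callback(app_fault_cb);

        /* ... application work; failed peers trigger app_fault_cb ... */

        orte_finalize();
        return 0;
    }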