I looked through the patch a bit more today and had a few notes/questions.

- orte_errmgr.post_startup() starts the persistent RML message handling. There does not seem to be a shutdown counterpart (to deregister the RML receive at orte_finalize time). Was this intentional, or just missed?

- orte_errmgr.set_fault_callback: it would be nice if it returned the previous callback, so you could layer more than one 'thing' on top of ORTE and have them chain in a sigaction-like manner (a rough sketch of the idea follows below).

- orte_process_info.max_procs: this seems to be used only in the binomial routed component, but I was a bit unclear about its purpose. Can you describe what it does and how it is used?

- orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION message here. Why not push all of that logic into the errmgr components? It is not a big deal, just curious.
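[Editor's sketch] To make the chaining idea in the second item concrete, here is a minimal sketch assuming set_fault_callback were changed to return the previously registered callback (it does not in the posted patch). The stand-in types and the one-argument callback signature are illustrative assumptions, not the patch's actual API.

    #include <stddef.h>   /* NULL */

    /* Stand-in declarations; the real types live in the ORTE headers and the
     * real entry point is orte_errmgr.set_fault_callback.  The return value
     * below is the hypothetical part: the posted patch returns nothing. */
    typedef struct orte_process_name_t orte_process_name_t;
    typedef void (*fault_cb_fn_t)(orte_process_name_t *failed_proc);
    extern fault_cb_fn_t set_fault_callback(fault_cb_fn_t cb);

    static fault_cb_fn_t previous_cb = NULL;

    /* this layer's handler: act on the failure, then chain to whoever was
     * registered before us, the way a saved sigaction handler is re-invoked */
    static void my_layer_fault_cb(orte_process_name_t *failed_proc)
    {
        /* ... this layer's recovery / bookkeeping ... */

        if (NULL != previous_cb) {
            previous_cb(failed_proc);
        }
    }

    static void my_layer_register(void)
    {
        /* save the callback we are replacing so we can forward to it */
        previous_cb = set_fault_callback(my_layer_fault_cb);
    }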
I'll probably send more notes after some more digging and testing of the code. But the patch is looking good. Good work!

-- Josh

On Tue, Jun 7, 2011 at 10:51 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> I briefly looked over the patch. Excluding the epochs (which we don't need now, but will soon) it looks similar to what I have set up on my MPI run-through stabilization branch - so it should support that work nicely. I'll try to test it this week and send back any other comments.
>
> Good work.
>
> Thanks,
> Josh
>
> On Tue, Jun 7, 2011 at 10:46 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
>> This could certainly work alongside ORCM or any other fault detection/prediction/recovery mechanism. Most of the code is just dedicated to keeping the epoch up to date and tracking the status of the processes. The underlying idea was to provide a way for the application to decide what its fault policy would be, rather than trying to dictate one in the runtime. If any other layer wanted to register a callback function with this code, it could do anything it wanted on top of it.
>>
>> Wesley
>>
>> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>>
>> I'm on travel this week, but will look this over when I return. From the description, it sounds nearly identical to what we did in ORCM, so I expect there won't be many issues. You do get some race conditions that the new state machine code should help resolve.
>>
>> The only difference I can quickly see is that we chose not to modify the process name structure, keeping the "epoch" (we called it "incarnation") as a separate value. Since we aren't terribly concerned about backward compatibility, I don't consider this a significant issue - but it is something the community should recognize.
>>
>> My main concern will be to ensure that the new code contains enough flexibility to allow integration with other layers such as ORCM without creating potential conflict over "double protection" - i.e., if the layer above ORTE wants to provide a certain level of fault protection, then ORTE needs to get out of the way.
>>
>> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>
>> WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemon) and application-level processes. This patch extends the orte_process_name_t structure with a field to store the process epoch (the number of times it has died so far), and adds an application failure notification callback function that can be registered with the runtime.
>>
>> WHY: Necessary to correctly implement the error handling in the MPI 2.2 standard. In addition, such a resilient runtime is a cornerstone for any level of fault tolerance support we want to provide in the future (such as the MPI-3 Run-Through Stabilization or FT-MPI).
>>
>> WHEN:
>>
>> WHERE: Patch attached to this email, based on trunk r24747.
>>
>> TIMEOUT: 2 weeks from now, on Monday 20 June.
>>
>> ------
>>
>> MORE DETAILS:
>>
>> Currently the infrastructure required to enable any kind of fault tolerance development in Open MPI (with the exception of checkpoint/restart) is missing. However, before developing any fault tolerance support at the application (MPI) level, we need to have a resilient runtime. The changes in this patch address this lack of support and would allow anyone to implement a fault tolerance protocol at the MPI layer without having to worry about ORTE stabilization.
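[Editor's sketch] To make the WHAT item above concrete, a stand-alone sketch of the two additions it describes. The field layout, typedef widths, and the callback type name are assumptions based on the description, not copied from the patch.

    #include <stdint.h>

    /* Stand-in typedefs; the real definitions live in the ORTE headers. */
    typedef uint32_t orte_jobid_t;
    typedef uint32_t orte_vpid_t;
    typedef uint32_t orte_epoch_t;    /* assumed width of the new epoch counter */

    /* orte_process_name_t as described: the existing (jobid, vpid) pair plus
     * the epoch, i.e. how many times this process has terminated so far. */
    typedef struct {
        orte_jobid_t jobid;
        orte_vpid_t  vpid;
        orte_epoch_t epoch;
    } orte_process_name_t;

    /* The second addition: an application failure-notification callback type
     * that a layer above ORTE can register with the runtime. */
    typedef void (*orte_errmgr_fault_callback_t)(orte_process_name_t *failed_proc);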
>> This patch will allow the runtime to drop any dead daemons and re-route all communications around the holes, in order to __ALWAYS__ deliver a message as long as the destination process is alive. The application is informed (via a callback) about the loss of processes with the same jobid. In this patch we do not address the MPI_ERROR_RETURN type of failures; we focused on the MPI_ERROR_ABORT ones. Moreover, we empower the application level with the decision, instead of making it down in the runtime.
>>
>> NEW STUFF:
>>
>> Epoch - A counter that tracks the number of times a process has been detected to have terminated, either from a failure or an expected termination. After the termination is detected, the HNP coordinates all other processes' knowledge of the new epoch. Each ORTED will know the epoch of the other processes in the job, but it will not actually store anything until the epochs change.
>>
>> Run-Through Stabilization - When an ORTED (or the HNP) detects that another process has terminated, it repairs its routing layer and informs the HNP. The HNP tells all other processes about the failure so they can also repair their routing layers and update their internal bookkeeping. The processes do not abort after the termination is detected.
>>
>> Callback Function - When the HNP tells all the ORTEDs about the failures, they tell the ORTE layers within the applications. The application-level ORTE layers have a callback function that they use to inform the OMPI layer about the error. Currently the OMPI errhandler code fills in this callback function, so it is informed when there is an error and it aborts (to maintain the current default behavior of MPI). This callback function can also be used in an ORTE-only application to perform application-based fault tolerance (ABFT) and allow the application to continue.
>>
>> NECESSARY FOR IMPLEMENTATION:
>>
>> Epoch - The orte_process_name_t struct now has a field for the epoch. This means that whenever a message is sent, the most current version of the epoch needs to be in this field. This is a simple lookup using the function in orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c code, there is a check to make sure that it isn't trying to send messages to a process that has already terminated (don't send to a process with an epoch less than the current epoch). Make sure that if you are sending a message, you have the most up-to-date data here.
>>
>> Routing - So far, only the binomial routing layer has been updated to use the new resilience features. To modify other routing layers to continue running after a process failure, they need to be able to detect which processes are not currently running and route around them. The errmgr gives the routing layer two chances to do this: first it calls delete_route for each process that fails, then it calls update_routing_tree after it has appropriately marked each process. Before either of these happens, the epoch and process state have already been updated, so the routing layer can use this data to determine which processes are alive and which are dead. A convenience function has been added to orte/util/nidmap.h, orte_util_proc_is_running(), which allows the ORTEDs to determine the status of a process.
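[Editor's sketch] A hedged illustration of the two helpers named above in use before sending a message. orte_util_lookup_epoch() and orte_util_proc_is_running() come from the description; their exact signatures, the include paths other than orte/util/nidmap.h, and the surrounding RML call are assumptions.

    #include "opal/dss/dss_types.h"    /* opal_buffer_t (assumed path) */
    #include "orte/constants.h"        /* ORTE_ERR_UNREACH (assumed path) */
    #include "orte/mca/rml/rml.h"      /* orte_rml.send_buffer(), orte_rml_tag_t */
    #include "orte/util/nidmap.h"      /* orte_util_lookup_epoch(), orte_util_proc_is_running() */

    /* Assumed-signature sketch: refresh the epoch before sending, and skip
     * peers that are known to be dead. */
    static int send_if_alive(orte_process_name_t *peer,
                             opal_buffer_t *buf, orte_rml_tag_t tag)
    {
        /* always stamp the name with the most current epoch we know of */
        peer->epoch = orte_util_lookup_epoch(peer);

        /* ORTE_EPOCH_MIN means the peer has not started yet, not that it failed */
        if (ORTE_EPOCH_MIN != peer->epoch && !orte_util_proc_is_running(peer)) {
            return ORTE_ERR_UNREACH;   /* let the caller re-route around the hole */
        }

        return orte_rml.send_buffer(peer, buf, tag, 0);
    }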
>> Keep in mind that a process is also "not running" if it hasn't started up yet, so it is wise to check the epoch as well (make sure it isn't ORTE_EPOCH_MIN) to confirm that you are actually detecting an error and not just noticing that an ORTED hasn't finished starting.
>>
>> Callback - If you want to implement some sort of fault tolerance on top of this code, use the callback function in the errmgr framework. There is a new function in the errmgr code called set_fault_callback that takes a function pointer. The ompi_init code sets this to a default value just after it calls orte_init (to make sure that there is an errmgr to call into). If you later set this to a new function, you will get the callback notifying you of process failures. Remember that you'll need to handle any MPI-level fault tolerance at that point, because you've taken away the callback for the OMPI layer.

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
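[Editor's sketch] Closing illustration of the Callback paragraph quoted above: a minimal ORTE-only program installing its own fault callback. Only orte_errmgr.set_fault_callback is named in the description; the header paths, the one-argument callback signature, and the ORTE_PROC_NON_MPI flag are assumptions.

    #include "opal/util/output.h"         /* opal_output() */
    #include "orte/runtime/runtime.h"     /* orte_init(), orte_finalize() (assumed path) */
    #include "orte/mca/errmgr/errmgr.h"   /* orte_errmgr module */
    #include "orte/util/name_fns.h"       /* ORTE_NAME_PRINT (assumed path) */

    /* assumed one-argument callback signature */
    static void app_fault_cb(orte_process_name_t *failed_proc)
    {
        /* application-based fault tolerance (ABFT): record the failure and
         * keep running instead of aborting, since we now own the fault policy */
        opal_output(0, "peer %s reported as failed (epoch %d)",
                    ORTE_NAME_PRINT(failed_proc), (int)failed_proc->epoch);
    }

    int main(int argc, char *argv[])
    {
        orte_init(&argc, &argv, ORTE_PROC_NON_MPI);

        /* replace the default installed at startup; from this point on, the
         * application (not the OMPI errhandler path) decides what a failure means */
        orte_errmgr.set_fault_callback(app_fault_cb);

        /* ... application work; failed peers trigger app_fault_cb ... */

        orte_finalize();
        return 0;
    }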