On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:

> Hi Ralph,
>
> The "composite framework" idea is very interesting.
Josh is the force behind that idea :-)

> Regarding the schema represented by the picture, I didn't understand
> RecoS' behaviour in a node failure situation.
>
> In this case, will mpirun consider the daemon failure as a normal proc
> failure? If so, should mpirun update the global proc states for all
> jobs running under the failed daemon?

I haven't included the node failure case yet - still on my "to-do" list.

In brief, the answer is yes/no. :-)

Daemon failure follows the same code path as shown in the flow chart.
However, it is up to the individual modules to determine a response to
that failure. The "orcm" RecoS module's response is to (a) mark all procs
on that node as having failed, (b) mark that node as "down" so it won't
get reused, and (c) remap and restart all such procs on the remaining
available nodes, starting new daemon(s) as required.

In the orcm environment, nodes that are replaced or rebooted automatically
start their own daemon. This is detected by orcm, and the node state (if
the node was rebooted) is automatically updated to "up" - if it is a new
node, it is automatically added to the available resources. This allows
the node to be reused once the problem has been corrected. In other
environments (ssh, slurm, etc.), the node is simply left "down", as there
is no way to know if/when it becomes available again.

If you aren't using the "orcm" module, the default behavior is to abort
the job.
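
In rough pseudocode - a toy model with made-up types and names, not the
actual orcm code - that response looks something like this:

    /* Illustrative sketch of the orcm-style node-failure response.
     * All types/names here are invented for the example, not ORTE's. */
    typedef enum { PROC_RUNNING, PROC_FAILED } proc_state_t;
    typedef enum { NODE_UP, NODE_DOWN } node_state_t;

    typedef struct { int rank; int node_id; proc_state_t state; } proc_t;
    typedef struct { int id; node_state_t state; } node_t;

    static void on_node_failure(node_t *node, proc_t *procs, int nprocs)
    {
        node->state = NODE_DOWN;              /* (b) don't reuse the node */
        for (int i = 0; i < nprocs; i++) {
            if (procs[i].node_id == node->id) {
                procs[i].state = PROC_FAILED; /* (a) mark proc as failed */
                /* (c) this is where the real module would remap the proc
                 * onto a surviving node and call plm.spawn to restart it */
            }
        }
    }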
>
> Best regards,
> Leonardo
>
> On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote:
>
>> Hi George et al,
>>
>> I have begun documenting the RecoS operation on the OMPI wiki:
>>
>> https://svn.open-mpi.org/trac/ompi/wiki/RecoS
>>
>> I'll continue to work on this over the next few days by adding a
>> section explaining what was changed outside of the new framework to
>> make it all work. In addition, I am revising the recos.h API
>> documentation.
>>
>> Hope to have all that done over the weekend.
>>
>>
>> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:
>>
>>>
>>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>>>
>>>> Ralph, Josh,
>>>>
>>>> We have some comments about the API of the new framework, mostly
>>>> clarifications needed to better understand how this new framework is
>>>> supposed to be used, and a request for a deadline extension: delay
>>>> the code merge from the RecoS branch into the trunk by a week.
>>>>
>>>> We have our own FT branch, with a totally different approach than
>>>> what is described in your RFC. Unfortunately, it diverged from the
>>>> trunk about a year ago, and merging back has proven to be quite a
>>>> difficult task. Some of the functionality in the RecoS framework is
>>>> clearly beneficial for what we did, and has the potential to
>>>> facilitate porting most of the features from our branch back into
>>>> the trunk. We would like the deadline extension in order to analyze
>>>> in depth the impact of the RecoS framework on our work, and see how
>>>> we can fit everything together back in the Open MPI trunk.
>>>
>>> No problem with the extension - feel free to suggest modifications to
>>> make the merge easier. This is by no means cast in stone, but rather
>>> a starting point.
>>>
>>>>
>>>> Here are some comments about the code:
>>>>
>>>> 1. The documentation in recos.h is not very clear. Most of the
>>>> functions use only IN arguments and are not supposed to return any
>>>> values. We don't see how the functions are supposed to be used, or
>>>> what their impact on the ORTE framework data is supposed to be.
>>>
>>> I'll try to clarify the comments tonight (I know Josh is occupied
>>> right now). The recos APIs are called from two locations:
>>>
>>> 1. The errmgr calls recos whenever it receives a report of an aborted
>>> process (via the errmgr.proc_aborted API). The idea was for recos to
>>> determine what (if anything) to do about the failed process.
>>>
>>> 2. The rmaps modules can call the recos "suggest_map_targets" API to
>>> get a list of suggested nodes for the process that is to be
>>> restarted. At the moment, only the resilient mapper module does this.
>>> However, Josh and I are looking at reorganizing some functionality
>>> currently in that mapper module and making all of the existing
>>> mappers "resilient".
>>>
>>> So basically, the recos modules determine the recovery procedure and
>>> execute it. For example, in the "orcm" module, we actually update the
>>> various proc/job objects to prep them for restart and call plm.spawn
>>> from within that module. If instead you use the ignore module, it
>>> falls through to the recos base functions, which call "abort" to kill
>>> the job. Again, the action is taken locally within recos, so nothing
>>> need be returned.
>>>
>>> The functions generally don't return values (other than
>>> success/error) because we couldn't think of anything useful to return
>>> to the errmgr. Whatever recos does about an aborted proc, the errmgr
>>> does nothing further - if you look in that code, you'll see that if
>>> recos is enabled, all the errmgr does is call recos and return.
>>>
>>> Again, this can be changed if desired.
>>>
>>>>
>>>> 2. Why do we have all the char***? Why are they only declared as IN
>>>> arguments?
>>>
>>> I take it you mean in the predicted-fault API? I believe Josh was
>>> including that strictly as a placeholder. As you undoubtedly recall,
>>> I removed the fddp framework from the trunk (devel continues
>>> off-line), so Josh wasn't sure what I might want to input here. If
>>> you look at the modules themselves, you will see the implementation
>>> is essentially empty at this time.
>>>
>>> We had discussed simply removing that API for now, until we
>>> determined if/when fault prediction would return to the OMPI trunk.
>>> It was kind of a tossup - so we left it for now. It could just as
>>> easily be removed until a later date - either way is fine with us.
>>>
>>>>
>>>> 3. The orte_recos_base_process_fault_fn_t function uses the
>>>> node_list as an IN/OUT argument. Why? If the list is modified, then
>>>> we have a scalability problem, as the list will have to be rebuilt
>>>> before each call.
>>>
>>> Looking...looking...hmm.
>>>
>>>   typedef int (*orte_recos_base_process_fault_fn_t)
>>>       (orte_job_t *jdata, orte_process_name_t *proc_name,
>>>        orte_proc_state_t state, int *stack_state);
>>>
>>> There is no node list, or list of any type, going in or out of that
>>> function. I suspect you meant the one below it:
>>>
>>>   typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>>>       (orte_proc_t *proc, orte_node_t *oldnode,
>>>        opal_list_t *node_list);
>>>
>>> I concur with your concern about scalability here. However, I believe
>>> the idea was that we would pass in the proc that failed and is to be
>>> restarted, plus a pointer to the node it was last on, and return a
>>> list of candidate nodes where it could be restarted. Essentially,
>>> this is the equivalent of building the target node list that we do in
>>> the mappers whenever we map a job.
>>>
>>> So in the implementation, we use the rmaps base function to assemble
>>> the target node list for the app, and then go through some logic
>>> (e.g., remove the old node, look at fault groups and load balancing)
>>> to prune the list down. We then pass the resulting list back to the
>>> caller.
>>>
>>> If we are going to have frequent process failures, then rebuilding
>>> the candidate node list every time would indeed be a problem. I
>>> suspect we'll have to revisit that implementation at some point.
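>>>
>>> To make that flow concrete, a module's implementation conceptually
>>> looks like this (a sketch only - the helper functions are made up,
>>> not the actual orcm code; only the signature matches recos.h):
>>>
>>>   static int my_suggest_map_targets(orte_proc_t *proc,
>>>                                     orte_node_t *oldnode,
>>>                                     opal_list_t *node_list)
>>>   {
>>>       /* assemble the full target node list for the app, as the
>>>        * rmaps base function would (hypothetical helper) */
>>>       build_app_node_list(proc, node_list);
>>>
>>>       /* never suggest the node the proc just failed on */
>>>       remove_node_from_list(node_list, oldnode);
>>>
>>>       /* prune further: fault groups, load balancing, etc.
>>>        * (hypothetical helper) */
>>>       apply_fault_group_policy(node_list);
>>>
>>>       return ORTE_SUCCESS;
>>>   }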
>>>
>>> HTH
>>> Ralph
>>>
>>>>
>>>> Thanks,
>>>>   george.
>>>>
>>>> On Feb 19, 2010, at 12:59, Ralph Castain wrote:
>>>>
>>>>> WHAT: Merge a tmp branch for fault recovery development into the
>>>>> OMPI trunk
>>>>>
>>>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault
>>>>> recovery capabilities
>>>>>
>>>>> WHERE: Impacts a number of ORTE files and a small number of OMPI
>>>>> files
>>>>>
>>>>> TIMEOUT: Barring objections and/or requests for delay, the weekend
>>>>> of Feb 27-28
>>>>>
>>>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>>>>
>>>>> ======================================================================
>>>>>
>>>>> BACKGROUND:
>>>>>
>>>>> Josh and Ralph have been working on a private branch off of the
>>>>> trunk on extended fault recovery procedures, mostly impacting ORTE.
>>>>> The new code optionally allows ORTE to recover from failed nodes,
>>>>> moving processes to other nodes in order to maintain operation. In
>>>>> addition, the code provides better support for recovering from
>>>>> individual process failures.
>>>>>
>>>>> Not all of the work done on the private branch will be brought over
>>>>> in this commit. Some of the MPI-specific code that allows recovery
>>>>> from process failure on-the-fly will be committed separately at a
>>>>> later date.
>>>>>
>>>>> This commit will include the infrastructure to support those
>>>>> advanced recovery operations. Among other things, it will introduce
>>>>> a new "RecoS" (Recovery Service/Strategy) framework to allow
>>>>> multiple strategies for responding to failures. The default module,
>>>>> called "ignore", will stabilize the runtime environment for other
>>>>> RecoS components. In the absence of other RecoS components, it will
>>>>> trigger the default behavior (abort the job).
>>>>>
>>>>> This branch includes some configure modifications that allow a
>>>>> comma-separated list of options to be passed to the '--with-ft'
>>>>> option. This allows us to enable any combination of 'cr' and
>>>>> 'recos' at build time, specifically so that the RecoS functionality
>>>>> can be enabled independently of the C/R functionality. Most of the
>>>>> changes outside of the ORTE layer are due to symbol cleanup
>>>>> resulting from this modification.
>>>>>
>>>>> For example, C/R-specific code paths were previously incorrectly
>>>>> marked with:
>>>>>
>>>>>   #if OPAL_ENABLE_FT == 1
>>>>>
>>>>> They are now marked, where appropriate, as:
>>>>>
>>>>>   #if OPAL_ENABLE_FT_CR == 1
>>>>>
>>>>> Additionally, C/R-specific components have modified configure.m4
>>>>> files to change:
>>>>>
>>>>>   AS_IF([test "$ompi_want_ft" = "0"],
>>>>>
>>>>> to:
>>>>>
>>>>>   AS_IF([test "$ompi_want_ft_cr" = "0"],
>>>>>
>>>>> We have created a public repo (reference branch, above) with the
>>>>> code to be merged into the trunk. Please feel free to check it out
>>>>> and test it.
>>>>>
>>>>> NOTE: the new recovery capability is only active if...
>>>>> (a) you configure --with-ft=recos, and
>>>>> (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
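>>>>>
>>>>> For example (illustrative only - adjust the configure arguments and
>>>>> application for your site):
>>>>>
>>>>>   shell$ ./configure --with-ft=recos ...
>>>>>   shell$ make all install
>>>>>   shell$ export OMPI_MCA_recos_base_enable=1
>>>>>   shell$ mpirun -np 4 ./my_app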
>>>>>
>>>>> Comments, suggestions, and corrections are welcome!