Hi Ralph,

The "composite framework" idea is very interesting. Regarding the schema shown in the picture, I didn't understand RecoS's behaviour in a node-failure situation.
In this case, will mpirun treat the daemon failure as a normal proc failure? If so, should mpirun update the global proc states for all jobs running under the failed daemon?

Best regards,
Leonardo

On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote:

> Hi George et al
>
> I have begun documenting the RecoS operation on the OMPI wiki:
>
> https://svn.open-mpi.org/trac/ompi/wiki/RecoS
>
> I'll continue to work on this over the next few days by adding a section explaining what was changed outside of the new framework to make it all work. In addition, I am revising the recos.h API documentation.
>
> Hope to have all that done over the weekend.
>
>
> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:
>
>>
>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>>
>>> Ralph, Josh,
>>>
>>> We have some comments about the API of the new framework, mostly clarifications needed to better understand how this new framework is supposed to be used, and a request for a deadline extension to delay the code merge from the RecoS branch into the trunk by a week.
>>>
>>> We have our own FT branch, with a totally different approach from what is described in your RFC. Unfortunately, it diverged from the trunk about a year ago, and merging back has proven to be quite a difficult task. Some of the functionality in the RecoS framework is clearly beneficial for what we did, and has the potential to facilitate porting most of the features from our branch back into the trunk. We would like the deadline extension in order to analyze in depth the impact of the RecoS framework on our work, and to see how we can fit everything together back in the Open MPI trunk.
>>
>> No problem with the extension - feel free to suggest modifications to make the merge easier. This is by no means cast in stone, but rather a starting point.
>>
>>>
>>> Here are some comments about the code:
>>>
>>> 1. The documentation in recos.h is not very clear. Most of the functions use only IN arguments and are not supposed to return any values. We don't see how the functions are supposed to be used, or what their impact on the ORTE framework data is supposed to be.
>>
>> I'll try to clarify the comments tonight (I know Josh is occupied right now). The recos APIs are called from two locations:
>>
>> 1. The errmgr calls recos whenever it receives a report of an aborted process (via the errmgr.proc_aborted API). The idea was for recos to determine what (if anything) to do about the failed process.
>>
>> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a list of suggested nodes for the process that is to be restarted. At the moment, only the resilient mapper module does this. However, Josh and I are looking at reorganizing some functionality currently in that mapper module and making all of the existing mappers "resilient".
>>
>> So basically, the recos modules determine the recovery procedure and execute it. For example, in the "orcm" module, we actually update the various proc/job objects to prep them for restart and call plm.spawn from within that module. If instead you use the ignore module, it falls through to the recos base functions, which call "abort" to kill the job. Again, the action is taken locally within recos, so nothing needs to be returned.
>>
>> The functions generally don't return values (other than success/error) because we couldn't think of anything useful to return to the errmgr. Whatever recos does about an aborted proc, the errmgr doesn't do anything further - if you look in that code, you'll see that if recos is enabled, all the errmgr does is call recos and return.
>>
>> Again, this can be changed if desired.
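
(As an aside, a minimal sketch of the errmgr -> recos hand-off described above might look like the following. Every type and symbol name below is a stand-in for illustration, not the actual ORTE code; only the errmgr.proc_aborted/recos flow follows the description.)

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for orte_process_name_t / orte_proc_state_t */
typedef struct { int jobid; int vpid; } proc_name_t;
typedef int proc_state_t;

/* recos entry point: decide and execute the recovery action.  An "orcm"-style
 * module would prep the proc/job objects and call plm.spawn to restart; the
 * "ignore" module falls through to the base function, which aborts the job. */
static int recos_process_fault(proc_name_t *proc, proc_state_t state)
{
    printf("recos: handling fault of proc %d.%d (state %d)\n",
           proc->jobid, proc->vpid, state);
    return 0;  /* only success/error - nothing useful to hand back */
}

/* errmgr.proc_aborted: when recos is enabled, just call recos and return */
static void errmgr_proc_aborted(proc_name_t *proc, proc_state_t state,
                                bool recos_enabled)
{
    if (recos_enabled) {
        (void) recos_process_fault(proc, state);
        return;  /* the errmgr takes no further action */
    }
    printf("errmgr: no recovery enabled, aborting job of proc %d.%d\n",
           proc->jobid, proc->vpid);
}

int main(void)
{
    proc_name_t failed = { 1, 3 };
    errmgr_proc_aborted(&failed, /* state */ -1, /* recos_enabled */ true);
    return 0;
}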
>>
>>>
>>> 2. Why do we have all the char***? Why are they only declared as IN arguments?
>>
>> I take it you mean in the predicted-fault API? I believe Josh was including that strictly as a placeholder. As you undoubtedly recall, I removed the fddp framework from the trunk (devel continues off-line), so Josh wasn't sure what I might want to input here. If you look at the modules themselves, you will see the implementation is essentially empty at this time.
>>
>> We had discussed simply removing that API for now until we determined if/when fault prediction would return to the OMPI trunk. It was kind of a tossup - so we left it for now. It could just as easily be removed until a later date - either way is fine with us.
>>
>>>
>>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as an IN/OUT argument. Why? If the list is modified, then we have a scalability problem, as the list will have to be rebuilt before each call.
>>
>> Looking...looking...hmm.
>>
>> typedef int (*orte_recos_base_process_fault_fn_t)
>>     (orte_job_t *jdata, orte_process_name_t *proc_name, orte_proc_state_t state, int *stack_state);
>>
>> There is no node list, or list of any type, going in or out of that function. I suspect you meant the one below it:
>>
>> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
>>
>> I concur with your concern about scalability here. However, I believe the idea was that we would pass in the proc that failed and is to be restarted, plus a pointer to the node it was last on, and return a list of candidate nodes where it could be restarted. Essentially, this is the equivalent of building the target node list that we do in the mappers whenever we map a job.
>>
>> So in the implementation, we use the rmaps base function to assemble the target node list for the app, and then go through some logic (e.g., remove the old node, look at fault groups and load balancing) to prune the list down. We then pass the resulting list back to the caller.
>>
>> If we are going to have frequent process failures, then rebuilding the candidate node list every time would indeed be a problem. I suspect we'll have to revisit that implementation at some point.
>>
>> HTH
>> Ralph
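
(Again purely to illustrate, here is a rough sketch of that suggest_map_targets flow: pass in the failed proc's old node and get back a pruned candidate list. The node_t type, the fault_group field, and the pruning rules are all stand-ins for what the real code does with orte_node_t/opal_list_t; rebuilding this list on every failure is exactly where the scalability concern above comes from.)

#include <stdio.h>
#include <string.h>

/* stand-in for orte_node_t: just a name and a fault-group id */
typedef struct { const char *name; int fault_group; } node_t;

/* Assemble the candidate list as the rmaps base would, then prune it:
 * drop the node the proc last ran on and anything sharing its fault group. */
static int suggest_map_targets(const node_t *oldnode,
                               const node_t *avail, int navail,
                               const node_t **out, int max_out)
{
    int nout = 0;
    for (int i = 0; i < navail && nout < max_out; i++) {
        if (0 == strcmp(avail[i].name, oldnode->name)) continue;    /* remove the old node   */
        if (avail[i].fault_group == oldnode->fault_group) continue; /* avoid its fault group */
        out[nout++] = &avail[i];
    }
    return nout;  /* number of candidate nodes for the restart */
}

int main(void)
{
    node_t avail[] = { {"n0", 0}, {"n1", 0}, {"n2", 1}, {"n3", 2} };
    node_t oldnode = { "n1", 0 };
    const node_t *targets[4];
    int n = suggest_map_targets(&oldnode, avail, 4, targets, 4);
    for (int i = 0; i < n; i++) {
        printf("candidate node: %s\n", targets[i]->name);
    }
    return 0;
}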
>>
>>>
>>> Thanks,
>>> george.
>>>
>>> On Feb 19, 2010, at 12:59, Ralph Castain wrote:
>>>
>>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>>>>
>>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
>>>>
>>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>>>
>>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28
>>>>
>>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>>>
>>>> ======================================================================
>>>>
>>>> BACKGROUND:
>>>>
>>>> Josh and Ralph have been working on a private branch off of the trunk on extended fault recovery procedures, mostly impacting ORTE. The new code optionally allows ORTE to recover from failed nodes, moving processes to other nodes in order to maintain operation. In addition, the code provides better support for recovering from individual process failures.
>>>>
>>>> Not all of the work done on the private branch will be brought over in this commit. Some of the MPI-specific code that allows recovery from process failure on the fly will be committed separately at a later date.
>>>>
>>>> This commit will include the infrastructure to support those advanced recovery operations. Among other things, it will introduce a new "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies for responding to failures. The default module, called "ignore", will stabilize the runtime environment for other RecoS components. In the absence of other RecoS components, it will cause the default behavior (aborting the job) to be executed.
>>>>
>>>> This branch includes some configure modifications that allow a comma-separated list of options to be passed to the '--with-ft' option. This allows us to enable any combination of 'cr' and 'recos' at build time, specifically so that the RecoS functionality can be enabled independently of the C/R functionality. Most of the changes outside of the ORTE layer are due to symbol cleanup resulting from this modification.
>>>>
>>>> For example, C/R-specific code paths were previously incorrectly marked with:
>>>> #if OPAL_ENABLE_FT == 1
>>>> They are now marked, where appropriate, as:
>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>
>>>> Additionally, C/R-specific components have modified configure.m4 files to change:
>>>> AS_IF([test "$ompi_want_ft" = "0"],
>>>> to:
>>>> AS_IF([test "$ompi_want_ft_cr" = "0"],
>>>>
>>>> We have created a public repo (reference branch, above) with the code to be merged into the trunk. Please feel free to check it out and test it.
>>>>
>>>> NOTE: the new recovery capability is only active if...
>>>> (a) you configure --with-ft=recos, and
>>>> (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>>>
>>>> Comments, suggestions, and corrections are welcome!
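
(For anyone wanting to try this, the steps in that NOTE boil down to roughly the following; the application name and process count are placeholders, and only the --with-ft value and the OMPI_MCA_recos_base_enable variable come from the RFC itself.)

./configure --with-ft=recos        # or --with-ft=cr,recos to enable both C/R and RecoS
make all install

export OMPI_MCA_recos_base_enable=1   # recovery stays inactive unless this is set
mpirun -np 4 ./my_app                 # hypothetical application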
