Hi George et al,

I have begun documenting the RecoS operation on the OMPI wiki:
https://svn.open-mpi.org/trac/ompi/wiki/RecoS

I'll continue to work on this over the next few days by adding a section
explaining what was changed outside of the new framework to make it all work.
In addition, I am revising the recos.h API documentation.

Hope to have all that done over the weekend.

On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:

>
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>
>> Ralph, Josh,
>>
>> We have some comments about the API of the new framework, mostly
>> clarifications needed to better understand how this new framework is
>> supposed to be used. And a request for a deadline extension, to delay the
>> code merge from the Recos branch into the trunk by a week.
>>
>> We have our own FT branch, with a totally different approach than what is
>> described in your RFC. Unfortunately, it diverged from the trunk about a
>> year ago, and merging back has proven to be quite a difficult task. Some of
>> the functionality in the Recos framework is clearly beneficial for what we
>> did, and has the potential to facilitate the porting of most of the features
>> from our branch back into the trunk. We would like the deadline extension in
>> order to deeply analyze the impact of the Recos framework on our work, and
>> see how we can fit everything together back in the trunk of Open MPI.
>
> No problem with the extension - feel free to suggest modifications to make
> the merge easier. This is by no means cast in stone, but rather a starting
> point.
>
>>
>> Here are some comments about the code:
>>
>> 1. The documentation in recos.h is not very clear. Most of the functions use
>> only IN arguments, and are not supposed to return any values. We don't see
>> how the functions are supposed to be used, and what is supposed to be their
>> impact on the ORTE framework data.
>
> I'll try to clarify the comments tonight (I know Josh is occupied right now).
> The recos APIs are called from two locations:
>
> 1. The errmgr calls recos whenever it receives a report of an aborted process
> (via the errmgr.proc_aborted API). The idea was for recos to determine what
> (if anything) to do about the failed process.
>
> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a
> list of suggested nodes for the process that is to be restarted. At the
> moment, only the resilient mapper module does this. However, Josh and I are
> looking at reorganizing some functionality currently in that mapper module
> and making all of the existing mappers be "resilient".
>
> So basically, the recos modules determine the recovery procedure and execute
> it. For example, in the "orcm" module, we actually update the various
> proc/job objects to prep them for restart and call plm.spawn from within that
> module. If instead you use the ignore module, it falls through to the recos
> base functions which call "abort" to kill the job. Again, the action is taken
> local to recos, so nothing need be returned.
>
> The functions generally don't return values (other than success/error)
> because we couldn't think of anything useful to return to the errmgr.
> Whatever recos does about an aborted proc, the errmgr doesn't do anything
> further - if you look in that code, you'll see that if recos is enabled, all
> the errmgr does is call recos and return.
>
> Again, this can be changed if desired.
>
>>
>> 2. Why do we have all the char***? Why are they only declared as IN
>> arguments?
>
> I take it you mean in the predicted fault API? I believe Josh was including
> that strictly as a placeholder. As you undoubtedly recall, I removed the fddp
> framework from the trunk (devel continues off-line), so Josh wasn't sure what
> I might want to input here. If you look at the modules themselves, you will
> see the implementation is essentially empty at this time.
>
> We had discussed simply removing that API for now until we determined if/when
> fault prediction would return to the OMPI trunk.
> It was kind of a tossup - so we left it for now. Could just as easily be
> removed until a later date - either way is fine with us.
>
>>
>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as an
>> IN/OUT argument. Why? If the list is modified, then we have a scalability
>> problem, as the list will have to be rebuilt before each call.
>
> Looking...looking...hmm.
>
> typedef int (*orte_recos_base_process_fault_fn_t)
>     (orte_job_t *jdata, orte_process_name_t *proc_name,
>      orte_proc_state_t state, int *stack_state);
>
> There is no node list, or list of any type, going in or out of that function.
> I suspect you meant the one below it:
>
> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
>
> I concur with your concern about scalability here. However, I believe the
> idea was that we would pass in the proc that failed and is to be restarted, a
> pointer to the node it was last on, and return a list of candidate nodes
> where it could be restarted. Essentially, this is the equivalent of building
> the target node list that we do in the mappers whenever we map a job.
>
> So in the implementation, we use the rmaps base function to assemble the
> target node list for the app, and then go through some logic (e.g., remove
> the old node, look at fault groups and load balancing) to prune the list
> down. We then pass the resulting list back to the caller.
>
> If we are going to have frequent process failures, then rebuilding the
> candidate node list every time would indeed be a problem. I suspect we'll
> have to revisit that implementation at some point.
>
> HTH
> Ralph
>
>>
>> Thanks,
>> george.
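
For anyone following the suggest_map_targets discussion above, here is a rough
sketch of the pruning step: start from a full candidate list, then drop the
node the proc was last on. It is not the code in the branch. The
sketch_node_t type, the fault_group field, and the rule "also skip nodes in
the old node's fault group" are assumptions made for illustration, a plain
array stands in for opal_list_t, and load balancing is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for orte_node_t; the real ORTE structure differs. */
typedef struct {
    const char *name;
    int fault_group;   /* illustrative fault-group id (assumption) */
} sketch_node_t;

/*
 * Prune a candidate array in place. A node is dropped if it is the node the
 * process last ran on, or (assumed policy) if it shares that node's fault
 * group. Returns the number of candidates left at the front of the array.
 */
size_t sketch_prune_targets(sketch_node_t *nodes, size_t count,
                            const sketch_node_t *oldnode)
{
    assert(nodes != NULL || count == 0);
    assert(oldnode != NULL);

    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        int same_node  = strcmp(nodes[i].name, oldnode->name) == 0;
        int same_group = nodes[i].fault_group == oldnode->fault_group;
        if (!same_node && !same_group) {
            nodes[kept++] = nodes[i];   /* compact survivors to the front */
        }
    }
    return kept;
}
```

The sketch also shows the scalability concern: the candidate array is
consumed by the pruning, so it has to be rebuilt for every failed proc.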
>>
>> On Feb 19, 2010, at 12:59 , Ralph Castain wrote:
>>
>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>>>
>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery
>>> capabilities
>>>
>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>>
>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb
>>> 27-28
>>>
>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>>
>>> ======================================================================
>>>
>>> BACKGROUND:
>>>
>>> Josh and Ralph have been working on a private branch off of the trunk on
>>> extended fault recovery procedures, mostly impacting ORTE. The new code
>>> optionally allows ORTE to recover from failed nodes, moving processes to
>>> other nodes in order to maintain operation. In addition, the code provides
>>> better support for recovering from individual process failures.
>>>
>>> Not all of the work done on the private branch will be brought over in this
>>> commit. Some of the MPI-specific code that allows recovery from process
>>> failure on-the-fly will be committed separately at a later date.
>>>
>>> This commit will include the infrastructure to support those advanced
>>> recovery operations. Among other things, this commit will introduce a new
>>> "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies
>>> for responding to failures. The default module, called "ignore", will
>>> stabilize the runtime environment for other RecoS components. In the
>>> absence of other RecoS components it will trigger the default behavior
>>> (abort the job) to be executed.
>>>
>>> This branch includes some configure modifications that allow a comma
>>> separated list of options to be passed to the '--with-ft' option.
>>> This allows us to enable any combination of 'cr' and 'recos' at build time,
>>> specifically so that the RecoS functionality can be enabled independently
>>> of the C/R functionality. Most of the changes outside of the ORTE layer are
>>> due to symbol cleanup resulting from this modification.
>>>
>>> For example, C/R specific code paths were previously incorrectly marked
>>> with:
>>>   #if OPAL_ENABLE_FT == 1
>>> They are now marked as, where appropriate:
>>>   #if OPAL_ENABLE_FT_CR == 1
>>>
>>> Additionally, C/R specific components have modified configure.m4 files to
>>> change:
>>>   AS_IF([test "$ompi_want_ft" = "0"],
>>> to:
>>>   AS_IF([test "$ompi_want_ft_cr" = "0"],
>>>
>>> We have created a public repo (reference branch, above) with the code to be
>>> merged into the trunk. Please feel free to check it out and test it.
>>>
>>> NOTE: the new recovery capability is only active if...
>>> (a) you configure --with-ft=recos, and
>>> (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>>
>>> Comments, suggestions, and corrections are welcome!
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
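
As a rough picture of the control flow discussed in this thread, the sketch
below shows the errmgr handing an aborted proc to the selected recos module,
with the "ignore" module falling through to a base function that aborts the
job. It is not the recos.h API. Every sketch_* name, the action enum, and the
"restart" module are hypothetical. The recos_enabled flag stands in for the
recos_base_enable MCA parameter. Treating "recos disabled" as the same abort
path is my assumption.

```c
#include <assert.h>
#include <stddef.h>

enum { SKETCH_SUCCESS = 0 };

/* What the recovery layer decided to do (illustrative only). */
typedef enum {
    SKETCH_ACTION_NONE,
    SKETCH_ACTION_ABORT_JOB,
    SKETCH_ACTION_RESTART_PROC
} sketch_action_t;

/* Hypothetical per-module callback; returns only success/error. */
typedef int (*sketch_process_fault_fn_t)(int proc_rank,
                                         sketch_action_t *action);

typedef struct {
    const char *name;
    sketch_process_fault_fn_t process_fault;  /* NULL: defer to the base */
} sketch_recos_module_t;

/* Base behavior: abort the job. */
static int sketch_base_process_fault(int proc_rank, sketch_action_t *action)
{
    (void)proc_rank;
    *action = SKETCH_ACTION_ABORT_JOB;
    return SKETCH_SUCCESS;
}

/* A module that recovers by restarting the failed proc. */
static int sketch_restart_process_fault(int proc_rank,
                                        sketch_action_t *action)
{
    (void)proc_rank;
    *action = SKETCH_ACTION_RESTART_PROC;
    return SKETCH_SUCCESS;
}

const sketch_recos_module_t sketch_ignore_module  = { "ignore", NULL };
const sketch_recos_module_t sketch_restart_module = {
    "restart", sketch_restart_process_fault
};

/* errmgr side: call recos and return; the action is taken inside recos. */
int sketch_errmgr_proc_aborted(const sketch_recos_module_t *module,
                               int recos_enabled, int proc_rank,
                               sketch_action_t *action)
{
    assert(action != NULL);
    if (!recos_enabled || module == NULL || module->process_fault == NULL) {
        return sketch_base_process_fault(proc_rank, action);
    }
    return module->process_fault(proc_rank, action);
}
```

The action out-parameter exists only so the sketch can be checked. As
discussed above, the real functions return nothing beyond success/error
because the errmgr does nothing further with the result.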