Hi Ralph,

The "composite framework" idea is very interesting. Regarding the schema shown in the picture, I didn't understand RecoS's behaviour in a node-failure situation.
In this case, will mpirun treat the daemon failure as a normal proc failure? If so, should mpirun update the global proc states for all jobs running under the failed daemon?

Best regards,
Leonardo

On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote:

> Hi George et al
>
> I have begun documenting the RecoS operation on the OMPI wiki:
>
> https://svn.open-mpi.org/trac/ompi/wiki/RecoS
>
> I'll continue to work on this over the next few days by adding a section explaining what was changed outside of the new framework to make it all work. In addition, I am revising the recos.h API documentation.
>
> Hope to have all that done over the weekend.
>
>
> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:
>
>>
>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>>
>>> Ralph, Josh,
>>>
>>> We have some comments about the API of the new framework, mostly clarifications needed to better understand how this new framework is supposed to be used, and a request for a deadline extension to delay the code merge from the RecoS branch into the trunk by a week.
>>>
>>> We have our own FT branch, with a totally different approach from what is described in your RFC. Unfortunately, it diverged from the trunk about a year ago, and merging back has proven to be quite a difficult task. Some of the functionality in the RecoS framework is clearly beneficial for what we did, and has the potential to facilitate porting most of the features from our branch back into the trunk. We would like the deadline extension in order to analyze in depth the impact of the RecoS framework on our work, and to see how we can fit everything together back in the Open MPI trunk.
>>
>> No problem with the extension - feel free to suggest modifications to make the merge easier. This is by no means cast in stone, but rather a starting point.
>>
>>>
>>> Here are some comments about the code:
>>>
>>> 1. The documentation in recos.h is not very clear. Most of the functions use only IN arguments and are not supposed to return any values. We don't see how the functions are supposed to be used, or what their impact on the ORTE framework data is supposed to be.
>>
>> I'll try to clarify the comments tonight (I know Josh is occupied right now). The recos APIs are called from two locations:
>>
>> 1. The errmgr calls recos whenever it receives a report of an aborted process (via the errmgr.proc_aborted API). The idea was for recos to determine what (if anything) to do about the failed process.
>>
>> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a list of suggested nodes for the process that is to be restarted. At the moment, only the resilient mapper module does this. However, Josh and I are looking at reorganizing some functionality currently in that mapper module and making all of the existing mappers "resilient".
>>
>> So basically, the recos modules determine the recovery procedure and execute it. For example, in the "orcm" module, we actually update the various proc/job objects to prep them for restart and call plm.spawn from within that module. If instead you use the ignore module, it falls through to the recos base functions, which call "abort" to kill the job. Again, the action is taken locally within recos, so nothing needs to be returned.
>>
>> The functions generally don't return values (other than success/error) because we couldn't think of anything useful to return to the errmgr. Whatever recos does about an aborted proc, the errmgr doesn't do anything further - if you look in that code, you'll see that if recos is enabled, all the errmgr does is call recos and return.
>>
>> Again, this can be changed if desired.
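
(As an aside, a minimal sketch of the errmgr -> recos hand-off described above might look like the following. Every type and symbol name below is a stand-in for illustration, not the actual ORTE code; only the errmgr.proc_aborted/recos flow follows the description.)

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for orte_process_name_t / orte_proc_state_t */
typedef struct { int jobid; int vpid; } proc_name_t;
typedef int proc_state_t;

/* recos entry point: decide and execute the recovery action.  An "orcm"-style
 * module would prep the proc/job objects and call plm.spawn to restart; the
 * "ignore" module falls through to the base function, which aborts the job. */
static int recos_process_fault(proc_name_t *proc, proc_state_t state)
{
    printf("recos: handling fault of proc %d.%d (state %d)\n",
           proc->jobid, proc->vpid, state);
    return 0;  /* only success/error - nothing useful to hand back */
}

/* errmgr.proc_aborted: when recos is enabled, just call recos and return */
static void errmgr_proc_aborted(proc_name_t *proc, proc_state_t state,
                                bool recos_enabled)
{
    if (recos_enabled) {
        (void) recos_process_fault(proc, state);
        return;  /* the errmgr takes no further action */
    }
    printf("errmgr: no recovery enabled, aborting job of proc %d.%d\n",
           proc->jobid, proc->vpid);
}

int main(void)
{
    proc_name_t failed = { 1, 3 };
    errmgr_proc_aborted(&failed, /* state */ -1, /* recos_enabled */ true);
    return 0;
}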
>>
>>>
>>> 2. Why do we have all the char***? Why are they only declared as IN arguments?
>>
>> I take it you mean in the predicted-fault API? I believe Josh was including that strictly as a placeholder. As you undoubtedly recall, I removed the fddp framework from the trunk (devel continues off-line), so Josh wasn't sure what I might want to input here. If you look at the modules themselves, you will see the implementation is essentially empty at this time.
>>
>> We had discussed simply removing that API for now until we determined if/when fault prediction would return to the OMPI trunk. It was kind of a tossup - so we left it for now. It could just as easily be removed until a later date - either way is fine with us.
>>
>>>
>>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as an IN/OUT argument. Why? If the list is modified, then we have a scalability problem, as the list will have to be rebuilt before each call.
>>
>> Looking...looking...hmm.
>>
>> typedef int (*orte_recos_base_process_fault_fn_t)
>>     (orte_job_t *jdata, orte_process_name_t *proc_name, orte_proc_state_t state, int *stack_state);
>>
>> There is no node list, or list of any type, going in or out of that function. I suspect you meant the one below it:
>>
>> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
>>
>> I concur with your concern about scalability here. However, I believe the idea was that we would pass in the proc that failed and is to be restarted, plus a pointer to the node it was last on, and return a list of candidate nodes where it could be restarted. Essentially, this is the equivalent of building the target node list that we do in the mappers whenever we map a job.
>>
>> So in the implementation, we use the rmaps base function to assemble the target node list for the app, and then go through some logic (e.g., remove the old node, look at fault groups and load balancing) to prune the list down. We then pass the resulting list back to the caller.
>>
>> If we are going to have frequent process failures, then rebuilding the candidate node list every time would indeed be a problem. I suspect we'll have to revisit that implementation at some point.
>>
>> HTH
>> Ralph
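
(Again purely to illustrate, here is a rough sketch of that suggest_map_targets flow: pass in the failed proc's old node and get back a pruned candidate list. The node_t type, the fault_group field, and the pruning rules are all stand-ins for what the real code does with orte_node_t/opal_list_t; rebuilding this list on every failure is exactly where the scalability concern above comes from.)

#include <stdio.h>
#include <string.h>

/* stand-in for orte_node_t: just a name and a fault-group id */
typedef struct { const char *name; int fault_group; } node_t;

/* Assemble the candidate list as the rmaps base would, then prune it:
 * drop the node the proc last ran on and anything sharing its fault group. */
static int suggest_map_targets(const node_t *oldnode,
                               const node_t *avail, int navail,
                               const node_t **out, int max_out)
{
    int nout = 0;
    for (int i = 0; i < navail && nout < max_out; i++) {
        if (0 == strcmp(avail[i].name, oldnode->name)) continue;    /* remove the old node   */
        if (avail[i].fault_group == oldnode->fault_group) continue; /* avoid its fault group */
        out[nout++] = &avail[i];
    }
    return nout;  /* number of candidate nodes for the restart */
}

int main(void)
{
    node_t avail[] = { {"n0", 0}, {"n1", 0}, {"n2", 1}, {"n3", 2} };
    node_t oldnode = { "n1", 0 };
    const node_t *targets[4];
    int n = suggest_map_targets(&oldnode, avail, 4, targets, 4);
    for (int i = 0; i < n; i++) {
        printf("candidate node: %s\n", targets[i]->name);
    }
    return 0;
}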
>>
>>>
>>> Thanks,
>>> george.
>>>
>>> On Feb 19, 2010, at 12:59, Ralph Castain wrote:
>>>
>>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>>>>
>>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
>>>>
>>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>>>
>>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28
>>>>
>>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>>>
>>>> ======================================================================
>>>>
>>>> BACKGROUND:
>>>>
>>>> Josh and Ralph have been working on a private branch off of the trunk on extended fault recovery procedures, mostly impacting ORTE. The new code optionally allows ORTE to recover from failed nodes, moving processes to other nodes in order to maintain operation. In addition, the code provides better support for recovering from individual process failures.
>>>>
>>>> Not all of the work done on the private branch will be brought over in this commit. Some of the MPI-specific code that allows recovery from process failure on the fly will be committed separately at a later date.
>>>>
>>>> This commit will include the infrastructure to support those advanced recovery operations. Among other things, it will introduce a new "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies for responding to failures. The default module, called "ignore", will stabilize the runtime environment for other RecoS components. In the absence of other RecoS components, it will cause the default behavior (aborting the job) to be executed.
>>>>
>>>> This branch includes some configure modifications that allow a comma-separated list of options to be passed to the '--with-ft' option. This allows us to enable any combination of 'cr' and 'recos' at build time, specifically so that the RecoS functionality can be enabled independently of the C/R functionality. Most of the changes outside of the ORTE layer are due to symbol cleanup resulting from this modification.
>>>>
>>>> For example, C/R-specific code paths were previously incorrectly marked with:
>>>> #if OPAL_ENABLE_FT == 1
>>>> They are now marked, where appropriate, as:
>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>
>>>> Additionally, C/R-specific components have modified configure.m4 files to change:
>>>> AS_IF([test "$ompi_want_ft" = "0"],
>>>> to:
>>>> AS_IF([test "$ompi_want_ft_cr" = "0"],
>>>>
>>>> We have created a public repo (reference branch, above) with the code to be merged into the trunk. Please feel free to check it out and test it.
>>>>
>>>> NOTE: the new recovery capability is only active if...
>>>> (a) you configure --with-ft=recos, and
>>>> (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>>>
>>>> Comments, suggestions, and corrections are welcome!
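
(For anyone wanting to try this, the steps in that NOTE boil down to roughly the following; the application name and process count are placeholders, and only the --with-ft value and the OMPI_MCA_recos_base_enable variable come from the RFC itself.)

./configure --with-ft=recos        # or --with-ft=cr,recos to enable both C/R and RecoS
make all install

export OMPI_MCA_recos_base_enable=1   # recovery stays inactive unless this is set
mpirun -np 4 ./my_app                 # hypothetical application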
