On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote:

> On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
>
>> Hi Ralph,
>>
>> The "composite framework" idea is very interesting.
>
> Josh is the force behind that idea :-) It solves a pretty interesting little problem. Its utility will really shine when I move the new components into place in the coming weeks/month.
>
>> Regarding the schema represented by the picture, I didn't understand the RecoS behaviour in a node failure situation.
>>
>> In this case, will mpirun consider the daemon failure as a normal proc failure? If so, should mpirun update the global proc state for all jobs running under the failed daemon?
>
> I haven't included the node failure case yet - still on my "to-do" list. In brief, the answer is yes/no. :-)
>
> Daemon failure follows the same code path as shown in the flow chart. However, it is up to the individual modules to determine a response to that failure. The "orcm" RecoS module's response is to (a) mark all procs on that node as having failed, (b) mark that node as "down" so it won't get reused, and (c) remap and restart all such procs on the remaining available nodes, starting new daemon(s) as required.
>
> In the orcm environment, nodes that are replaced or rebooted automatically start their own daemon. This is detected by orcm, and the node state (if the node is rebooted) will automatically be updated to "up" - if it is a new node, it is automatically added to the available resources. This allows the node to be reused once the problem has been corrected. In other environments (ssh, slurm, etc.), the node is simply left as "down", as there is no way to know if/when the node becomes available again.
>
> If you aren't using the "orcm" module, then the default behavior will abort the job.

Just to echo this response. The orted and process failures use the same error path, but can be easily differentiated by their jobids. The 'orcm' component is a good example of differentiating these two fault scenarios to correctly recover the ORTE job. Soon we may/should/will have the same ability with certain MPI jobs. :)

-- Josh
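A minimal sketch of the jobid test described above, assuming it runs in mpirun/the HNP, whose own jobid matches that of its daemons. The helper name is hypothetical; orte_process_name_t and ORTE_PROC_MY_NAME are the standard ORTE identifiers (header path from memory):

    #include <stdbool.h>
    #include "orte/runtime/orte_globals.h"  /* ORTE_PROC_MY_NAME, orte_process_name_t */

    /* Sketch only: a daemon (orted) shares the jobid of mpirun/the HNP,
     * so a failed proc whose jobid matches our own job is a daemon
     * failure; any other jobid is an application process failure. */
    static bool failure_is_daemon(const orte_process_name_t *proc)
    {
        return (proc->jobid == ORTE_PROC_MY_NAME->jobid);
    }

A RecoS module could branch on such a test to decide whether to recover all of the procs that were running under the failed daemon or just the single failed proc.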
>> Best regards,
>> Leonardo
>>
>> On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote:
>>
>>> Hi George et al
>>>
>>> I have begun documenting the RecoS operation on the OMPI wiki:
>>>
>>> https://svn.open-mpi.org/trac/ompi/wiki/RecoS
>>>
>>> I'll continue to work on this over the next few days by adding a section explaining what was changed outside of the new framework to make it all work. In addition, I am revising the recos.h API documentation.
>>>
>>> Hope to have all that done over the weekend.
>>>
>>> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:
>>>
>>>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>>>>
>>>>> Ralph, Josh,
>>>>>
>>>>> We have some comments about the API of the new framework, mostly clarifications needed to better understand how this new framework is supposed to be used. And a request for a deadline extension, to delay the merge of the RecoS branch into the trunk by a week.
>>>>>
>>>>> We have our own FT branch, with a totally different approach than what is described in your RFC. Unfortunately, it diverged from the trunk about a year ago, and merging back has proven to be quite a difficult task. Some of the functionality in the RecoS framework is clearly beneficial for what we did, and has the potential to facilitate porting most of the features from our branch back into the trunk. We would like the deadline extension in order to deeply analyze the impact of the RecoS framework on our work, and see how we can fit everything together back in the trunk of Open MPI.
>>>>
>>>> No problem with the extension - feel free to suggest modifications to make the merge easier. This is by no means cast in stone, but rather a starting point.
>>>>
>>>>> Here are some comments about the code:
>>>>>
>>>>> 1. The documentation in recos.h is not very clear. Most of the functions use only IN arguments, and are not supposed to return any values. We don't see how the functions are supposed to be used, and what their impact on the ORTE framework data is supposed to be.
>>>>
>>>> I'll try to clarify the comments tonight (I know Josh is occupied right now). The recos APIs are called from two locations:
>>>>
>>>> 1. The errmgr calls recos whenever it receives a report of an aborted process (via the errmgr.proc_aborted API). The idea was for recos to determine what (if anything) to do about the failed process.
>>>>
>>>> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a list of suggested nodes for the process that is to be restarted. At the moment, only the resilient mapper module does this. However, Josh and I are looking at reorganizing some functionality currently in that mapper module and making all of the existing mappers "resilient".
>>>>
>>>> So basically, the recos modules determine the recovery procedure and execute it. For example, in the "orcm" module, we actually update the various proc/job objects to prep them for restart and call plm.spawn from within that module. If instead you use the ignore module, it falls through to the recos base functions, which call "abort" to kill the job. Again, the action is taken local to recos, so nothing needs to be returned.
>>>>
>>>> The functions generally don't return values (other than success/error) because we couldn't think of anything useful to return to the errmgr. Whatever recos does about an aborted proc, the errmgr doesn't do anything further - if you look in that code, you'll see that if recos is enabled, all the errmgr does is call recos and return.
>>>>
>>>> Again, this can be changed if desired.
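A minimal sketch of the hand-off described in point 1 above, under the stated behavior that the errmgr simply defers to the active RecoS module and returns. The function name, the orte_recos.process_fault module symbol, and the recos header path are assumptions patterned on the recos.h typedef quoted further down in this thread; orte_get_job_data_object() is the standard ORTE lookup.

    #include "orte/runtime/orte_globals.h"   /* orte_job_t, orte_get_job_data_object() */
    #include "orte/mca/recos/recos.h"        /* assumed header for the new framework */

    /* Sketch only: a report of an aborted process arrives at the errmgr,
     * which hands it straight to RecoS and takes no further action. */
    static int proc_aborted(orte_process_name_t *name, orte_proc_state_t state)
    {
        int stack_state = 0;
        orte_job_t *jdata = orte_get_job_data_object(name->jobid);

        if (NULL == jdata) {
            return ORTE_ERR_NOT_FOUND;
        }

        /* RecoS decides whether to restart, remap, or abort the job */
        return orte_recos.process_fault(jdata, name, state, &stack_state);
    }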
>>>>> 2. Why do we have all the char***? Why are they only declared as IN arguments?
>>>>
>>>> I take it you mean in the predicted fault API? I believe Josh was including that strictly as a placeholder. As you undoubtedly recall, I removed the fddp framework from the trunk (devel continues off-line), so Josh wasn't sure what I might want to input here. If you look at the modules themselves, you will see the implementation is essentially empty at this time.
>>>>
>>>> We had discussed simply removing that API for now until we determined if/when fault prediction would return to the OMPI trunk. It was kind of a tossup - so we left it for now. It could just as easily be removed until a later date - either way is fine with us.
>>>>
>>>>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as an IN/OUT argument. Why? If the list is modified, then we have a scalability problem, as the list will have to be rebuilt before each call.
>>>>
>>>> Looking...looking...hmm.
>>>>
>>>> typedef int (*orte_recos_base_process_fault_fn_t)
>>>>     (orte_job_t *jdata, orte_process_name_t *proc_name,
>>>>      orte_proc_state_t state, int *stack_state);
>>>>
>>>> There is no node list, or list of any type, going in or out of that function. I suspect you meant the one below it:
>>>>
>>>> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>>>>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
>>>>
>>>> I concur with your concern about scalability here. However, I believe the idea was that we would pass in the proc that failed and is to be restarted, plus a pointer to the node it was last on, and return a list of candidate nodes where it could be restarted. Essentially, this is the equivalent of building the target node list that we do in the mappers whenever we map a job.
>>>>
>>>> So in the implementation, we use the rmaps base function to assemble the target node list for the app, and then go through some logic (e.g., remove the old node, look at fault groups and load balancing) to prune the list down. We then pass the resulting list back to the caller.
>>>>
>>>> If we are going to have frequent process failures, then rebuilding the candidate node list every time would indeed be a problem. I suspect we'll have to revisit that implementation at some point.
>>>>
>>>> HTH
>>>> Ralph
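A minimal sketch of the pruning step outlined above, in the shape of the suggest_map_targets typedef just quoted: given the candidate list assembled by the rmaps base, drop the node the proc last ran on so the restart is not placed there. The body is illustrative only; fault-group and load-balancing filters would follow the same loop pattern.

    #include <string.h>
    #include "opal/class/opal_list.h"
    #include "orte/constants.h"              /* ORTE_SUCCESS */
    #include "orte/runtime/orte_globals.h"   /* orte_proc_t, orte_node_t */

    /* Sketch only: walk the candidate node list and remove the node the
     * proc failed on so the mapper will not choose it again. */
    static int suggest_map_targets(orte_proc_t *proc, orte_node_t *oldnode,
                                   opal_list_t *node_list)
    {
        opal_list_item_t *item, *next;

        (void)proc;  /* fault-group/load-balance logic would inspect the proc here */

        for (item = opal_list_get_first(node_list);
             item != opal_list_get_end(node_list);
             item = next) {
            orte_node_t *node = (orte_node_t *)item;
            next = opal_list_get_next(item);

            /* drop the old node from the candidate list */
            if (NULL != oldnode && NULL != node->name &&
                0 == strcmp(node->name, oldnode->name)) {
                opal_list_remove_item(node_list, item);
                OBJ_RELEASE(item);
            }
        }
        return ORTE_SUCCESS;
    }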
>>>>>
>>>>> Thanks,
>>>>> george.
>>>>>
>>>>> On Feb 19, 2010, at 12:59, Ralph Castain wrote:
>>>>>
>>>>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>>>>>>
>>>>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
>>>>>>
>>>>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>>>>>
>>>>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28
>>>>>>
>>>>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>>>>>
>>>>>> ======================================================================
>>>>>>
>>>>>> BACKGROUND:
>>>>>>
>>>>>> Josh and Ralph have been working on a private branch off of the trunk on extended fault recovery procedures, mostly impacting ORTE. The new code optionally allows ORTE to recover from failed nodes, moving processes to other nodes in order to maintain operation. In addition, the code provides better support for recovering from individual process failures.
>>>>>>
>>>>>> Not all of the work done on the private branch will be brought over in this commit. Some of the MPI-specific code that allows recovery from process failure on-the-fly will be committed separately at a later date.
>>>>>>
>>>>>> This commit will include the infrastructure to support those advanced recovery operations. Among other things, this commit will introduce a new "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies for responding to failures. The default module, called "ignore", will stabilize the runtime environment for other RecoS components. In the absence of other RecoS components, it will trigger the default behavior (abort the job).
>>>>>>
>>>>>> This branch includes some configure modifications that allow a comma-separated list of options to be passed to the '--with-ft' option. This allows us to enable any combination of 'cr' and 'recos' at build time, specifically so that the RecoS functionality can be enabled independently of the C/R functionality. Most of the changes outside of the ORTE layer are due to symbol cleanup resulting from this modification.
>>>>>>
>>>>>> For example, C/R specific code paths were previously incorrectly marked with:
>>>>>>     #if OPAL_ENABLE_FT == 1
>>>>>> They are now marked, where appropriate, as:
>>>>>>     #if OPAL_ENABLE_FT_CR == 1
>>>>>>
>>>>>> Additionally, C/R specific components have modified configure.m4 files to change:
>>>>>>     AS_IF([test "$ompi_want_ft" = "0"],
>>>>>> to:
>>>>>>     AS_IF([test "$ompi_want_ft_cr" = "0"],
>>>>>>
>>>>>> We have created a public repo (reference branch, above) with the code to be merged into the trunk. Please feel free to check it out and test it.
>>>>>>
>>>>>> NOTE: the new recovery capability is only active if...
>>>>>> (a) you configure --with-ft=recos, and
>>>>>> (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>>>>>
>>>>>> Comments, suggestions, and corrections are welcome!