On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote:

> 
> On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
> 
>> Hi Ralph,
>> 
>> The "composite framework" idea is very interesting.
> 
> Josh is the force behind that idea :-)

It solves a pretty interesting little problem. Its utility will really shine 
when I move the new components into place in the coming weeks/month.

> 
>> Regarding the scheme represented in the picture, I didn't understand 
>> RecoS's behaviour in a node-failure situation.
>> 
>> In this case, will mpirun treat the daemon failure as a normal proc 
>> failure? If so, should mpirun update the global proc states for all jobs 
>> running under the failed daemon?
> 
> I haven't included the node failure case yet - still on my "to-do" list. In 
> brief, the answer is yes/no. :-)
> 
> Daemon failure follows the same code path as shown in the flow chart. 
> However, it is up to the individual modules to determine a response to that 
> failure. The "orcm" RecoS module response is to (a) mark all procs on that 
> node as having failed, (b) mark that node as "down" so it won't get reused, 
> and (c) remap and restart all such procs on the remaining available nodes, 
> starting new daemon(s) as required.
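> 
> As a rough illustration, that response boils down to something like the 
> sketch below (the helper name in step (c) is hypothetical, not the actual 
> orcm code; the usual ORTE headers such as orte/runtime/orte_globals.h are 
> assumed):
> 
>     /* hypothetical sketch of a daemon-failure response */
>     static int handle_daemon_failure(orte_node_t *node)
>     {
>         orte_proc_t *proc;
>         int i;
> 
>         /* (a) mark every proc that was running on the node as failed */
>         for (i = 0; i < node->procs->size; i++) {
>             proc = (orte_proc_t*)opal_pointer_array_get_item(node->procs, i);
>             if (NULL == proc) {
>                 continue;
>             }
>             proc->state = ORTE_PROC_STATE_ABORTED;
>         }
> 
>         /* (b) mark the node "down" so the mapper will not reuse it */
>         node->state = ORTE_NODE_STATE_DOWN;
> 
>         /* (c) remap the failed procs and relaunch them, starting new
>          * daemons if the new map requires it */
>         return remap_and_restart_procs(node);   /* hypothetical helper */
>     }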
> 
> In the orcm environment, nodes that are replaced or rebooted automatically 
> start their own daemon. This is detected by orcm, and the node state (if the 
> node is rebooted) will automatically be updated to "up" - if it is a new 
> node, it is automatically added to the available resources. This allows the 
> node to be reused once the problem has been corrected. In other environments 
> (ssh, slurm, etc), the node is simply left as "down" as there is no way to 
> know if/when the node becomes available again.
> 
> If you aren't using the "orcm" module, then the default behavior will abort 
> the job.

Just to echo this response: orted and process failures use the same error 
path, but can easily be differentiated by their jobids. The 'orcm' component is 
a good example of differentiating these two fault scenarios to correctly 
recover the ORTE job. Soon we may/should/will have the same ability with 
certain MPI jobs. :)
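
For reference, a minimal sketch of that jobid test (illustrative only; it 
relies on the fact that the daemons share mpirun's jobid, and the handler 
names are hypothetical):

    /* sketch: classify a reported failure by jobid */
    if (proc_name->jobid == ORTE_PROC_MY_NAME->jobid) {
        /* a daemon (orted) failed */
        rc = respond_to_daemon_failure(proc_name);   /* hypothetical */
    } else {
        /* an application process failed */
        rc = respond_to_proc_failure(proc_name);     /* hypothetical */
    }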

-- Josh

> 
> 
>> 
>> Best regards,
>> Leonardo
>> 
>> On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote:
>> 
>>> Hi George et al
>>> 
>>> I have begun documenting the RecoS operation on the OMPI wiki:
>>> 
>>> https://svn.open-mpi.org/trac/ompi/wiki/RecoS
>>> 
>>> I'll continue to work on this over the next few days by adding a section 
>>> explaining what was changed outside of the new framework to make it all 
>>> work. In addition, I am revising the recos.h API documentation.
>>> 
>>> Hope to have all that done over the weekend.
>>> 
>>> 
>>> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:
>>> 
>>>> 
>>>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>>>> 
>>>>> Ralph, Josh,
>>>>> 
>>>>> We have some comments about the API of the new framework, mostly 
>>>>> clarifications needed to better understand how this new framework is 
>>>>> supposed to be used. And a request for a deadline extension, to delay the 
>>>>> code merge from the Recos branch in the trunk by a week.
>>>>> 
>>>>> We have our own FT branch, with a totally different approach from what is 
>>>>> described in your RFC. Unfortunately, it diverged from the trunk about a 
>>>>> year ago, and merging back has proven to be quite a difficult task. Some 
>>>>> of the functionality in the Recos framework is clearly beneficial for 
>>>>> what we did, and has the potential to facilitate porting most of the 
>>>>> features from our branch back into the trunk. We would like the deadline 
>>>>> extension in order to analyze the impact of the Recos framework on our 
>>>>> work in depth, and see how we can fit everything together back in the 
>>>>> trunk of Open MPI.
>>>> 
>>>> No problem with the extension - feel free to suggest modifications to make 
>>>> the merge easier. This is by no means cast in stone, but rather a starting 
>>>> point.
>>>> 
>>>>> 
>>>>> Here are some comments about the code:
>>>>> 
>>>>> 1. The documentation in recos.h is not very clear. Most of the functions 
>>>>> use only IN arguments, and are not supposed to return any values. We 
>>>>> don't see how the functions are supposed to be used, or what their impact 
>>>>> on the ORTE framework data is supposed to be.
>>>> 
>>>> I'll try to clarify the comments tonight (I know Josh is occupied right 
>>>> now). The recos APIs are called from two locations:
>>>> 
>>>> 1. The errmgr calls recos whenever it receives a report of an aborted 
>>>> process (via the errmgr.proc_aborted API). The idea was for recos to 
>>>> determine what (if anything) to do about the failed process. 
>>>> 
>>>> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a 
>>>> list of suggested nodes for the process that is to be restarted. At the 
>>>> moment, only the resilient mapper module does this. However, Josh and I 
>>>> are looking at reorganizing some functionality currently in that mapper 
>>>> module and making all of the existing mappers be "resilient".
>>>> 
>>>> So basically, the recos modules determine the recovery procedure and 
>>>> execute it. For example, in the "orcm" module, we actually update the 
>>>> various proc/job objects to prep them for restart and call plm.spawn from 
>>>> within that module. If instead you use the ignore module, it falls through 
>>>> to the recos base functions which call "abort" to kill the job. Again, the 
>>>> action is taken local to recos, so nothing need be returned.
>>>> 
>>>> The functions generally don't return values (other than success/error) 
>>>> because we couldn't think of anything useful to return to the errmgr. 
>>>> Whatever recos does about an aborted proc, the errmgr doesn't do anything 
>>>> further - if you look in that code, you'll see that if recos is enabled, 
>>>> all the errmgr does is call recos and return.
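>>>> 
>>>> To make that hand-off concrete, the errmgr side looks roughly like the 
>>>> sketch below (illustrative only - the module struct name and the errmgr 
>>>> signature are placeholders; the real API is in recos.h on the branch):
>>>> 
>>>>     /* sketch: errmgr hands the failure to the active RecoS module */
>>>>     int proc_aborted(orte_process_name_t *name, orte_proc_state_t state)
>>>>     {
>>>>         int stack_state = 0;
>>>>         orte_job_t *jdata = orte_get_job_data_object(name->jobid);
>>>> 
>>>>         /* the selected module (orcm, ignore, ...) decides whether to 
>>>>          * restart the proc or abort the job; nothing comes back to the 
>>>>          * errmgr beyond success/error */
>>>>         return orte_recos.process_fault(jdata, name, state, &stack_state);
>>>>     }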
>>>> 
>>>> Again, this can be changed if desired.
>>>> 
>>>>> 
>>>>> 2. Why do we have all the char***? Why are they only declared as IN 
>>>>> arguments?
>>>> 
>>>> I take it you mean in the predicted fault API? I believe Josh was 
>>>> including that strictly as a placeholder. As you undoubtedly recall, I 
>>>> removed the fddp framework from the trunk (devel continues off-line), so 
>>>> Josh wasn't sure what I might want to input here. If you look at the 
>>>> modules themselves, you will see the implementation is essentially empty 
>>>> at this time.
>>>> 
>>>> We had discussed simply removing that API until we determined if/when 
>>>> fault prediction would return to the OMPI trunk. It was kind of a 
>>>> tossup - so we left it for now. It could just as easily be removed until a 
>>>> later date - either way is fine with us.
>>>> 
>>>>> 
>>>>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as 
>>>>> an IN/OUT argument. Why? If the list is modified, then we have a 
>>>>> scalability problem, as the list will have to be rebuilt before each call.
>>>> 
>>>> Looking...looking...hmm.
>>>> 
>>>> typedef int (*orte_recos_base_process_fault_fn_t)
>>>>     (orte_job_t *jdata, orte_process_name_t *proc_name,
>>>>      orte_proc_state_t state, int *stack_state);
>>>> 
>>>> There is no node list, or list of any type, going in or out of that 
>>>> function. I suspect you meant the one below it:
>>>> 
>>>> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>>>>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
>>>> 
>>>> I concur with your concern about scalability here. However, I believe the 
>>>> idea was that we would pass in the proc that failed and is to be 
>>>> restarted, a pointer to the node it was last on, and return a list of 
>>>> candidate nodes where it could be restarted. Essentially, this is the 
>>>> equivalent of building the target node list that we do in the mappers 
>>>> whenever we map a job.
>>>> 
>>>> So in the implementation, we use the rmaps base function to assemble the 
>>>> target node list for the app, and then go through some logic (e.g., remove 
>>>> the old node, look at fault groups and load balancing) to prune the list 
>>>> down. We then pass the resulting list back to the caller.
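>>>> 
>>>> As a sketch of that pruning step (the rmaps helper name is hypothetical; 
>>>> the real logic lives in the module on the branch):
>>>> 
>>>>     /* sketch: build the candidate list, then prune it */
>>>>     static int suggest_map_targets(orte_proc_t *proc, orte_node_t *oldnode,
>>>>                                    opal_list_t *node_list)
>>>>     {
>>>>         opal_list_item_t *item, *next;
>>>>         orte_node_t *node;
>>>>         int rc;
>>>> 
>>>>         /* start from the same target list the mappers would build */
>>>>         rc = assemble_target_nodes(proc, node_list);   /* hypothetical */
>>>>         if (ORTE_SUCCESS != rc) {
>>>>             return rc;
>>>>         }
>>>> 
>>>>         /* prune: drop the node the proc just failed on and anything 
>>>>          * marked down (fault groups/load balancing would go here too) */
>>>>         item = opal_list_get_first(node_list);
>>>>         while (item != opal_list_get_end(node_list)) {
>>>>             next = opal_list_get_next(item);
>>>>             node = (orte_node_t*)item;
>>>>             if (node == oldnode || ORTE_NODE_STATE_DOWN == node->state) {
>>>>                 opal_list_remove_item(node_list, item);
>>>>                 OBJ_RELEASE(item);
>>>>             }
>>>>             item = next;
>>>>         }
>>>>         return ORTE_SUCCESS;
>>>>     }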
>>>> 
>>>> If we are going to have frequent process failures, then rebuilding the 
>>>> candidate node list every time would indeed be a problem. I suspect we'll 
>>>> have to revisit that implementation at some point.
>>>> 
>>>> HTH
>>>> Ralph
>>>> 
>>>>> 
>>>>>  Thanks,
>>>>>    george.
>>>>> 
>>>>> On Feb 19, 2010, at 12:59 , Ralph Castain wrote:
>>>>> 
>>>>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI 
>>>>>> trunk
>>>>>> 
>>>>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault 
>>>>>> recovery capabilities
>>>>>> 
>>>>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>>>>> 
>>>>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of 
>>>>>> Feb 27-28
>>>>>> 
>>>>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>>>>> 
>>>>>> ======================================================================
>>>>>> 
>>>>>> BACKGROUND:
>>>>>> 
>>>>>> Josh and Ralph have been working on a private branch off of the trunk on 
>>>>>> extended fault recovery procedures, mostly impacting ORTE. The new code 
>>>>>> optionally allows ORTE to recover from failed nodes, moving processes to 
>>>>>> other nodes in order to maintain operation. In addition, the code 
>>>>>> provides better support for recovering from individual process failures.
>>>>>> 
>>>>>> Not all of the work done on the private branch will be brought over in 
>>>>>> this commit. Some of the MPI-specific code that allows recovery from 
>>>>>> process failure on-the-fly will be committed separately at a later date.
>>>>>> 
>>>>>> This commit will include the infrastructure to support those advanced 
>>>>>> recovery operations. Among other things, this commit will introduce a 
>>>>>> new "RecoS" (Recovery Service/Strategy) framework to allow multiple 
>>>>>> strategies for responding to failures. The default module, called 
>>>>>> "ignore", will stabilize the runtime environment for other RecoS 
>>>>>> components. In the absence of other RecoS components, it triggers the 
>>>>>> default behavior of aborting the job.
>>>>>> 
>>>>>> This branch includes some configure modifications that allow a 
>>>>>> comma-separated list of options to be passed to the '--with-ft' option. 
>>>>>> This allows us to enable any combination of 'cr' and 'recos' at build 
>>>>>> time, specifically so that the RecoS functionality can be enabled 
>>>>>> independently of the C/R functionality. Most of the changes outside of 
>>>>>> the ORTE layer are due to symbol cleanup resulting from this modification.
>>>>>> 
>>>>>> For example, C/R specific code paths were previously incorrectly marked 
>>>>>> with:
>>>>>> #if OPAL_ENABLE_FT == 1
>>>>>> They are now marked as, where appropriate:
>>>>>> #if OPAL_ENABLE_FT_CR == 1
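>>>>>> 
>>>>>> For example, a C/R-only code path now reads something like this 
>>>>>> (illustrative snippet only; the function called inside the guard is a 
>>>>>> hypothetical placeholder, not code from the branch):
>>>>>> 
>>>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>>>     /* compiled in only when 'cr' is included in --with-ft */
>>>>>>     prepare_for_checkpoint();   /* hypothetical helper */
>>>>>> #endif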
>>>>>> 
>>>>>> Additionally, C/R specific components have modified configure.m4 files 
>>>>>> to change:
>>>>>> AS_IF([test "$ompi_want_ft" = "0"],
>>>>>> to:
>>>>>> AS_IF([test "$ompi_want_ft_cr" = "0"],
>>>>>> 
>>>>>> We have created a public repo (reference branch, above) with the code to 
>>>>>> be merged into the trunk. Please feel free to check it out and test it.
>>>>>> 
>>>>>> NOTE: the new recovery capability is only active if...
>>>>>>  (a) you configure --with-ft=recos, and
>>>>>>  (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>>>>> 
>>>>>> Comments, suggestions, and corrections are welcome!
>>>>>> 

