On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote:

> 
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
> 
>> Ralph, Josh,
>> 
>> We have some comments about the API of the new framework, mostly 
>> clarifications needed to better understand how this new framework is 
>> supposed to be used, and a request for a deadline extension to delay the 
>> merge of the Recos branch into the trunk by a week.
>> 
>> We have our own FT branch, with a totally different approach than the one 
>> described in your RFC. Unfortunately, it diverged from the trunk about a 
>> year ago, and merging back has proven to be quite a difficult task. Some of 
>> the functionality in the Recos framework is clearly beneficial for what we 
>> did, and has the potential to facilitate porting most of the features from 
>> our branch back into the trunk. We would like the deadline extension in 
>> order to analyze in depth the impact of the Recos framework on our work, 
>> and see how we can fit everything together back in the trunk of Open MPI.
> 
> No problem with the extension - feel free to suggest modifications to make 
> the merge easier. This is by no means cast in stone, but rather a starting 
> point.

Additionally, if you want to have a teleconf next week to increase the 
bandwidth of communication, we can do that as well. It might help us negotiate 
some modifications that would be mutually beneficial. Unfortunately, I am 
currently at a conference, so I cannot call in until Monday.

> 
>> 
>> Here are some comments about the code:
>> 
>> 1. The documentation in recos.h is not very clear. Most of the functions 
>> take only IN arguments and are not supposed to return any values. We don't 
>> see how the functions are supposed to be used, or what their intended 
>> impact on the ORTE framework data is.
> 
> I'll try to clarify the comments tonight (I know Josh is occupied right now). 
> The recos APIs are called from two locations:
> 
> 1. The errmgr calls recos whenever it receives a report of an aborted process 
> (via the errmgr.proc_aborted API). The idea was for recos to determine what 
> (if anything) to do about the failed process. 
> 
> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a 
> list of suggested nodes for the process that is to be restarted. At the 
> moment, only the resilient mapper module does this. However, Josh and I are 
> looking at reorganizing some functionality currently in that mapper module 
> and making all of the existing mappers be "resilient".
> 
> So basically, the recos modules determine the recovery procedure and execute 
> it. For example, in the "orcm" module, we actually update the various 
> proc/job objects to prep them for restart and call plm.spawn from within that 
> module. If you instead use the ignore module, it falls through to the recos 
> base functions, which call "abort" to kill the job. Again, the action is 
> taken locally within recos, so nothing needs to be returned.
> 
> The functions generally don't return values (other than success/error) 
> because we couldn't think of anything useful to return to the errmgr. 
> Whatever recos does about an aborted proc, the errmgr doesn't do anything 
> further - if you look in that code, you'll see that if recos is enabled, all 
> the errmgr does is call recos and return.
> 
> Again, this can be changed if desired.
> 
>> 
>> 2. Why do we have all the char***? Why are they only declared as IN 
>> arguments?
> 
> I take it you mean in the predicted fault API? I believe Josh was including 
> that strictly as a placeholder. As you undoubtedly recall, I removed the fddp 
> framework from the trunk (devel continues off-line), so Josh wasn't sure what 
> I might want to input here. If you look at the modules themselves, you will 
> see the implementation is essentially empty at this time.
> 
> We had discussed simply removing that API for now until we determined if/when 
> fault prediction would return to the OMPI trunk. It was kind of a tossup, so 
> we left it in for now. It could just as easily be removed until a later date - 
> either way is fine with us.

In this version of the components, none of them use the predicted_fault API. I 
have at least one component that will come in as a second step (so soon, but 
different RFC) that does use this interface to do some super nifty things (if I 
say so myself :).

We can remove the interface if people have heartburn about it being there, but 
we will want to add it back in soon enough.

As for the 'char ***' parameters, they really should just be IN parameters. 
They are not passed back to the suggestion/detection agent (though I guess 
they could be). In recognition of some of the broader uses of this interface, 
I am considering changing them to a list of RecoS-specific structures that 
would allow the caller of this function to pass additional information for 
each of the parameters (such as an assurance level of the fault - e.g., 75% 
sure this proc has failed).

So we would change:
 (char ***proc_list, char ***node_list, char ***suggested_nodes)
to:
 (orte_recos_proc_t *proc_list, orte_recos_node_t *node_list,
  orte_recos_node_t *suggested_nodes)

What do people think of this?

> 
>> 
>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as an 
>> IN/OUT argument. Why? If the list is modified, then we have a scalability 
>> problem, as the list will have to be rebuilt before each call.
> 
> Looking...looking...hmm.
> 
> typedef int (*orte_recos_base_process_fault_fn_t)
>     (orte_job_t *jdata, orte_process_name_t *proc_name, orte_proc_state_t 
> state, int *stack_state);
> 
> There is no node list, or list of any type, going in or out of that function. 
> I suspect you meant the one below it:
> 
> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
> 
> I concur with your concern about scalability here. However, I believe the 
> idea was that we would pass in the proc that failed and is to be restarted, a 
> pointer to the node it was last on, and return a list of candidate nodes 
> where it could be restarted. Essentially, this is the equivalent of building 
> the target node list that we do in the mappers whenever we map a job.
> 
> So in the implementation, we use the rmaps base function to assemble the 
> target node list for the app, and then go through some logic (e.g., remove 
> the old node, look at fault groups and load balancing) to prune the list 
> down. We then pass the resulting list back to the caller.
> 
> If we are going to have frequent process failures, then rebuilding the 
> candidate node list every time would indeed be a problem. I suspect we'll 
> have to revisit that implementation at some point.

Yeah, there is a bit of a scalability issue for failures in large systems 
and/or frequent failures. I don't think this should hold up the RFC, since it 
is more of a mapper decision than a RecoS discussion. The mapper already 
constructs the target list of nodes for every mapping operation (the list 
might be cached to reduce the 'construction' time), and then the list is just 
passed to the RecoS framework. This allows something like the fault groups 
technique (currently part of the resilient mapper) to be abstracted away into 
the RecoS framework, so we can apply it to any mapper. Continuing with this 
example, it allows the RecoS framework to use accumulated fault information to 
determine the best placement for a recovering process. Some of the 
soon-to-arrive components also use this interface to do nifty things, so it 
does have multiple uses.

-- Josh

> 
> HTH
> Ralph
> 
>> 
>>  Thanks,
>>    george.
>> 
>> On Feb 19, 2010, at 12:59 , Ralph Castain wrote:
>> 
>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>>> 
>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery 
>>> capabilities
>>> 
>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>> 
>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 
>>> 27-28
>>> 
>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>> 
>>> ======================================================================
>>> 
>>> BACKGROUND:
>>> 
>>> Josh and Ralph have been working on a private branch off of the trunk on 
>>> extended fault recovery procedures, mostly impacting ORTE. The new code 
>>> optionally allows ORTE to recover from failed nodes, moving processes to 
>>> other nodes in order to maintain operation. In addition, the code provides 
>>> better support for recovering from individual process failures.
>>> 
>>> Not all of the work done on the private branch will be brought over in this 
>>> commit. Some of the MPI-specific code that allows recovery from process 
>>> failure on-the-fly will be committed separately at a later date.
>>> 
>>> This commit will include the infrastructure to support those advanced 
>>> recovery operations. Among other things, this commit will introduce a new 
>>> "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies 
>>> for responding to failures. The default module, called "ignore", will 
>>> stabilize the runtime environment for other RecoS components. In the 
>>> absence of other RecoS components, it will trigger the default behavior 
>>> (aborting the job).
>>> 
>>> This branch includes some configure modifications that allow a 
>>> comma-separated list of options to be passed to the '--with-ft' option. 
>>> This allows us to enable any combination of 'cr' and 'recos' at build 
>>> time, specifically so that the RecoS functionality can be enabled 
>>> independently of the C/R functionality. Most of the changes outside of the 
>>> ORTE layer are due to symbol cleanup resulting from this modification.
>>> 
>>> For example, C/R specific code paths were previously incorrectly marked 
>>> with:
>>> #if OPAL_ENABLE_FT == 1
>>> They are now marked as, where appropriate:
>>> #if OPAL_ENABLE_FT_CR == 1
>>> 
>>> Additionally, C/R specific components have modified configure.m4 files to 
>>> change:
>>> AS_IF([test "$ompi_want_ft" = "0"],
>>> to:
>>> AS_IF([test "$ompi_want_ft_cr" = "0"],
>>> 
>>> We have created a public repo (reference branch, above) with the code to be 
>>> merged into the trunk. Please feel free to check it out and test it.
>>> 
>>> NOTE: the new recovery capability is only active if...
>>>  (a) you configure --with-ft=recos, and
>>>  (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>> 
>>> Comments, suggestions, and corrections are welcome!
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel