Hi George et al

I have begun documenting the RecoS operation on the OMPI wiki:

https://svn.open-mpi.org/trac/ompi/wiki/RecoS

I'll continue to work on this over the next few days by adding a section 
explaining what was changed outside of the new framework to make it all work. 
In addition, I am revising the recos.h API documentation.

Hope to have all that done over the weekend.


On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote:

> 
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
> 
>> Ralph, Josh,
>> 
>> We have some comments about the API of the new framework, mostly 
>> clarifications needed to better understand how this new framework is 
>> supposed to be used. We also request a deadline extension, to delay the 
>> code merge from the Recos branch into the trunk by a week.
>> 
>> We have our own FT branch, with a totally different approach than what is 
>> described in your RFC. Unfortunately, it diverged from the trunk about a 
>> year ago, and merging back has proven to be quite a difficult task. Some of 
>> the functionality in the Recos framework is clearly beneficial for what we 
>> did, and has the potential to facilitate the porting of most of the features 
>> from our branch back into the trunk. We would like the deadline extension in order 
>> to deeply analyze the impact of the Recos framework on our work, and see how 
>> we can fit everything together back in the trunk of Open MPI.
> 
> No problem with the extension - feel free to suggest modifications to make 
> the merge easier. This is by no means cast in stone, but rather a starting 
> point.
> 
>> 
>> Here are some comments about the code:
>> 
>> 1. The documentation in recos.h is not very clear. Most of the functions use 
>> only IN arguments, and are not supposed to return any values. We don't see 
>> how the functions are supposed to be used, and what is supposed to be their 
>> impact on the ORTE framework data.
> 
> I'll try to clarify the comments tonight (I know Josh is occupied right now). 
> The recos APIs are called from two locations:
> 
> 1. The errmgr calls recos whenever it receives a report of an aborted process 
> (via the errmgr.proc_aborted API). The idea was for recos to determine what 
> (if anything) to do about the failed process. 
> 
> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a 
> list of suggested nodes for the process that is to be restarted. At the 
> moment, only the resilient mapper module does this. However, Josh and I are 
> looking at reorganizing some functionality currently in that mapper module 
> and making all of the existing mappers be "resilient".
> 
> So basically, the recos modules determine the recovery procedure and execute 
> it. For example, in the "orcm" module, we actually update the various 
> proc/job objects to prep them for restart and call plm.spawn from within that 
> module. If instead you use the ignore module, it falls through to the recos 
> base functions which call "abort" to kill the job. Again, the action is taken 
> local to recos, so nothing need be returned.
> 
> The functions generally don't return values (other than success/error) 
> because we couldn't think of anything useful to return to the errmgr. 
> Whatever recos does about an aborted proc, the errmgr doesn't do anything 
> further - if you look in that code, you'll see that if recos is enabled, all 
> the errmgr does is call recos and return.
> 
> Again, this can be changed if desired.
> 
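The control flow described above can be sketched in C roughly as follows. This is a minimal illustration under assumptions, not the actual ORTE code: `job_t`, `recos_module_t`, `ignore_process_fault`, `restart_process_fault`, `errmgr_proc_aborted`, and the `SKETCH_*` codes are all hypothetical names invented for the sketch. It only shows the shape of the idea: the error manager hands the failure to the active recovery module, the module acts on the job itself, and nothing but a status code comes back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the ORTE job object; not the real definition. */
typedef struct {
    int jobid;
    int aborted;    /* set when the whole job is killed */
    int restarts;   /* counts restart attempts made by a recovery module */
} job_t;

/* A recovery module exposes an entry point that acts on the job directly. */
typedef int (*process_fault_fn_t)(job_t *job, int failed_rank);

typedef struct {
    const char *name;
    process_fault_fn_t process_fault;
} recos_module_t;

enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* "ignore"-style behaviour: no recovery, so fall through to aborting the job. */
int ignore_process_fault(job_t *job, int failed_rank)
{
    (void)failed_rank;
    assert(job != NULL);
    job->aborted = 1;
    return SKETCH_SUCCESS;
}

/* Recovery-style behaviour: the module itself prepares and triggers a restart. */
int restart_process_fault(job_t *job, int failed_rank)
{
    (void)failed_rank;
    assert(job != NULL);
    job->restarts += 1;  /* stands in for updating proc/job objects and respawning */
    return SKETCH_SUCCESS;
}

const recos_module_t ignore_module  = { "ignore",  ignore_process_fault };
const recos_module_t restart_module = { "restart", restart_process_fault };

/* Error-manager side: call the active module and return its status, nothing more. */
int errmgr_proc_aborted(const recos_module_t *active, job_t *job, int failed_rank)
{
    if (active == NULL || active->process_fault == NULL) {
        return SKETCH_ERROR;
    }
    return active->process_fault(job, failed_rank);
}
```

Calling `errmgr_proc_aborted(&ignore_module, &job, rank)` marks the job as aborted, while passing `&restart_module` leaves it running and records a restart. In both cases the caller only sees a success or error code, which is why the API needs no OUT arguments.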
>> 
>> 2. Why do we have all the char***? Why are they only declared as IN 
>> arguments?
> 
> I take it you mean in the predicted fault API? I believe Josh was including 
> that strictly as a placeholder. As you undoubtedly recall, I removed the fddp 
> framework from the trunk (devel continues off-line), so Josh wasn't sure what 
> I might want to input here. If you look at the modules themselves, you will 
> see the implementation is essentially empty at this time.
> 
> We had discussed simply removing that API for now until we determined if/when 
> fault prediction would return to the OMPI trunk. It was kind of a tossup - so 
> we left it for now. Could just as easily be removed until a later date - 
> either way is fine with us.
> 
>> 
>> 3. The orte_recos_base_process_fault_fn_t function uses the node_list as an 
>> IN/OUT argument. Why? If the list is modified, then we have a scalability 
>> problem, as the list will have to be rebuilt before each call.
> 
> Looking...looking...hmm.
> 
> typedef int (*orte_recos_base_process_fault_fn_t)
>     (orte_job_t *jdata, orte_process_name_t *proc_name,
>      orte_proc_state_t state, int *stack_state);
> 
> There is no node list, or list of any type, going in or out of that function. 
> I suspect you meant the one below it:
> 
> typedef int (*orte_recos_base_suggest_map_targets_fn_t)
>     (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list);
> 
> I concur with your concern about scalability here. However, I believe the 
> idea was that we would pass in the proc that failed and is to be restarted, a 
> pointer to the node it was last on, and return a list of candidate nodes 
> where it could be restarted. Essentially, this is the equivalent of building 
> the target node list that we do in the mappers whenever we map a job.
> 
> So in the implementation, we use the rmaps base function to assemble the 
> target node list for the app, and then go through some logic (e.g., remove 
> the old node, look at fault groups and load balancing) to prune the list 
> down. We then pass the resulting list back to the caller.
> 
> If we are going to have frequent process failures, then rebuilding the 
> candidate node list every time would indeed be a problem. I suspect we'll 
> have to revisit that implementation at some point.
> 
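The pruning step described above might look something like the following C sketch. It is an assumption-laden illustration, not the real implementation: `node_t`, its `fault_group` field, and `suggest_map_targets_sketch` are hypothetical, plain arrays replace `opal_list_t`, and the rule "skip nodes in the same fault group as the old node" is only one guess at what the fault-group logic could do. Load balancing is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a node object; not the real orte_node_t. */
typedef struct {
    const char *name;
    int fault_group;   /* assumed: nodes in one group tend to fail together */
} node_t;

/*
 * Walk the full node list and collect restart candidates into out[].
 * Skips the node the process last ran on and, as an assumption of this
 * sketch, every node in the same fault group. Returns the candidate count.
 */
size_t suggest_map_targets_sketch(const node_t *all, size_t n_all,
                                  const node_t *oldnode,
                                  const node_t **out, size_t out_cap)
{
    size_t n_out = 0;

    assert(all != NULL && oldnode != NULL && out != NULL);

    for (size_t i = 0; i < n_all && n_out < out_cap; i++) {
        if (strcmp(all[i].name, oldnode->name) == 0) {
            continue;   /* never restart on the node that just failed */
        }
        if (all[i].fault_group == oldnode->fault_group) {
            continue;   /* avoid nodes likely to share the same failure */
        }
        out[n_out++] = &all[i];
    }
    return n_out;
}
```

Note that every call scans the whole node list again. That full rescan per failure is the scalability concern raised in this thread.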
> HTH
> Ralph
> 
>> 
>>  Thanks,
>>    george.
>> 
>> On Feb 19, 2010, at 12:59 , Ralph Castain wrote:
>> 
>>> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>>> 
>>> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery 
>>> capabilities
>>> 
>>> WHERE: Impacts a number of ORTE files and a small number of OMPI files
>>> 
>>> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 
>>> 27-28
>>> 
>>> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
>>> 
>>> ======================================================================
>>> 
>>> BACKGROUND:
>>> 
>>> Josh and Ralph have been working on a private branch off of the trunk on 
>>> extended fault recovery procedures, mostly impacting ORTE. The new code 
>>> optionally allows ORTE to recover from failed nodes, moving processes to 
>>> other nodes in order to maintain operation. In addition, the code provides 
>>> better support for recovering from individual process failures.
>>> 
>>> Not all of the work done on the private branch will be brought over in this 
>>> commit. Some of the MPI-specific code that allows recovery from process 
>>> failure on-the-fly will be committed separately at a later date.
>>> 
>>> This commit will include the infrastructure to support those advanced 
>>> recovery operations. Among other things, this commit will introduce a new 
>>> "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies 
>>> for responding to failures. The default module, called "ignore", will 
>>> stabilize the runtime environment for other RecoS components. In the 
>>> absence of other RecoS components it will trigger the default behavior 
>>> (abort the job) to be executed.
>>> 
>>> This branch includes some configure modifications that allow a comma 
>>> separated list of options to be passed to the '--with-ft' option. This 
>>> allows us to enable any combination of 'cr' and 'recos' at build time, 
>>> specifically so that the RecoS functionality can be enabled independently of 
>>> the C/R functionality. Most of the changes outside of the ORTE layer are 
>>> due to symbol cleanup resulting from this modification.
>>> 
>>> For example, C/R specific code paths were previously incorrectly marked 
>>> with:
>>> #if OPAL_ENABLE_FT == 1
>>> They are now marked as, where appropriate:
>>> #if OPAL_ENABLE_FT_CR == 1
>>> 
>>> Additionally, C/R specific components have modified configure.m4 files to 
>>> change:
>>> AS_IF([test "$ompi_want_ft" = "0"],
>>> to:
>>> AS_IF([test "$ompi_want_ft_cr" = "0"],
>>> 
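To illustrate the symbol split described above, here is a small C sketch of the two guards compiling different code paths. The macro names `OPAL_ENABLE_FT` and `OPAL_ENABLE_FT_CR` come from the RFC. Everything else is an assumption for illustration: the default values below model a build with fault tolerance on and C/R off, and the function names are hypothetical. In a real build the configure script would define the macros.

```c
#include <assert.h>

/* Defaults for this sketch only; a real build gets these from configure. */
#ifndef OPAL_ENABLE_FT
#define OPAL_ENABLE_FT 1      /* assumed: some fault tolerance support is on */
#endif
#ifndef OPAL_ENABLE_FT_CR
#define OPAL_ENABLE_FT_CR 0   /* assumed: checkpoint/restart was not requested */
#endif

/* Returns 1 if the generic fault tolerance path was compiled in, else 0. */
int ft_path_compiled(void)
{
#if OPAL_ENABLE_FT == 1
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the C/R-specific path was compiled in, else 0. */
int cr_path_compiled(void)
{
#if OPAL_ENABLE_FT_CR == 1
    return 1;
#else
    return 0;
#endif
}

/* Sanity check that both flags are plain 0/1 switches. */
void check_build_flags(void)
{
    assert(OPAL_ENABLE_FT == 0 || OPAL_ENABLE_FT == 1);
    assert(OPAL_ENABLE_FT_CR == 0 || OPAL_ENABLE_FT_CR == 1);
}
```

With the defaults above, code guarded only by `OPAL_ENABLE_FT` is built while code guarded by `OPAL_ENABLE_FT_CR` is skipped. That is the situation the old single guard could not express.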
>>> We have created a public repo (reference branch, above) with the code to be 
>>> merged into the trunk. Please feel free to check it out and test it.
>>> 
>>> NOTE: the new recovery capability is only active if...
>>>  (a) you configure --with-ft=recos, and
>>>  (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
>>> 
>>> Comments, suggestions, and corrections are welcome!
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
> 
