Ralph, Josh,

We have some comments about the API of the new framework, mostly requests for 
clarification on how it is supposed to be used. We would also like to request 
a deadline extension, delaying the merge of the Recos branch into the trunk by 
a week.

We have our own FT branch, which takes a totally different approach from the 
one described in your RFC. Unfortunately, it diverged from the trunk about a 
year ago, and merging it back has proven to be quite difficult. Some of the 
functionality in the Recos framework is clearly beneficial for what we did, and 
has the potential to facilitate porting most of the features from our branch 
back into the trunk. We would like the deadline extension in order to analyze 
in depth the impact of the Recos framework on our work, and to see how we can 
fit everything together back in the trunk of Open MPI.

Here are some comments about the code:

1. The documentation in recos.h is not very clear. Most of the functions take 
only IN arguments and are not supposed to return any value. We don't see how 
the functions are supposed to be used, or what their impact on the ORTE 
framework data is supposed to be.

2. Why do we have all the char***? Why are they only declared as IN arguments?

3. The orte_recos_base_process_fault_fn_t function uses the node_list as an 
IN/OUT argument. Why? If the list is modified, then we have a scalability 
problem, as the list will have to be rebuilt before each call.
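
To make the concern in point 3 concrete, here is a minimal sketch in C. It 
does not use the real recos.h declarations: node_list_t, build_node_list, 
free_node_list, process_fault_inout and process_fault_in are all hypothetical 
names, and the assumption that the callee removes the failed node from the 
list is ours, not something stated in the RFC.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node list; NOT the type declared in recos.h. */
typedef struct {
    char **names;   /* heap-allocated node names */
    size_t count;   /* number of valid entries in names */
} node_list_t;

/* Build a list of n nodes named node0..node(n-1); the cost is O(n). */
static node_list_t build_node_list(size_t n)
{
    node_list_t list;
    list.count = n;
    list.names = malloc(n * sizeof(*list.names));
    assert(list.names != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        list.names[i] = malloc(32);
        assert(list.names[i] != NULL);
        snprintf(list.names[i], 32, "node%zu", i);
    }
    return list;
}

/* Release every name still owned by the list, then the array itself. */
static void free_node_list(node_list_t *list)
{
    for (size_t i = 0; i < list->count; i++) {
        free(list->names[i]);
    }
    free(list->names);
    list->names = NULL;
    list->count = 0;
}

/* IN/OUT style (assumed behavior): the callee removes the failed node,
 * so the caller's list is no longer complete after the call. */
static void process_fault_inout(node_list_t *list, const char *failed)
{
    for (size_t i = 0; i < list->count; i++) {
        if (strcmp(list->names[i], failed) == 0) {
            free(list->names[i]);
            /* Fill the hole with the last entry and shrink the list. */
            list->names[i] = list->names[list->count - 1];
            list->count--;
            return;
        }
    }
}

/* IN style: the list is read-only, so one list can serve every call.
 * Returns 1 if the failed node is in the list, 0 otherwise. */
static int process_fault_in(const node_list_t *list, const char *failed)
{
    for (size_t i = 0; i < list->count; i++) {
        if (strcmp(list->names[i], failed) == 0) {
            return 1;
        }
    }
    return 0;
}
```

With the IN/OUT variant, a caller that needs the full list for the next fault 
has to call build_node_list again, paying O(n) per fault. With the IN variant 
the same list can be reused across calls.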

  Thanks,
    george.

On Feb 19, 2010, at 12:59 , Ralph Castain wrote:

> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
> 
> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery 
> capabilities
> 
> WHERE: Impacts a number of ORTE files and a small number of OMPI files
> 
> TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 
> 27-28
> 
> REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/
> 
> ======================================================================
> 
> BACKGROUND:
> 
> Josh and Ralph have been working on a private branch off of the trunk on 
> extended fault recovery procedures, mostly impacting ORTE. The new code 
> optionally allows ORTE to recover from failed nodes, moving processes to 
> other nodes in order to maintain operation. In addition, the code provides 
> better support for recovering from individual process failures.
> 
> Not all of the work done on the private branch will be brought over in this 
> commit. Some of the MPI-specific code that allows recovery from process 
> failure on-the-fly will be committed separately at a later date.
> 
> This commit will include the infrastructure to support those advanced 
> recovery operations. Among other things, this commit will introduce a new 
> "RecoS" (Recovery Service/Strategy) framework to allow multiple strategies 
> for responding to failures. The default module, called "ignore", will 
> stabilize the runtime environment for other RecoS components. In the absence 
> of other RecoS components it will trigger the default behavior (abort the 
> job) to be executed.
> 
> This branch includes some configure modifications that allow a 
> comma-separated list of options to be passed to the '--with-ft' option. This 
> allows us to enable any combination of 'cr' and 'recos' at build time, 
> specifically so that the RecoS functionality can be enabled independently of the C/R 
> functionality. Most of the changes outside of the ORTE layer are due to 
> symbol cleanup resulting from this modification.
> 
> For example, C/R specific code paths were previously incorrectly marked with:
> #if OPAL_ENABLE_FT == 1
> They are now marked as, where appropriate:
> #if OPAL_ENABLE_FT_CR == 1
> 
> Additionally, C/R specific components have modified configure.m4 files to 
> change:
> AS_IF([test "$ompi_want_ft" = "0"],
> to:
> AS_IF([test "$ompi_want_ft_cr" = "0"],
> 
> We have created a public repo (reference branch, above) with the code to be 
> merged into the trunk. Please feel free to check it out and test it.
> 
> NOTE: the new recovery capability is only active if...
>   (a) you configure --with-ft=recos, and
>   (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
> 
> Comments, suggestions, and corrections are welcome!
> 

