WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk

WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery 
capabilities

WHERE: Impacts a number of ORTE files and a small number of OMPI files

TIMEOUT: Barring objections and/or requests for delay, the weekend of Feb 27-28

REFERENCE BRANCH: http://bitbucket.org/rhc/ompi-recos/overview/

======================================================================

BACKGROUND:

Josh and Ralph have been working on extended fault recovery procedures, mostly
impacting ORTE, on a private branch off of the trunk. The new code optionally
allows ORTE to recover from failed nodes by moving processes to other nodes in
order to maintain operation. In addition, the code provides better support for
recovering from individual process failures.

Not all of the work done on the private branch will be brought over in this 
commit. Some of the MPI-specific code that allows recovery from process failure 
on-the-fly will be committed separately at a later date.

This commit will include the infrastructure to support those advanced recovery 
operations. Among other things, this commit will introduce a new "RecoS" 
(Recovery Service/Strategy) framework to allow multiple strategies for 
responding to failures. The default module, called "ignore", will stabilize the 
runtime environment for other RecoS components. In the absence of other RecoS 
components, it will trigger the default behavior (aborting the job).

This branch includes some configure modifications that allow a comma-separated 
list of options to be passed to the '--with-ft' option. This allows us to 
enable any combination of 'cr' and 'recos' at build time, specifically so that 
the RecoS functionality can be enabled independently of the C/R functionality. 
Most of the changes outside of the ORTE layer are due to symbol cleanup 
resulting from this modification.

For example, C/R-specific code paths were previously incorrectly marked with:
#if OPAL_ENABLE_FT == 1
Where appropriate, they are now marked with:
#if OPAL_ENABLE_FT_CR == 1

Additionally, the configure.m4 files of C/R-specific components have been 
modified to change:
AS_IF([test "$ompi_want_ft" = "0"],
to:
AS_IF([test "$ompi_want_ft_cr" = "0"],
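The comma-separated '--with-ft' handling described above can be illustrated
with a small shell sketch. This is not the actual configure code from the
branch: the function name parse_with_ft and the variables want_ft_cr and
want_ft_recos are hypothetical, chosen only to show how a value such as
"cr,recos" could be split into independent flags.

```shell
#!/bin/sh
# Hypothetical helper (not the branch's configure logic): split a
# comma-separated --with-ft value into one flag per option.
parse_with_ft() {
    want_ft_cr=0        # illustrative variable names only
    want_ft_recos=0
    old_ifs=$IFS
    IFS=,               # split the argument on commas
    for opt in $1; do
        case "$opt" in
            cr)    want_ft_cr=1 ;;
            recos) want_ft_recos=1 ;;
            *)     IFS=$old_ifs
                   echo "unknown --with-ft option: $opt" >&2
                   return 1 ;;
        esac
    done
    IFS=$old_ifs        # restore the original field separator
}

parse_with_ft "cr,recos" && echo "cr=$want_ft_cr recos=$want_ft_recos"
parse_with_ft "recos"    && echo "cr=$want_ft_cr recos=$want_ft_recos"
```

With separate flags like these, C/R-only code can be guarded by the C/R flag
alone, which is the intent behind the OPAL_ENABLE_FT_CR and ompi_want_ft_cr
changes shown above.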

We have created a public repo (reference branch, above) with the code to be 
merged into the trunk. Please feel free to check it out and test it.

NOTE: the new recovery capability is only active if...
  (a) you configure --with-ft=recos, and
  (b) you set OMPI_MCA_recos_base_enable=1 to turn it on!
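
As a rough sketch of the two steps in the note above, a build-and-test session
might look like the following. The application name (./my_app) and the process
count are placeholders. The commented '--mca' form assumes the usual Open MPI
equivalence between an OMPI_MCA_<param> environment variable and
'--mca <param> <value>' on the mpirun command line.

```shell
# (a) Build time: enable RecoS (use --with-ft=cr,recos to also enable C/R)
./configure --with-ft=recos
make all install

# (b) Run time: turn the recovery capability on via the environment
export OMPI_MCA_recos_base_enable=1
mpirun -np 4 ./my_app    # ./my_app is a placeholder application

# Assumed equivalent, passing the parameter on the command line instead:
# mpirun --mca recos_base_enable 1 -np 4 ./my_app
```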

Comments, suggestions, and corrections are welcome!
