This has been committed in r22872.
Let me know if you see any problems with the commit.
-- Josh
On Mar 23, 2010, at 7:57 AM, Joshua Hursey wrote:
Just a reminder that this RFC will go into the trunk this evening
unless there are strong objections.
We intend to let this soak for a few days, then bring it over to the
1.5 series (after the 1.5.0 release).
-- Josh
On Mar 15, 2010, at 9:26 AM, Josh Hursey wrote:
(Updated RFC, per offline discussion)
WHAT: Merge a tmp branch for fault recovery development into the
OMPI trunk
WHY: Bring over work done by Josh and Ralph to extend OMPI's fault
recovery capabilities
WHERE: Impacts a number of ORTE files and the ORTE ErrMgr framework
TIMEOUT: Barring objections and/or further requests for delay,
evening of March 23
REFERENCE BRANCH: http://bitbucket.org/jjhursey/orte-errmgr/
======================================================================
BACKGROUND:
Josh and Ralph have been developing extended fault recovery
procedures, mostly impacting ORTE, on a private branch off of the trunk.
The new code optionally allows ORTE to recover from failed nodes,
moving processes to other nodes in order to maintain operation. In
addition, the code provides better support for recovering from
individual process failures.
Not all of the work done on the private branch will be brought over
in this commit. Some of the MPI-specific code that allows recovery
from process failure on-the-fly will be committed separately at a
later date. This commit provides the foundation for ORTE
stabilization that can be built upon to provide OMPI layer
stability in the future.
This commit significantly modifies the ORTE ErrMgr framework to
support those advanced recovery operations. The ErrMgr public
interface has been preserved since it is used in various places
throughout the codebase, and should continue to be used as normal.
The ErrMgr framework has been internally redesigned to better
support multiple strategies for responding to failures (this
represents a merge of the old ErrMgr and RecoS frameworks into the
ErrMgr 3.0 component interface). The default (base) mode will
continue to work exactly the same as today, aborting the job when a
failure occurs. However, if the user elects to enable recovery,
then one or more ErrMgr components will be activated to determine
the recovery policy for the job.
We have created a public repo (reference branch, above) with the
code to be merged into the trunk (r22815). Please feel free to
check it out and test it.
NOTE: The new recovery capability is only active if the user elects
to use it by setting the MCA parameter errmgr_base_enable_recovery
to '1'.
NOTE: More ErrMgr recovery components will be coming online in the
near future; currently this branch only includes the 'orcm' module
for ORTE process recovery (not MPI processes). If you want to
experiment with this feature, below are the MCA parameters that you
will need to get started.
#################################
plm=rsh
rmaps=resilient
routed=cm
errmgr_base_enable_recovery=1
#################################
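If it is easier, the same settings can also be passed directly on
the mpirun command line instead of via an MCA parameter file; a
minimal sketch (the application name is just a placeholder):
#################################
mpirun --mca plm rsh \
       --mca rmaps resilient \
       --mca routed cm \
       --mca errmgr_base_enable_recovery 1 \
       -np 4 ./my_app
#################################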
Comments, suggestions, and corrections are welcome!
On Mar 10, 2010, at 2:22 PM, Josh Hursey wrote:
Wesley,
Thanks for catching that oversight. Below are the MCA parameters
that you should need at the moment:
#####################################
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher; other launchers will be supported later
plm=rsh
# The resilient mapper knows how to use RecoS and deal with recovering procs
rmaps=resilient
# The 'cm' component is the only one that can handle failures at the moment
routed=cm
#####################################
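If you would rather not pass these on every command line, Open MPI
also reads them from the per-user MCA parameter file; a sketch,
assuming a default installation:
#####################################
# Append to $HOME/.openmpi/mca-params.conf (one 'key = value' per line):
recos_base_enable = 1
plm = rsh
rmaps = resilient
routed = cm
#####################################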
Let me know if you have any troubles.
-- Josh
On Mar 10, 2010, at 10:36 AM, Wesley Bland wrote:
Josh,
You mentioned some MCA parameters that you would include in the
email, but I don't see those parameters anywhere. Could you
please put those in here to make testing easier for people?
Wesley
On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey <jjhursey@open-mpi.org> wrote:
Yesterday evening, George, Thomas, and I discussed some of their
concerns about this RFC at the MPI Forum meeting. After the
discussion, we seemed to be in agreement that the RecoS framework
is a good idea and the concepts and fixes in this RFC should move
forward with a couple of notes:
- They wanted to test the branch a bit more over the next couple
of days. Some MCA parameters that you will need are at the bottom
of this message.
- Reiterate that this RFC only addresses ORTE stability, not OMPI
stability. The OMPI stability extension is a second step in this
line of work, and should/will fit in nicely with the RecoS
framework being proposed in this RFC. OMPI layer stability will
require a significant amount of work, but the RecoS framework will
provide the ORTE layer stability that it requires as a foundation.
- The purpose of the ErrMgr becomes slightly unclear with the
addition of the RecoS framework, since both are focused on
responding to faults in the system (and RecoS, when enabled,
overrides most/all of the ErrMgr functionality). Should the RecoS
framework be merged with the ErrMgr framework to create a new
ErrMgr interface?
We are trying to decide if we should merge these frameworks, but
at this point we are interested in hearing how other developers
feel about merging the ErrMgr and RecoS frameworks, which would
change the ErrMgr API. Are there any developers out there who
are developing ErrMgr components, or who are using any particular
features of the existing ErrMgr framework that they would like to
see preserved in the next revision? By default, the existing
abort behavior of the ErrMgr framework will be preserved, so the
user will have to 'opt-in' to any fault recovery capabilities.
So we are continuing the discussion a bit more off-list, and will
return to the list with an updated RFC (and possibly a new
branch) soon (hopefully end of the week/early next week). I would
like to briefly discuss this RFC at the Open MPI teleconf next
Tuesday.
-- Josh
On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:
Sounds good to me.
For those casually following this RFC let me summarize its
current state.
Josh and George (and anyone else attending the forum who wishes
to participate) will meet sometime at the next MPI Forum
meeting (March 8-10). I will post any relevant notes from this
meeting back to the list afterwards. So the RFC is on hold
pending the outcome of that meeting. For those developers
interested in this RFC who will not be able to attend, feel
free to continue using this thread for discussion.
Thanks,
Josh
On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
Any of those options are fine with me. I was thinking that if
you wanted to talk sooner, we might be able to help explain
our intentions with this framework a bit better. I figure that
the framework interface will change a bit as we all advance
and incorporate our various techniques into it. I think that
the current interface is a good first step, but there are
certainly many more steps to come.
I am fine delaying this code a bit, just not too long. Meeting
at the forum for a while might be a good option (we could
probably even arrange to call in others if you wanted).
Sounds good, let's do this.
Thanks,
george.
Cheers,
Josh
On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
If Josh is going to be at the forum, perhaps you folks could
chat there? Might as well take advantage of being colocated,
if possible.
Otherwise, I'm available pretty much any time. I can't
contribute much about the MPI recovery issues, but can
contribute to the RTE issues if that helps.
On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
Josh,
Next week is a little bit too early, as we will need some time to
figure out how to integrate with this new framework, and to what
extent our code and requirements fit into it. Then the week
after is the MPI Forum. How about Thursday, 11 March?
Thanks,
george.
On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
Per my previous suggestion, would it be useful to chat on
the phone early next week about our various strategies?
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel