Re: [OMPI devel] [devel-core] Major reduction in ORTE

Jeff Squyres Thu, 12 Jul 2007 17:46:57 -0400

Thanks for the summary Ralph.

On Jul 12, 2007, at 5:04 PM, Ralph H Castain wrote:

Yo all

As we are discussing functional requirements for the upcoming 1.3
release, I was asked to provide a little info about what is going to be
happening to the ORTE part of the code base over the remainder of this
year.

Short answer: there will be a major code revision to reduce ORTE to the
minimum required to support Open MPI. This includes (a) a major design
change away from event-driven programming that will result in the
consolidation of several frameworks and removal of at least two others;
and (b) general cleanup to reduce memory footprint, startup message
size, and other areas.


Longer explanation:
At the beginning of the Open MPI project, it was quickly determined that
nobody (myself perhaps excepted) really wanted to build/maintain the RTE
underpinning Open MPI. We were, after all, primarily interested in MPI.
Hence, we thought it would be a good thing if we could define an RTE
that would be of adequate general interest to attract partners whose
primary focus would be extension and support of the RTE itself.

Well, after several years, it is clear that the original idea isn't
going to work (for a variety of reasons that aren't worth recounting
here). We therefore decided recently that it is time to accept the
inevitable, quit trying to support a more general RTE, and instead spend
some effort reducing the ORTE layer down to its most basic requirements.
In particular, we want to make the code easier to maintain and debug,
faster and more scalable for startup, and less vulnerable to race
conditions.

In its essence, the plan consists of the following:
1. Remove the cellid from the process name, as the code will solely be a
single-cluster system. Other interested parties have offered to provide
an overlayer that will cross-connect Open MPI instances across
clusters - we will work with them to help facilitate the necessary
hooks, but won't duplicate that connectivity internally.
2. Remove the RDS framework. All discovery and allocation will be done
in a single step in the RAS. We will revise the RAS to allow better
co-existence of resource manager specified allocations and hostfiles
(more on that later).

3. Eliminate the GPR framework, or at the very least remove the
subscribe/trigger functionality from it. We will be moving away from the
current event-driven architecture to reduce our exposure to race
conditions and eliminate the complexity caused by recursive callbacks
due to trigger events. We will explore globalized data storage in
simplified arrays as an alternative to the GPR database - initial tests
support the idea, but further work needs to be done. We know that people
like the Eclipse PTP team need access to certain data - we will work
with them to figure out the best way to do so given the changes
to/departure of the GPR.
4. Consolidate the NS, PLS, RMGR, and SMR framework functionality into a
single process lifecycle management (PLM) framework. PLM components will
still call the ERRMGR to deal with response to process failures, and
will assume responsibility for storing their own data. The SCHEMA
framework will be eliminated as part of this change. We will move some
functions (e.g., orte_abort) that are currently in the runtime and util
areas into the PLM components as appropriate.
5. Each framework will have logic in its "open" function that
specifically prevents it from performing component_open unless we are on
the HNP. If we are not on the HNP, an #if ORTE_WANT_NO_SUPPORT will
force the use of a "no_op" module that does nothing, but whose return
codes will indicate that an error did not occur. If that is not set,
then a proxy module will be utilized that provides appropriate
communications to the HNP to support remote applications. This will
reduce memory footprints (since no components will be opened) and allow
us to simply pass-through MCA params to all processes while ensuring
proper functionality is available. Note that environments like CNOS may
still require special components in some of the frameworks as the
"no_op" may not be suitable for all API functions.
6. The SDS framework will not only support name discovery, but will hold
all backend operations required during startup. For example, the
contents of the message now sent back to the new PLM by each process
will be dependent upon environment. Hence, a one-to-one correspondence
will be established between PLM and SDS components.
7. Consolidate the data in the MPI startup message (currently delivered
at STG1 stagegate). For example, any data in the MPI startup message
that needs to be indexed will be sent in an array sorted by vpid (no
need to send the entire list of process name structs). Whereas before we
couldn't take advantage of our knowledge of the message contents since
it was generated by the GPR (which by design had no insight into the
data), we will now exploit our knowledge to ensure the message is only
that required by the specific environment. We will look at, for example,
the direct one-to-one correspondence of PLM to SDS to see how this can
best be implemented.
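
As a rough illustration of item 5, the module selection in a framework's
"open" function might look something like the sketch below. This is not
actual ORTE code: only the ORTE_WANT_NO_SUPPORT name comes from the plan
above, and every other identifier (example_module_t,
example_framework_open, the three modules, the forwarded-request
counter) is hypothetical. The real framework APIs have more entry points
than the single "launch" call assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Name taken from the plan; defaulting it to 0 is an assumption. */
#ifndef ORTE_WANT_NO_SUPPORT
#define ORTE_WANT_NO_SUPPORT 0
#endif

#define EXAMPLE_SUCCESS 0

/* Hypothetical module interface: one function pointer per API call. */
typedef struct {
    const char *name;
    int (*launch)(int jobid);
} example_module_t;

/* Stands in for "a message was sent to the HNP". */
static int forwarded_requests = 0;

/* Full implementation; only meaningful on the HNP. */
static int hnp_launch(int jobid)   { (void)jobid; return EXAMPLE_SUCCESS; }

/* Does nothing, but reports that no error occurred. */
static int noop_launch(int jobid)  { (void)jobid; return EXAMPLE_SUCCESS; }

/* Would relay the request to the HNP; here it only counts the call. */
static int proxy_launch(int jobid)
{
    (void)jobid;
    forwarded_requests++;
    return EXAMPLE_SUCCESS;
}

const example_module_t example_hnp_module   = { "hnp",   hnp_launch };
const example_module_t example_noop_module  = { "no_op", noop_launch };
const example_module_t example_proxy_module = { "proxy", proxy_launch };

/* Sketch of the "open" logic: components are considered only on the HNP. */
const example_module_t *example_framework_open(bool on_hnp)
{
    if (on_hnp) {
        return &example_hnp_module;   /* component_open would happen here */
    }
#if ORTE_WANT_NO_SUPPORT
    return &example_noop_module;      /* no components opened, calls succeed */
#else
    return &example_proxy_module;     /* no components opened, calls relayed */
#endif
}

/* Callers use the selected module without knowing which one they got. */
int example_launch(const example_module_t *module, int jobid)
{
    assert(module != NULL && module->launch != NULL);
    return module->launch(jobid);
}

int example_forwarded_requests(void) { return forwarded_requests; }
```

Either way, a non-HNP process ends up holding a small static module
rather than a set of opened components, which is where the memory
saving described in item 5 would come from.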

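To make the "array sorted by vpid" idea in item 7 a bit more concrete,
here is a minimal sketch. It is only an assumption about how such
packing could be done, not the planned implementation: the struct
layouts, field widths, and function name are all hypothetical, and it
assumes one job whose vpids run from 0 to n-1 with exactly one entry
each. The name struct carries no cellid, in line with item 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical process name without a cellid (see item 1). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} example_name_t;

/* "Before" layout: every value travels with its full process name. */
typedef struct {
    example_name_t name;
    uint32_t value;
} example_named_entry_t;

/*
 * "After" layout: place each value at the index equal to its vpid, so
 * the names never need to be sent. Returns 0 on success, -1 if a vpid
 * falls outside 0..n-1 (the assumption stated above does not hold).
 */
int example_pack_by_vpid(const example_named_entry_t *in, size_t n,
                         uint32_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i].name.vpid >= n) {
            return -1;
        }
        out[in[i].name.vpid] = in[i].value;
    }
    return 0;
}
```

A receiver that wants the value for a given vpid then simply reads
out[vpid]; the message carries one value per process instead of a name
struct plus a value.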

Other things (e.g., routing of RML messages) are either already under
development or under discussion - we will provide more info on these as
they move along.

As always, any thoughts/suggestions are welcomed.

Ralph


_______________________________________________
devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core



--
Jeff Squyres
Cisco Systems
