Thanks for the summary, Ralph.
On Jul 12, 2007, at 5:04 PM, Ralph H Castain wrote:
Yo all
As we are discussing functional requirements for the upcoming 1.3 release, I was asked to provide a little info about what is going to be happening to the ORTE part of the code base over the remainder of this year.
Short answer: there will be a major code revision to reduce ORTE to the minimum required to support Open MPI. This includes (a) a major design change away from event-driven programming that will result in the consolidation of several frameworks and removal of at least two others; and (b) general cleanup to reduce memory footprint, startup message size, and other areas.
Longer explanation:
At the beginning of the Open MPI project, it was quickly determined that nobody (myself perhaps excepted) really wanted to build/maintain the RTE underpinning Open MPI. We were, after all, primarily interested in MPI. Hence, we thought it would be a good thing if we could define an RTE that would be of adequate general interest to attract partners whose primary focus would be extension and support of the RTE itself.
Well, after several years, it is clear that the original idea isn't going to work (for a variety of reasons that aren't worth recounting here). We therefore decided recently that it is time to accept the inevitable, quit trying to support a more general RTE, and instead spend some effort reducing the ORTE layer down to its most basic requirements. In particular, we want to make the code easier to maintain and debug, faster and more scalable for startup, and less vulnerable to race conditions.
In its essence, the plan consists of the following:
1. Remove the cellid from the process name, as the code will solely be a single-cluster system. Other interested parties have offered to provide an overlayer that will cross-connect Open MPI instances across clusters - we will work with them to help facilitate the necessary hooks, but won't duplicate that connectivity internally.
2. Remove the RDS framework. All discovery and allocation will be done in a single step in the RAS. We will revise the RAS to allow better co-existence of resource manager specified allocations and hostfiles (more on that later).
3. Eliminate the GPR framework, or at the very least remove the subscribe/trigger functionality from it. We will be moving away from the current event-driven architecture to reduce our exposure to race conditions and eliminate the complexity caused by recursive callbacks due to trigger events. We will explore globalized data storage in simplified arrays as an alternative to the GPR database - initial tests support the idea, but further work needs to be done. We know that people like the Eclipse PTP team need access to certain data - we will work with them to figure out the best way to provide it given the changes to/departure of the GPR.
4. Consolidate the NS, PLS, RMGR, and SMR framework functionality into a single process lifecycle management (PLM) framework. PLM components will still call the ERRMGR to deal with response to process failures, and will assume responsibility for storing their own data. The SCHEMA framework will be eliminated as part of this change. We will move some functions (e.g., orte_abort) that are currently in the runtime and util areas into the PLM components as appropriate.
5. Each framework will have logic in its "open" function that specifically prevents it from performing component_open unless we are on the HNP. If we are not on the HNP, an #if ORTE_WANT_NO_SUPPORT will force the use of a "no_op" module that does nothing, but whose return codes will indicate that an error did not occur. If that is not set, then a proxy module will be utilized that provides appropriate communications to the HNP to support remote applications. This will reduce memory footprints (since no components will be opened) and allow us to simply pass through MCA params to all processes while ensuring proper functionality is available. Note that environments like CNOS may still require special components in some of the frameworks, as the "no_op" may not be suitable for all API functions.
6. The SDS framework will not only support name discovery, but will hold all backend operations required during startup. For example, the contents of the message now sent back to the new PLM by each process will be dependent upon environment. Hence, a one-to-one correspondence will be established between PLM and SDS components.
7. Consolidate the data in the MPI startup message (currently delivered at STG1 stagegate). For example, any data in the MPI startup message that needs to be indexed will be sent in an array sorted by vpid (no need to send the entire list of process name structs). Whereas before we couldn't take advantage of our knowledge of the message contents since it was generated by the GPR (which by design had no insight into the data), we will now exploit our knowledge to ensure the message is only that required by the specific environment. We will look at, for example, the direct one-to-one correspondence of PLM to SDS to see how this can best be implemented.
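To make the vpid-sorted array in item 7 a bit more concrete, here is a rough sketch of the idea. None of this is actual ORTE code: the sketch_* types, the node_index field, and pack_by_vpid are hypothetical names invented for illustration, and the sketch assumes the vpids within a job are dense (0..n-1) with no duplicates.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types for illustration only; NOT the real ORTE definitions. */
typedef uint32_t sketch_vpid_t;

typedef struct {
    uint32_t jobid;
    sketch_vpid_t vpid;
} sketch_name_t;

/* "Before": every value travels together with a full process name struct. */
typedef struct {
    sketch_name_t name;
    uint32_t node_index;   /* assumed example payload */
} sketch_named_entry_t;

/*
 * "After": place each value at the slot given by its vpid, so the array
 * position identifies the process and no name struct has to be sent.
 * Returns a malloc'd array of n values (caller frees), or NULL if the
 * allocation fails or a vpid falls outside 0..n-1.
 */
static uint32_t *pack_by_vpid(const sketch_named_entry_t *entries, size_t n)
{
    uint32_t *packed = malloc(n * sizeof(*packed));
    if (NULL == packed) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        sketch_vpid_t v = entries[i].name.vpid;
        if (v >= n) {          /* violates the dense-vpid assumption */
            free(packed);
            return NULL;
        }
        packed[v] = entries[i].node_index;
    }
    return packed;
}
```

A receiver would then read the value for vpid v as packed[v]; with this layout each entry shrinks from a name struct plus payload to the payload alone.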
Other things (e.g., routing of RML messages) are either already under development or under discussion - we will provide more info on these as they move along.
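For the framework "open" logic described in item 5, something along these lines is one way it might look. This is only a sketch under assumptions: apart from the ORTE_WANT_NO_SUPPORT macro named above, every identifier here (the sketch_* names, the single launch entry point, the on_hnp flag) is hypothetical and not taken from the code base.

```c
#include <stdbool.h>

#define SKETCH_SUCCESS 0

/* Hypothetical module interface with a single entry point. */
typedef struct {
    const char *name;
    int (*launch)(int jobid);
} sketch_module_t;

/* Full module: stands in for a component opened and selected on the HNP. */
static int full_launch(int jobid) { (void)jobid; return SKETCH_SUCCESS; }
static const sketch_module_t full_module = { "full", full_launch };

#if ORTE_WANT_NO_SUPPORT
/* no_op module: does nothing, but reports that no error occurred. */
static int noop_launch(int jobid) { (void)jobid; return SKETCH_SUCCESS; }
static const sketch_module_t remote_module = { "no_op", noop_launch };
#else
/* Proxy module: a real one would forward the request to the HNP. */
static int proxy_launch(int jobid) { (void)jobid; return SKETCH_SUCCESS; }
static const sketch_module_t remote_module = { "proxy", proxy_launch };
#endif

/* Framework "open": components are only opened when running on the HNP. */
static const sketch_module_t *sketch_framework_open(bool on_hnp)
{
    if (on_hnp) {
        /* the only path on which component_open would be performed */
        return &full_module;
    }
    /* not on the HNP: no components opened, use the compiled-in fallback */
    return &remote_module;
}
```

Callers would use the returned module the same way in either case, which is what lets MCA params be passed through to all processes without every process opening components.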
As always, any thoughts/suggestions are welcomed.
Ralph
_______________________________________________
devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core
--
Jeff Squyres
Cisco Systems