Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

Ralph H Castain Tue, 28 Aug 2007 14:48:58 -0400

On 8/27/07 7:30 AM, "Tim Prins" <[email protected]> wrote:

> Ralph,
> 
> Ralph H Castain wrote:
>> Just returned from vacation...sorry for delayed response
> No Problem. Hope you had a good vacation :) And sorry for my super
> delayed response. I have been pondering this a bit.
> 
>> In the past, I have expressed three concerns about the RSL.
>> <snip>
>> 
>> My bottom line recommendation: I have no philosophical issue with the RSL
>> concept. However, I recommend holding off until the next version of ORTE is
>> completed and then re-evaluating to see how valuable the RSL might be, as
>> that next version will include memory footprint reduction and framework
>> consolidation that may yield much of the RSL's value without the extra work.
>> 
>> 
>> Long version:
>> 
>> 1. What problem are we really trying to solve?
>> If the RSL is intended to solve the Cray support problem (where the Cray OS
>> really just wants to see OMPI, not ORTE), then it may have some value. The
>> issue to date has revolved around the difficulty of maintaining the Cray
>> port in the face of changes to ORTE - as new frameworks are added, special
>> components for Cray also need to be created to provide a "do-nothing"
>> capability. In addition, the Cray is memory constrained, and the ORTE
>> library occupies considerable space while providing very little
>> functionality.
> This is definitely a motivation, but not the only one.

So...what are the others?

> 
>> The degree of value provided by the RSL will therefore depend somewhat on the
>> efficacy of the changes in development within ORTE. Those changes will,
>> among other things, significantly consolidate and reduce the number of
>> frameworks, and reduce the memory footprint. The expectation is that the
>> result will require only a single CNOS component in one framework. It isn't
>> clear, therefore, that the RSL will provide a significant value in that
>> environment.
> But won't there still be a lot of orte code linked in that will never be
> used?

Not really. The only thing left would be the stuff in runtime and util.

We have talked for years about creating an ORTE "services" framework -
basically, combining what is now in the runtime and util directories into a
single framework ala "svcs". The notion was that everything OS-specific
would go in there. What has held up implementation is (a) some thought that
maybe those things should go into OPAL instead of ORTE, and (b) low priority
and more important things to do.

However, if someone went ahead and implemented that idea, then you would
have a "NULL" component in the base that basically does a "no-op", and a
"default" component that provides actual services. Thus, for CNOS, you would
take the NULL component (so you don't open the framework's components and
avoid that memory overhead), and away you go.
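
As a rough illustration of the NULL/default idea, here is a minimal sketch in
C. The "svcs" framework does not exist today, so every name below
(svcs_module_t, svcs_select, the tmpdir service, the "cnos" platform string)
is hypothetical and only meant to show the shape of the selection logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical module interface for a "svcs" framework. These names are
 * illustrative only and are not taken from the ORTE source tree. */
typedef struct {
    const char *name;
    int (*init)(void);
    const char *(*tmpdir)(void);   /* one example of an OS-specific service */
    int (*finalize)(void);
} svcs_module_t;

/* NULL component: every entry point is a no-op. */
static int null_init(void) { return 0; }
static const char *null_tmpdir(void) { return NULL; }
static int null_finalize(void) { return 0; }

static const svcs_module_t svcs_null_module = {
    "null", null_init, null_tmpdir, null_finalize
};

/* Default component: provides the actual service. */
static int default_init(void) { return 0; }
static const char *default_tmpdir(void)
{
    const char *dir = getenv("TMPDIR");
    return (dir != NULL && dir[0] != '\0') ? dir : "/tmp";
}
static int default_finalize(void) { return 0; }

static const svcs_module_t svcs_default_module = {
    "default", default_init, default_tmpdir, default_finalize
};

/* Pick a module without opening any other components. The "cnos" string
 * is an assumed way of naming the platform. */
const svcs_module_t *svcs_select(const char *platform)
{
    assert(platform != NULL);
    if (strcmp(platform, "cnos") == 0) {
        return &svcs_null_module;
    }
    return &svcs_default_module;
}
```

On CNOS the caller would get the no-op module straight from the base, so no
component would ever be opened; everywhere else it would get the default one.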

I don't see how the RSL does anything better. Admittedly, you wouldn't have
to maintain the svcs APIs, but that doesn't seem any more onerous than
maintaining the RSL APIs as we change the MPI/RTE interfaces.

> 
> Also, an RSL would simplify ORTE in that there would be no need to do
> anything special for CNOS in it.

But if all I do is remove the ORTE cnos component and add an RSL cnos
component...what have I simplified?

> 
>> 
>> If the RSL is intended to aid in ORTE development, as hinted at in the RFC,
>> then I believe that is questionable. Developing ORTE in a tmp branch has
>> proven reasonably effective as changes to the MPI layer are largely
>> invisible to ORTE. Creating another layer to the system that would also have
>> to be maintained seems like a non-productive way of addressing any problems
>> in that area.
> Whether or not it would help in orte development remains to be seen. I
> just say that it might. Although I would argue that developing in tmp
> branches has caused a lot of problems with merging, etc.

Guess I don't see how this would solve the merge problems...but whatever.

> 
>> If the RSL is intended as a means of "freezing" the MPI-RTE interface, then
>> I believe we could better attain that objective by simply defining a set of
>> requirements for the RTE. As I'll note below, freezing the interface at an
>> API level could negatively impact other Open MPI objectives.
> It is intended to easily allow the development and use of other runtime
> systems, so simply defining requirements is not enough.

Could you please give some examples of these other runtimes?? Or is this
just hypothetical at this time?


> 
>> 2. Who is going to maintain old RTE versions, and why?
>> It isn't clear to me why anyone would want to do this - are we seriously
>> proposing that we maintain support for the ORTE layer that shipped with Open
>> MPI 1.0?? Can someone explain why we would want to do that?
> I highly doubt anyone would, and see no reason to include support for
> older runtime versions. Again, the purpose is to be able to run
> different runtimes. The ability to run different versions of the same
> runtime is just a side-effect.
> 
>> <snip>
>> 
>> 3. Are we constraining ourselves from further improvements in startup
>> performance?
>> This is my biggest area of concern. The RSL has been proposed as an
>> API-level definition. However, the MPI-RTE interaction really is defined in
>> terms of a flow-of-control - although each point of interaction is
>> instantiated as an API, the fact is that what happens at that point is not
>> independent of all prior interactions.
>> 
>> As an example of my concern, consider what we are currently doing with ORTE.
>> The latest change in requirements involves the need to significantly improve
>> startup time, reduce memory footprint, and reduce ORTE complexity. What we
>> are doing to meet that requirement is to review the delineation of
>> responsibilities between the MPI and RTE layers. The current delineation
>> evolved over time, with many of the decisions made at a very early point in
>> the program. For example, we instituted RTE-level stage gates in the MPI
>> layer because, at the time they were needed, the MPI developers didn't want
>> to deal with them on their side (e.g., ensuring that failure of one proc
>> wouldn't hang the system). Given today's level of maturity in the MPI layer,
>> we are now planning on moving the stage gates to the MPI layer, implemented
>> as an "all-to-all" - this will remove several thousand lines of code from
>> ORTE and make it easier for the MPI layer to operate on non-ORTE
>> environments.
>> 
>> Similar efforts are underway to reduce ORTE involvement in the modex
>> operation and other parts of the MPI application lifecycle. We are able to
>> do these things because we are now moving towards a tight integration of
>> ORTE and OMPI layers - i.e., ORTE can be simplified because we can take
>> advantage of our knowledge of what is happening on the MPI side of the
>> equation.
>> 
>> In order to accomplish this, however, we need to change the
>> points-of-contact between the MPI and RTE layers, and redefine what happens
>> at those points. If we require via the RSL that we "freeze" those points and what
>> happens at those points, then making these changes will either prove
>> impossible or at least will require considerable RSL code. On the other
>> hand, if we revise the RSL to support the new ORTE/OMPI functionality, then
>> we will have to write considerable code to make old versions of ORTE work
>> with the new system.
> Again, I am not particularly concerned with supporting older versions of
> orte, but rather supporting different runtime systems.
> 
> Also, from what I know of these changes (and perhaps I don't understand
> them), the proposed changes would fit into the current RSL design.

Well, we changed the STG3 stage gate to be an MPI call instead of an RTE
call. The STG2 stage gate is gone, replaced by an RTE barrier function. The
STG1 stage gate is going away, replaced by an RTE allgather. All
subscriptions and their associated trigger events are gone, replaced by
direct messaging where necessary (though most of them were simply eliminated
as we now let MPI do data exchanges itself). The spawn call in comm_spawn
has been cleaned up to eliminate lots of unused options - only remaining
things are translation of MPI_Info keys to their corresponding ORTE keys to
avoid abstraction violations, though we have talked about removing this as
well by having ORTE's spawn function read OPAL keyvals instead of ORTE
attributes (I'll probably do that as part of the next revision).
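
To make the shape of that change concrete, here is a minimal sketch in C of
what the MPI layer would need from the RTE once STG1 and STG2 are replaced.
This is not the actual ORTE or RSL interface: rte_ops_t, startup_exchange,
and the single-process rte_local stand-in are all hypothetical names, and the
ordering of the two calls is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical RTE entry points used during startup. */
typedef struct {
    /* Gather len bytes from each of nprocs processes into recv. */
    int (*allgather)(const void *send, size_t len, void *recv, size_t nprocs);
    /* Block until every process has arrived. */
    int (*barrier)(void);
} rte_ops_t;

/* Single-process stand-in so the sketch can run without a real runtime. */
static int local_allgather(const void *send, size_t len, void *recv,
                           size_t nprocs)
{
    if (nprocs != 1) {
        return -1;              /* this stand-in only handles one process */
    }
    memcpy(recv, send, len);
    return 0;
}

static int local_barrier(void) { return 0; }

const rte_ops_t rte_local = { local_allgather, local_barrier };

/* Startup exchange as seen from the MPI layer: no subscriptions or
 * triggers, just two direct calls into the RTE. */
int startup_exchange(const rte_ops_t *rte, const char *my_info,
                     char *all_info, size_t slot_len, size_t nprocs)
{
    assert(rte != NULL && my_info != NULL && all_info != NULL);

    /* Stands in for the old STG1 gate: collect everyone's contact info. */
    if (rte->allgather(my_info, slot_len, all_info, nprocs) != 0) {
        return -1;
    }
    /* Stands in for the old STG2 gate: a plain barrier. */
    return rte->barrier();
}
```

The STG3 equivalent is left out on purpose, since that one is now an MPI call
and would not go through the RTE at all.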

So it seems to me like there will be a few changes to the RSL API, and major
changes to where RSL calls go. As we continue to fine tune performance, I
expect this will continue to happen, hopefully with decreasing frequency as
we "home in" on a final solution.


> 
> Tim
> 
>> 
>> Hence, my concern is that we not let RSL implementation prevent us from
>> moving forward with ORTE. The current work is required to meet scaling
>> demands, and hopefully will resolve much of the Cray issue. I see no value
>> in creating RSL just to support old versions of ORTE, nor for supporting
>> ORTE development. It would be nice if we could re-evaluate this after the
>> next ORTE version becomes solidified to see how the cost/benefit analysis
>> has changed, and whether the RSL remains a desirable option.
>> 
>> Ralph
>> 
>> 
>> 
>> On 8/16/07 7:47 PM, "Tim Prins" <[email protected]> wrote:
>> 
>>> WHAT: Solicitation of feedback on the possibility of adding a runtime
>>> services layer to Open MPI to abstract out the runtime.
>>> 
>>> WHY: To solidify the interface between OMPI and the runtime environment,
>>> and to allow the use of different runtime systems, including different
>>> versions of ORTE.
>>> 
>>> WHERE: Addition of a new framework to OMPI, and changes to many of the
>>> files in OMPI to funnel all runtime requests through this framework. Few
>>> changes should be required in OPAL and ORTE.
>>> 
>>> WHEN: Development has started in tmp/rsl, but is still in its infancy. We
>>> hope to have a working system in the next month.
>>> 
>>> TIMEOUT: 8/29/07
>>> 
>>> ------
>>> Short version:
>>> 
>>> I am working on creating an interface between OMPI and the runtime system.
>>> This would make an RSL framework in OMPI which all runtime services would be
>>> accessed from. Attached is a graphic depicting this.
>>> 
>>> This change would be invasive to the OMPI layer. Few (if any) changes
>>> will be required of the ORTE and OPAL layers.
>>> 
>>> At this point I am soliciting feedback as to whether people are
>>> supportive or not of this change both in general and for v1.3.
>>> 
>>> 
>>> Long version:
>>> 
>>> The current model used in Open MPI assumes that one runtime system is
>>> the best for all environments. However, in many environments it may be
>>> beneficial to have specialized runtime systems. With our current system this
>>> is not easy to do.
>>> 
>>> With this in mind, the idea of creating a 'runtime services layer' was
>>> hatched. This would take the form of a framework within OMPI, through which
>>> all runtime functionality would be accessed. This would allow new or
>>> different runtime systems to be used with Open MPI. Additionally, with such
>>> a system it would be possible to have multiple versions of open rte
>>> coexisting, which may facilitate development and testing. Finally, this
>>> would solidify the interface between OMPI and the runtime system, as well
>>> as provide documentation and side effects of each interface function.
>>> 
>>> However, such a change would be fairly invasive to the OMPI layer, and
>>> needs a buy-in from everyone for it to be possible.
>>> 
>>> Here is a summary of the changes required for the RSL (at least how it is
>>> currently envisioned):
>>> 
>>> 1. Add a framework to ompi for the rsl, and a component to support orte.
>>> 2. Change ompi so that it uses the new interface. This involves:
>>>          a. Moving runtime specific code into the orte rsl component.
>>>          b. Changing the process names in ompi to an opaque object.
>>>          c. Change all references to orte in ompi to be to the rsl.
>>> 3. Change the configuration code so that open-rte is only linked where
>>> needed.
>>> 
>>> Of course, all this would happen on a tmp branch.
>>> 
>>> The design of the rsl is not solidified. I have been playing in a tmp branch
>>> (located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which everyone is
>>> welcome to look at and comment on, but be advised that things here are
>>> subject to change (I don't think it even compiles right now). There are
>>> some fairly large open questions on this, including:
>>> 
>>> 1. How to handle mpirun (that is, when a user types 'mpirun', do they
>>> always get ORTE, or do they sometimes get a system specific runtime). Most
>>> likely mpirun will always use ORTE, and alternative launching programs would
>>> be used for other runtimes.
>>> 2. Whether there will be any performance implications. My guess is that
>>> there will not be, but I am not quite sure of this yet.
>>> 
>>> Again, I am interested in people's comments on whether they think adding
>>> such abstraction is good or not, and whether it is reasonable to do such a
>>> thing for v1.3.
>>> 
>>> Thanks,
>>> 
>>> Tim Prins
>>> _______________________________________________
>>> devel mailing list
>>> [email protected]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> _______________________________________________
>> devel-core mailing list
>> [email protected]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel-core