Re: [OMPI devel] More memory troubles with vapi

2007-08-24 Thread Jeff Squyres

On Aug 24, 2007, at 4:18 PM, Josh Aune wrote:


We are using open-mpi on several 1000+ node clusters.  We received
several new clusters using the Infiniserve 3.X software stack recently
and are having several problems with the vapi btl (yes, I know, it is
very very old and shouldn't be used.  I couldn't agree with you more
but those are my marching orders).


Thankfully, Infiniserve is not within my purview.  But -- FWIW -- you  
should be using OFED.  :-)  (I know you know)



I have a new application that is running into swap for an unknown
reason.  If I run and force it to use the tcp btl I don't seem to run
into swap (the job just takes a very very long time).  I have tried
restricting the size of the free lists, forcing send mode, and using
an open-mpi compiled w/ no memory manager, but nothing seems to
help.  I've profiled with valgrind --tool=massif and the memtrace
capabilities of ptmalloc but I don't have any smoking guns yet.  It is
a fortran app and I don't know anything about debugging fortran memory
problems; can someone point me in the proper direction?


Hmm.  If you compile Open MPI with no memory manager, then it  
*shouldn't* be Open MPI's fault (unless there's a leak in the mvapi  
BTL...?).  Verify that you did not actually compile Open MPI with a  
memory manager by running "ompi_info | grep ptmalloc2" -- it should  
come up empty.


The fact that you can run this under TCP without memory leaking would  
seem to indicate that it's not the app that's leaking memory, but  
rather either the MPI or the network stack.


--
Jeff Squyres
Cisco Systems



[OMPI devel] More memory troubles with vapi

2007-08-24 Thread Josh Aune
We are using open-mpi on several 1000+ node clusters.  We received
several new clusters using the Infiniserve 3.X software stack recently
and are having several problems with the vapi btl (yes, I know, it is
very very old and shouldn't be used.  I couldn't agree with you more
but those are my marching orders).

I have a new application that is running into swap for an unknown
reason.  If I run and force it to use the tcp btl I don't seem to run
into swap (the job just takes a very very long time).  I have tried
restricting the size of the free lists, forcing send mode, and using
an open-mpi compiled w/ no memory manager, but nothing seems to
help.  I've profiled with valgrind --tool=massif and the memtrace
capabilities of ptmalloc but I don't have any smoking guns yet.  It is
a fortran app and I don't know anything about debugging fortran memory
problems; can someone point me in the proper direction?

Thanks,
Josh


Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread Doug Tody
On Fri, 24 Aug 2007, George Bosilca wrote:

> On Aug 24, 2007, at 9:50 AM, Tim Prins wrote:
> > I do not understand why a user should have to use an RTE which supports
> > every system ever imagined, and provides every possible fault-tolerant
> > feature, when all they want is a thin RTE.
> 
> We have all the ingredients to make this RTE layer, i.e. loadable  
> modules. The approach we proposed a few months ago, to load a component  
> only when we know it will be needed, gives us a very slim RTE (once  
> applied everywhere it makes sense). The biggest problem I see here is  
> that we will start scattering our efforts on multiple things instead  
> of working together to make what we have right now the best it can be.

I'm all for focusing effort on ORTE and making it the best it can
be, but it would seem that a more formalized component-framework
interface between the MPI layer and all of ORTE could potentially
help to achieve this.

What would be ideal would be if the OpenMPI project could define
such an interface, and also provide and support a standard reference
version of ORTE which implements this functionality.  This could
provide the OpenMPI project with the minimal/stable run time layer it
needs, but at the same time make it much easier for outside projects
with other requirements to experiment with enhanced versions of ORTE,
without having to worry about the impact on core OpenMPI development.
This need not splinter the effort, rather it might make it possible for
others outside the core OpenMPI development team to more effectively
contribute to and use OpenMPI and ORTE, in particular when it comes
to integration of the software into new environments.

- Doug

National Radio Astronomy Observatory (NRAO)
US National Virtual Observatory (NVO)


[MTT devel] Thoughts on tagging...

2007-08-24 Thread Jeff Squyres
I volunteered to do this on the call today.  Here's my thoughts on  
tagging:


1. From the client, it would be nice to be able to specify a  
comma-delimited list of tags at any phase.  Tags would be inherited by  
successive phases if not explicitly overridden.  E.g., if you specify  
a "foo" tag in an MPI get, it'll be used in all phases that use that  
MPI get.


Tags can be specified in one of three forms:

  +foo: means to *add* this tag to the existing/inherited set
  -foo: means to *remove* this tag from the existing/inherited set
  foo: if any tag does not have a +/- prefix, then the inherited set  
is cleared, effectively making the current set of tags be only the  
non-prefixed tags and +tags


For example:

  [MPI Get: AAA]
  # + and - have little meaning for MPI Get
  tags = foo, bar, baz

  [Test Get: BBB]
  # + and - have little meaning for Test Get
  tags = yar, fweezle, bozzle

  [Test Build: CCC]
  # Test build inherits tags from MPI Get and Test Get
  tags = +fa-schizzle, -yar
  # Resulting tag set: foo, bar, baz, fweezle, bozzle, fa-schizzle

  [Test build: DDD]
  # Override everything
  tags = yowza, gurple
  # Resulting tag set: yowza, gurple
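For concreteness, the +/- inheritance rule above can be sketched in a
few lines of Python.  This is a hypothetical illustration only -- the
function and variable names are invented and this is not MTT client
code:

```python
def resolve_tags(inherited, spec):
    """Resolve a phase's "tags = ..." value against the inherited tag set.

    Sketch of the proposed rule: "+foo" adds a tag, "-foo" removes one,
    and any bare (unprefixed) tag clears the inherited set first.
    """
    entries = [t.strip() for t in spec.split(",") if t.strip()]
    bare = [t for t in entries if t[0] not in "+-"]
    # Any bare tag means we start over instead of inheriting.
    tags = set(bare) if bare else set(inherited)
    for t in entries:
        if t.startswith("+"):
            tags.add(t[1:])
        elif t.startswith("-"):
            tags.discard(t[1:])
    return tags

# Walking the AAA/BBB/CCC/DDD example from above:
inherited = resolve_tags(set(), "foo, bar, baz")           # [MPI Get: AAA]
inherited |= resolve_tags(set(), "yar, fweezle, bozzle")   # [Test Get: BBB]
ccc = resolve_tags(inherited, "+fa-schizzle, -yar")
# ccc == {'foo', 'bar', 'baz', 'fweezle', 'bozzle', 'fa-schizzle'}
ddd = resolve_tags(inherited, "yowza, gurple")
# ddd == {'yowza', 'gurple'}
```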

2. For the reporter, I think we only want authenticated users to be  
able to create / manipulate tags.  Authentication can be via SVN  
username / password or the HTTPS submit username / password; I don't  
have strong preferences.


Anyone can query on tags, of course.

3. We should have easy "add these results to a tag" and "remove these  
results from a tag" operations, similar to GMail/labels.  I think the  
rule should be that if you can show MPI details (i.e., not the  
summary page), you can add/remove tags.  Perhaps something as simple  
as a text box with two buttons: Add tag, Remove tag.


3a. Example: you drill down to a set of test runs.  You type in "jeff  
results" in the text box and click the "add tag" button.  This adds  
the tag "jeff results" to all the result rows that are checked (it is  
not an error if the "jeff results" tag already exists on some/all of  
the result rows).


3b. Example: you drill down to a set of test runs.  You type in "jeff  
results" in the text box and click on the "remove tag" button.  This  
removes the tag "jeff results" from all the result rows that are  
checked (it is not an error if the "jeff results" tag is not on  
some/all of the result rows).
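If tags are modeled as sets, the "not an error" behavior in 3a/3b falls
out for free because set insertion and removal are idempotent.  A
minimal sketch (hypothetical names, not the actual reporter code):

```python
def add_tag(rows, checked_ids, tag):
    """Attach `tag` to every checked result row; already-tagged rows are a no-op."""
    for rid in checked_ids:
        rows[rid].add(tag)

def remove_tag(rows, checked_ids, tag):
    """Detach `tag` from every checked result row; untagged rows are a no-op."""
    for rid in checked_ids:
        rows[rid].discard(tag)

# Three result rows keyed by a made-up row id, each with its tag set.
rows = {1: {"nightly"}, 2: {"nightly", "jeff results"}, 3: set()}
add_tag(rows, [1, 2, 3], "jeff results")   # row 2 already has it -- no error
remove_tag(rows, [1, 3], "jeff results")   # row 3's copy is removed quietly
# rows == {1: {"nightly"}, 2: {"nightly", "jeff results"}, 3: set()}
```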


4. Per Gmail index label listing, it would be nice to see a list of  
tags that exist on a given result row.  It could be as simple as  
adding another show/hide column for the tags on a given result row.   
But it gets a little more complicated because one row may represent  
many different results -- do we show the union of tags for all the  
rollup rows?  Maybe we can use different colors / attributes to  
represent "this tag exists on *some* of the results in this row" vs.  
"this tag exists on *all* of the results in this row"...?


4a. If the tags are listed as a column, they should also (of course)  
be clickable so that if you click on them, you get the entire set of  
results associated with that tag.


4b. For every tag on a rollup row, it would be good to be able to say  
"apply this tag to every result in this rollup row" (i.e., this tag  
had previously only applied to *some* of the results in this rollup  
row).  This could be displayed as a little "+" icon next to the tag  
name, or somesuch.


4c. Similarly, for every tag, it would be good to have a "remove this  
tag from every result in this row".  This could be displayed as a  
little "-" icon next to the tag name, or somesuch.


4d. Care would need to be taken to ensure that users would not  
accidentally click on "+" or "-" icons next to tag names, however.


5. There should also be a simple way to:
   - see all available tags (perhaps including some kind of  
indication of how many results that tag represents)

   - completely delete a tag from the entire database

6. Tags may span multiple phases (install, build, run).  If you click  
on a tag that contains results on all three phases, what should  
happen?  I think it should be context-sensitive:
   - If you are in a summary environment, you get a summary table  
showing all relevant results.
   - If you are in a single phase environment, you see only the  
results in that phase (perhaps with a clickable icon to see the  
entire summary table with all the tag's results).


7. Lots of things can, by default, become tags.  E.g., org name and  
platform name can become default tags.  I.e., results that are  
submitted will automatically have the org name and platform name  
added to the results as tags.
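Continuing the sketch, auto-tagging at submit time would just union the
defaults into whatever the submitter supplied (the org and platform
values below are made up):

```python
def with_default_tags(submitted_tags, org, platform):
    """Sketch: automatically add the org and platform names as tags on submit."""
    return set(submitted_tags) | {org, platform}

tags = with_default_tags({"nightly"}, "cisco", "mvapi-cluster")
# tags == {'nightly', 'cisco', 'mvapi-cluster'}
```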


--
Jeff Squyres
Cisco Systems



[OMPI devel] Better web searching of mail archives

2007-08-24 Thread Jeff Squyres
Thanks to some great integration work from the Indiana University  
sysadmin DongInn Kim, the Open MPI web site now features much better  
web searching for all the Open MPI (and related) mailing lists.


If you visit any of the mailing list archive index pages (e.g.,  
http://www.open-mpi.org/community/lists/users/), you'll see a new set  
of search functionality controls that allow flexible searching on  
specific date ranges, and subject/body/poster strings.  The "search"  
box on the top right of any message page is the same as using the new  
search controls with the default options selected.


We have found that this search capability tends to provide more  
detailed results than the old "search the mail archives via Google"  
method, mainly because the search tool (www.swish-e.org) understands  
that these are archived e-mail messages and displays its results  
appropriately.


We hope you find this new functionality useful.  Enjoy!

--
Jeff Squyres
Cisco Systems



[OMPI devel] Fwd: [MTT users] MTT Database and Reporter Upgrade **Action Required**

2007-08-24 Thread Jeff Squyres
FYI.  The MTT database will be down for a few hours on Monday  
morning.  It'll be replaced with a much mo'better version -- [much]  
faster than it was before.  Details below.



Begin forwarded message:


From: Josh Hursey 
Date: August 24, 2007 1:37:18 PM EDT
To: General user list for the MPI Testing Tool 
Subject: [MTT users] MTT Database and Reporter Upgrade **Action  
Required**
Reply-To: General user list for the MPI Testing Tool <us...@open-mpi.org>


Short Version:
--
The MTT development group is rolling out a newly optimized web frontend
and backend database. As a result, we will be taking down the MTT site
at IU Monday, August 27 from 8 am to Noon US Eastern time.

During this time you will not be able to submit data to the MTT
database. Therefore you need to disable any runs that will report
during this time, or your client will fail with "unable to connect to
server" messages.

This change does not affect the client configurations, so MTT users
do *not* need to update their clients at this time.


Longer Version:
---
The MTT development team has been working diligently on server side
optimizations over the past few months. This work involved major
changes to the database schema, web reporter, and web submit
components of the server.

We want to roll out the new server side optimizations on Monday, Aug.
27. Given the extensive nature of the improvements the MTT server
will need to be taken down for a few hours for this upgrade to take
place. We are planning on taking down the MTT server at 8 am and
we hope to have it back by Noon US Eastern time.

MTT users that would normally submit results during this time range
will need to disable their runs, or they will see server error
messages during this outage.

This upgrade does not require any client changes, so outside of the
down time contributors need not change or upgrade their MTT
installations.

Below are a few rough performance numbers illustrating the difference
between the old and new server versions as seen by the reporter.

Summary report: 24 hours, all orgs
   87 sec - old
    6 sec - new
Summary report: 24 hours, org = 'iu'
   37 sec - old
    4 sec - new
Summary report: Past 3 days, all orgs
  138 sec - old
    9 sec - new
Summary report: Past 3 days, org = 'iu'
   49 sec - old
   11 sec - new
Summary report: Past 2 weeks, all orgs
  863 sec - old
   34 sec - new
Summary report: Past 2 weeks, org = 'iu'
  878 sec - old
   12 sec - new
Summary report: Past 1 month, all orgs
 1395 sec - old
  158 sec - new
Summary report: Past 1 month, org = 'iu'
 1069 sec - old
   39 sec - new
Summary report: (2007-06-18 - 2007-06-19), all orgs
  484 sec - old
    5 sec - new
Summary report: (2007-06-18 - 2007-06-19), org = 'iu'
  479 sec - old
    2 sec - new

___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread George Bosilca


On Aug 24, 2007, at 9:50 AM, Tim Prins wrote:


Again, my main concern is about fault tolerance. There is nothing in
PMI (and nothing in RSL so far) that allows any kind of fault
tolerance [And believe me, re-writing the MPICH mpirun to allow
checkpoint/restart is a hassle].
I am open to any extensions that are needed. Again, the current  
version

is designed as a starting point. Also, I have been talking a lot with
Josh and the current RSL is more than enough to support
checkpoint/restart as currently implemented. I would be interested in
talking about any additions that are needed.


Right, but that's a side effect. The coordinated checkpoint is not  
very intrusive; it only requires a limited set of capabilities, which  
are usually delivered by all RTEs. However, if you look just a little  
bit further, at uncoordinated checkpoint (where only one of the  
processes has to be restarted and joins the others in their old  
"world"), you will notice that the current interface (RSL or PMI)  
will not support this.





Moreover, your approach seems to
open the possibility of having heterogeneous RTEs (in terms of
features), which in my view is definitely the wrong approach.

Do you mean having different RTEs that support different features?
Personally I do not see this as a horrible thing. In fact, we already
deal with this problem, since different systems support different
things. For instance, we support comm_spawn on most systems, but  
not all.


This is again a side effect of the inability of the underlying  
systems to provide the most elementary features we need. But, with  
ORTE at least we have the potential to overcome these limitations.



I do not understand why a user should have to use an RTE which supports
every system ever imagined, and provides every possible fault-tolerant
feature, when all they want is a thin RTE.


We have all the ingredients to make this RTE layer, i.e. loadable  
modules. The approach we proposed a few months ago, to load a component  
only when we know it will be needed, gives us a very slim RTE (once  
applied everywhere it makes sense). The biggest problem I see here is  
that we will start scattering our efforts on multiple things instead  
of working together to make what we have right now the best it can be.


  george.



Tim



   george.

On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:

WHAT: Solicitation of feedback on the possibility of adding a  
runtime

services layer to Open MPI to abstract out the runtime.

WHY: To solidify the interface between OMPI and the runtime
environment,
and to allow the use of different runtime systems, including  
different

versions of ORTE.

WHERE: Addition of a new framework to OMPI, and changes to many of the
files in OMPI to funnel all runtime requests through this framework. Few
changes should be required in OPAL and ORTE.

WHEN: Development has started in tmp/rsl, but is still in its
infancy. We hope
to have a working system in the next month.

TIMEOUT: 8/29/07

--
Short version:

I am working on creating an interface between OMPI and the runtime
system.
This would create an RSL framework in OMPI, from which all runtime
services would be accessed. Attached is a graphic depicting this.

This change would be invasive to the OMPI layer. Few (if any)  
changes

will be required of the ORTE and OPAL layers.

At this point I am soliciting feedback as to whether people are
supportive or not of this change both in general and for v1.3.


Long version:

The current model used in Open MPI assumes that one runtime  
system is
the best for all environments. However, in many environments it  
may be

beneficial to have specialized runtime systems. With our current
system this
is not easy to do.

With this in mind, the idea of creating a 'runtime services  
layer' was

hatched. This would take the form of a framework within OMPI,
through which
all runtime functionality would be accessed. This would allow new or
different runtime systems to be used with Open MPI. Additionally,
with such a
system it would be possible to have multiple versions of open rte
coexisting,
which may facilitate development and testing. Finally, this would
solidify the
interface between OMPI and the runtime system, as well as provide
documentation and side effects of each interface function.

However, such a change would be fairly invasive to the OMPI  
layer, and

needs a buy-in from everyone for it to be possible.

Here is a summary of the changes required for the RSL (at least how
it is
currently envisioned):

1. Add a framework to ompi for the rsl, and a component to support
orte.
2. Change ompi so that it uses the new interface. This involves:
 a. Moving runtime specific code into the orte rsl  
component.

 b. Changing the process names in ompi to an opaque object.
 c. change all references to orte in ompi to be to the rsl.
3. Change the configuration code so that open-rte is only linked
where needed.

Of course, all this would happen 

Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread Brian Barrett

On Aug 24, 2007, at 9:08 AM, George Bosilca wrote:


By heterogeneous RTE I was talking about what will happen once we
have the RSL. Different back-ends will support different features, so
from the user perspective we will not provide a homogeneous execution
environment in all situations. On the other hand, focusing our
efforts on ORTE will guarantee this homogeneity in all cases.


Is this a good thing?  I think no, and we already don't have it.  On  
Cray, we don't use mpirun but yod.  Livermore wants us to use SLURM  
directly instead of our mpirun kludge.  Those are heterogeneous from  
the user perspective.  But they are also what the user expects on those  
platforms.


Brian


Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread Tim Prins

George Bosilca wrote:
Looks like I'm the only one barely excited about this idea. The  
system that you described is well known. It has been around for about  
10 years, and it's called PMI. The interface you have in the tmp  
branch, as well as the description you gave in your email, are more  
than similar to what they sketch in the following two documents:


http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2draft.htm
http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2.htm
Yes, I am well acquainted with these documents, and the PMI did provide 
a lot of inspiration for the RSL.


Now, there is something wrong with reinventing the wheel if there are  
no improvements. And so far I'm unable to see any major  
improvement, either compared with PMI or with what we have today  
(except maybe being able to use PMI inside Open MPI).
This is true. The RSL is designed to handle exactly what we need right 
now. This does not mean that the interface cannot be extended later. The 
current RSL is a starting point.


Again, my main concern is about fault tolerance. There is nothing in  
PMI (and nothing in RSL so far) that allows any kind of fault  
tolerance [And believe me, re-writing the MPICH mpirun to allow  
checkpoint/restart is a hassle].
I am open to any extensions that are needed. Again, the current version 
is designed as a starting point. Also, I have been talking a lot with 
Josh and the current RSL is more than enough to support 
checkpoint/restart as currently implemented. I would be interested in 
talking about any additions that are needed.


Moreover, your approach seems to  
open the possibility of having heterogeneous RTEs (in terms of  
features), which in my view is definitely the wrong approach.
Do you mean having different RTEs that support different features? 
Personally I do not see this as a horrible thing. In fact, we already 
deal with this problem, since different systems support different 
things. For instance, we support comm_spawn on most systems, but not all.


I do not understand why a user should have to use an RTE which supports 
every system ever imagined, and provides every possible fault-tolerant 
feature, when all they want is a thin RTE.


Tim



   george.

On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:


WHAT: Solicitation of feedback on the possibility of adding a runtime
services layer to Open MPI to abstract out the runtime.

WHY: To solidify the interface between OMPI and the runtime  
environment,

and to allow the use of different runtime systems, including different
versions of ORTE.

WHERE: Addition of a new framework to OMPI, and changes to many of the
files in OMPI to funnel all runtime requests through this framework. Few
changes should be required in OPAL and ORTE.

WHEN: Development has started in tmp/rsl, but is still in its  
infancy. We hope

to have a working system in the next month.

TIMEOUT: 8/29/07

--
Short version:

I am working on creating an interface between OMPI and the runtime  
system.
This would create an RSL framework in OMPI, from which all runtime
services would be accessed. Attached is a graphic depicting this.

This change would be invasive to the OMPI layer. Few (if any) changes
will be required of the ORTE and OPAL layers.

At this point I am soliciting feedback as to whether people are
supportive or not of this change both in general and for v1.3.


Long version:

The current model used in Open MPI assumes that one runtime system is
the best for all environments. However, in many environments it may be
beneficial to have specialized runtime systems. With our current  
system this

is not easy to do.

With this in mind, the idea of creating a 'runtime services layer' was
hatched. This would take the form of a framework within OMPI,  
through which

all runtime functionality would be accessed. This would allow new or
different runtime systems to be used with Open MPI. Additionally,  
with such a
system it would be possible to have multiple versions of open rte  
coexisting,
which may facilitate development and testing. Finally, this would  
solidify the

interface between OMPI and the runtime system, as well as provide
documentation and side effects of each interface function.

However, such a change would be fairly invasive to the OMPI layer, and
needs a buy-in from everyone for it to be possible.

Here is a summary of the changes required for the RSL (at least how  
it is

currently envisioned):

1. Add a framework to ompi for the rsl, and a component to support  
orte.

2. Change ompi so that it uses the new interface. This involves:
 a. Moving runtime specific code into the orte rsl component.
 b. Changing the process names in ompi to an opaque object.
 c. change all references to orte in ompi to be to the rsl.
3. Change the configuration code so that open-rte is only linked  
where needed.


Of course, all this would happen on a tmp branch.

The design of the rsl is not solidified. I 

Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread Terry D. Dontje

George Bosilca wrote:

Looks like I'm the only one barely excited about this idea. The  
system that you described is well known. It has been around for about  
10 years, and it's called PMI. The interface you have in the tmp  
branch, as well as the description you gave in your email, are more  
than similar to what they sketch in the following two documents:


http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2draft.htm
http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2.htm

Now, there is something wrong with reinventing the wheel if there are  
no improvements. And so far I'm unable to see any major  
improvement, either compared with PMI or with what we have today  
(except maybe being able to use PMI inside Open MPI).


 


I agree with the first sentence above.  I think this goes along
the line of Ralph's comment of "what are we trying to solve here?"
When this all started about 6 months ago I think the main concern
was finding what interfaces existed between ORTE and OMPI, though
I am not sure how that blossomed into redesigning the interface.
Not saying there isn't a reason to, just that we should step back
and make sure we know why we are doing this.

Again, my main concern is about fault tolerance. There is nothing in  
PMI (and nothing in RSL so far) that allows any kind of fault  
tolerance [And believe me, re-writing the MPICH mpirun to allow  
checkpoint/restart is a hassle]. Moreover, your approach seems to  
open the possibility of having heterogeneous RTEs (in terms of  
features), which in my view is definitely the wrong approach.


 


I am curious about this last paragraph.  Is it your belief that the current
ORTE does lend itself to being extended to incorporate fault tolerance?

Also, by heterogeneous RTE do you mean an RTE running on a
heterogeneous set of platforms?  If so, I would like to understand why
you think that is the "wrong" approach. 


--td


  george.

On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:

 


WHAT: Solicitation of feedback on the possibility of adding a runtime
services layer to Open MPI to abstract out the runtime.

WHY: To solidify the interface between OMPI and the runtime  
environment,

and to allow the use of different runtime systems, including different
versions of ORTE.

WHERE: Addition of a new framework to OMPI, and changes to many of the
files in OMPI to funnel all runtime requests through this framework. Few
changes should be required in OPAL and ORTE.

WHEN: Development has started in tmp/rsl, but is still in its  
infancy. We hope

to have a working system in the next month.

TIMEOUT: 8/29/07

--
Short version:

I am working on creating an interface between OMPI and the runtime  
system.
This would create an RSL framework in OMPI, from which all runtime
services would be accessed. Attached is a graphic depicting this.

This change would be invasive to the OMPI layer. Few (if any) changes
will be required of the ORTE and OPAL layers.

At this point I am soliciting feedback as to whether people are
supportive or not of this change both in general and for v1.3.


Long version:

The current model used in Open MPI assumes that one runtime system is
the best for all environments. However, in many environments it may be
beneficial to have specialized runtime systems. With our current  
system this

is not easy to do.

With this in mind, the idea of creating a 'runtime services layer' was
hatched. This would take the form of a framework within OMPI,  
through which

all runtime functionality would be accessed. This would allow new or
different runtime systems to be used with Open MPI. Additionally,  
with such a
system it would be possible to have multiple versions of open rte  
coexisting,
which may facilitate development and testing. Finally, this would  
solidify the

interface between OMPI and the runtime system, as well as provide
documentation and side effects of each interface function.

However, such a change would be fairly invasive to the OMPI layer, and
needs a buy-in from everyone for it to be possible.

Here is a summary of the changes required for the RSL (at least how  
it is

currently envisioned):

1. Add a framework to ompi for the rsl, and a component to support  
orte.

2. Change ompi so that it uses the new interface. This involves:
a. Moving runtime specific code into the orte rsl component.
b. Changing the process names in ompi to an opaque object.
c. change all references to orte in ompi to be to the rsl.
3. Change the configuration code so that open-rte is only linked  
where needed.


Of course, all this would happen on a tmp branch.

The design of the rsl is not solidified. I have been playing in a  
tmp branch
(located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which  
everyone is

welcome to look at and comment on, but be advised that things here are
subject to change (I don't think it even compiles right now). There  
are

some fairly large open questions on this, 

Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread George Bosilca
Looks like I'm the only one barely excited about this idea. The  
system that you described is well known. It has been around for about  
10 years, and it's called PMI. The interface you have in the tmp  
branch, as well as the description you gave in your email, are more  
than similar to what they sketch in the following two documents:


http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2draft.htm
http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2.htm

Now, there is something wrong with reinventing the wheel if there are  
no improvements. And so far I'm unable to see any major  
improvement, either compared with PMI or with what we have today  
(except maybe being able to use PMI inside Open MPI).


Again, my main concern is about fault tolerance. There is nothing in  
PMI (and nothing in RSL so far) that allows any kind of fault  
tolerance [And believe me, re-writing the MPICH mpirun to allow  
checkpoint/restart is a hassle]. Moreover, your approach seems to  
open the possibility of having heterogeneous RTEs (in terms of  
features), which in my view is definitely the wrong approach.


  george.

On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:


WHAT: Solicitation of feedback on the possibility of adding a runtime
services layer to Open MPI to abstract out the runtime.

WHY: To solidify the interface between OMPI and the runtime  
environment,

and to allow the use of different runtime systems, including different
versions of ORTE.

WHERE: Addition of a new framework to OMPI, and changes to many of the
files in OMPI to funnel all runtime requests through this framework. Few
changes should be required in OPAL and ORTE.

WHEN: Development has started in tmp/rsl, but is still in its  
infancy. We hope

to have a working system in the next month.

TIMEOUT: 8/29/07

--
Short version:

I am working on creating an interface between OMPI and the runtime  
system.
This would create an RSL framework in OMPI, from which all runtime
services would be accessed. Attached is a graphic depicting this.

This change would be invasive to the OMPI layer. Few (if any) changes
will be required of the ORTE and OPAL layers.

At this point I am soliciting feedback as to whether people are
supportive or not of this change both in general and for v1.3.


Long version:

The current model used in Open MPI assumes that one runtime system is
the best for all environments. However, in many environments it may be
beneficial to have specialized runtime systems. With our current  
system this

is not easy to do.

With this in mind, the idea of creating a 'runtime services layer' was
hatched. This would take the form of a framework within OMPI,  
through which

all runtime functionality would be accessed. This would allow new or
different runtime systems to be used with Open MPI. Additionally,  
with such a
system it would be possible to have multiple versions of open rte  
coexisting,
which may facilitate development and testing. Finally, this would  
solidify the

interface between OMPI and the runtime system, as well as provide
documentation and side effects of each interface function.

However, such a change would be fairly invasive to the OMPI layer, and
needs a buy-in from everyone for it to be possible.

Here is a summary of the changes required for the RSL (at least how  
it is

currently envisioned):

1. Add a framework to ompi for the rsl, and a component to support  
orte.

2. Change ompi so that it uses the new interface. This involves:
 a. Moving runtime specific code into the orte rsl component.
 b. Changing the process names in ompi to an opaque object.
 c. change all references to orte in ompi to be to the rsl.
3. Change the configuration code so that open-rte is only linked  
where needed.


Of course, all this would happen on a tmp branch.

The design of the rsl is not solidified. I have been playing in a  
tmp branch
(located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which  
everyone is

welcome to look at and comment on, but be advised that things here are
subject to change (I don't think it even compiles right now). There  
are

some fairly large open questions on this, including:

1. How to handle mpirun (that is, when a user types 'mpirun', do they
always get ORTE, or do they sometimes get a system specific  
runtime). Most
likely mpirun will always use ORTE, and alternative launching  
programs would

be used for other runtimes.
2. Whether there will be any performance implications. My guess is  
not,

but am not quite sure of this yet.

Again, I am interested in people's comments on whether they think  
adding
such abstraction is good or not, and whether it is reasonable to do  
such a

thing for v1.3.

Thanks,

Tim Prins
___

devel-core mailing list
devel-c...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel-core