I think the point is that as a group, we consciously, deliberately, and painfully decided not to support multi-cluster. And as a result, we ripped out a lot of supporting code. Starting down this path again will likely result in a) re-opening all the discussions, b) re-adding a lot of code (or code effectively similar to what was there before). Let's not forget that there were many unsolved problems surrounding multi-cluster last time, too.

It was also pointed out in Ralph's mails that, at least from the descriptions provided, adding the field in orte_node_t does not actually solve the problem that ORNL is trying to solve.

If we, as a group, decide to re-add all this stuff, then a) recognize that we are flip-flopping *again* on this issue, and b) it will take a lot of coding effort to do so. I do think that since this was a group decision last time, it should be a group decision this time, too. If this does turn out to be as large of a sub-project as described, I would be opposed to the development occurring on the trunk; hg trees are perfect for this kind of stuff.

I personally have no customers who are doing cross-cluster kinds of things, so I don't personally care if cross-cluster functionality works its way [back] in. But I recognize that OMPI core members are investigating it. So the points I'm making are procedural; I have no real dog in this fight...


On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:

Ralph,

There is NO need to have this discussion again; it was painful enough last time. From my perspective, I do not understand why you are making so much noise on this one. How a 4-line change in some ALPS-specific files (a Cray system very specific to ORNL) can generate more than 3 A4 pages of email is still beyond my comprehension.

If they want to do multi-cluster, and they do not break anything in ORTE/OMPI, and they do not ask other people to do it for them, why try to stop them?

 george.

On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:

There was a very long drawn-out discussion about this early in 2007. Rather than rehash all that, I'll try to summarize it here. It may get confusing - it helped a whole lot to be in a room with a whiteboard. There were also presentations on the subject - I believe the slides may still be in the docs repository.

Because terminology quickly gets confusing, we adopted a slightly different one for these discussions. We talk about OMPI being a "single cell" system - i.e., jobs executed via mpirun can only span nodes that are reachable by that mpirun. In a typical managed environment, a cell aligns quite well with a "cluster". In an unmanaged environment where the user provides a hostfile, the cell will contain all nodes specified in the hostfile.

We don't filter or abort for non-matching hostnames - if mpirun can launch on that node, then great. What we don't support is asking mpirun to remotely execute another mpirun on the frontend of another cell in order to launch procs on the nodes in -that- cell, nor do we ask mpirun to in any way manage (or even know about) any procs running on a remote cell.

I see what you are saying about the ALPS node name. However, the field you want to add doesn't have anything to do with accept/connect. The orte_node_t object is used solely by mpirun to keep track of the node pool it controls - i.e., the nodes upon which it is launching jobs. Thus, the mpirun on cluster A will have "nidNNNN" entries it got from its allocation, and the mpirun on cluster B will have "nidNNNN" entries it got from its allocation - but the two mpiruns will never exchange that information, nor will the mpirun on cluster A ever have a need to know the node entries for cluster B. Each mpirun launches and manages procs -only- on the nodes in its own allocation.
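
To illustrate the point, a conceptual sketch only -- this is not the actual ORTE data structures or API, and the names are made up:

/* Conceptual sketch only -- not the real orte_node_t or node pool code.
 * Each mpirun builds a private pool of node entries from its own
 * allocation and never exchanges it with any other mpirun. */
#include <stdio.h>

typedef struct {
    const char *name;   /* e.g. "nid00012" as reported by ALPS */
    int slots;          /* slots available for launching procs */
} node_entry_t;

int main(void)
{
    /* mpirun on cluster A sees only its own allocation... */
    node_entry_t pool_a[] = { { "nid00012", 4 }, { "nid00013", 4 } };
    /* ...and mpirun on cluster B sees only its own.  The duplicate
     * "nid00012" never collides, because the two pools are never
     * merged or communicated between the two mpiruns. */
    node_entry_t pool_b[] = { { "nid00012", 4 }, { "nid00014", 4 } };

    printf("A launches on %s; B launches on %s\n",
           pool_a[0].name, pool_b[0].name);
    return 0;
}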

I agree you will have issues when doing the connect/accept modex as the nodenames are exchanged and are no longer unique in your scenario. However, that info stays in the ompi_proc_t - it never gets communicated to the ORTE layer as we couldn't care less down there about the remote procs since they are under the control of a different mpirun. So if you need to add a cluster id field for this purpose, it needs to go in ompi_proc_t - not in the orte structures.

And for that, you probably need to discuss it with the MPI team as changes to ompi_proc_t will likely generate considerable discussion.
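
For concreteness, a rough sketch of where such a field might sit -- illustrative only; the real ompi_proc_t layout differs and the field name here is hypothetical:

/* Rough sketch only -- not the actual ompi_proc_t definition.
 * The idea is that a cluster identifier used to disambiguate
 * connect/accept peers would live at the OMPI layer, next to the
 * per-proc hostname that is exchanged in the modex. */
struct ompi_proc_sketch_t {
    /* ...existing members (list item, process name, convertor, ...) */
    char *proc_hostname;      /* e.g. "nid00012", from the modex */
    char *proc_clustername;   /* hypothetical: disambiguates identical
                                 "nidNNNNN" names across ALPS clusters */
};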

FWIW: this is one reason I warned Galen about the problems in reviving multi-cluster operations again. We used to deal with multi-cells in the process name itself, but all that support has been removed from OMPI.

Hope that helps
Ralph

On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:

I may be opening a can of worms...

But what prevents a user from running across clusters in a "normal OMPI", i.e., non-ALPS, environment? When he puts hosts into his hostfile, does it parse and abort/filter non-matching hostnames?

The problem for ALPS-based systems is that nodes are addressed via NID,PID pairs at the portals level. Thus, these are unique only within a cluster. In point of fact, I could rewrite all of the ALPS support to identify the nodes by "cluster_id".NID. It would be a bit inefficient within a cluster, because we would have to extract the NID from this syntax as we go down to the portals layer. It also would lead to a larger degree of change within the OMPI ALPS code base. However, I can give ALPS-based systems the same feature set as the rest of the world. It is simply more efficient to use an additional pointer in the orte_node_t structure, and it results in a far simpler code structure. This makes it easier to maintain.
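
Roughly what that per-message extraction would look like on the way down to portals -- a sketch only; the helper and the "clustername.NID" format are hypothetical:

/* Sketch only -- hypothetical helper, not existing OMPI/ALPS code.
 * Pulls the numeric NID back out of a combined "clustername.NID"
 * string; this is the extra parsing the combined syntax would add on
 * every trip down to the portals layer. */
#include <stdlib.h>
#include <string.h>

static long nid_from_combined(const char *combined)
{
    const char *dot = strrchr(combined, '.');
    return strtol(dot ? dot + 1 : combined, NULL, 10);
}

/* e.g. nid_from_combined("clusterA.00012") returns 12 */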

The only thing that "this change" really does is to identify the cluster under which the ALPS allocation is made. If you are addressing a node in another cluster (e.g., via accept/connect), the clustername/NID pair is unique for ALPS, just as a hostname on a normal cluster node is unique between clusters. If you do a gethostname() on a normal cluster node, you are going to get mynameNNNNN, or something similar. If you do a gethostname() on an ALPS node, you are going to get nidNNNNN; there is no differentiation between cluster A and cluster B.
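
In other words -- a sketch only, assuming the clustername comes from the ALPS allocation; the function name is made up:

/* Sketch only -- shows why the clustername matters on ALPS.
 * gethostname() on an ALPS compute node returns "nidNNNNN", which is
 * not unique across clusters; prefixing the clustername taken from the
 * allocation restores the uniqueness a normal hostname already has. */
#include <stdio.h>
#include <unistd.h>

static void unique_node_id(const char *clustername, char *buf, size_t len)
{
    char host[256] = "unknown";
    gethostname(host, sizeof(host));                /* e.g. "nid00012"      */
    snprintf(buf, len, "%s/%s", clustername, host); /* "clusterA/nid00012"  */
}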

Perhaps my earlier comment was not accurate. In reality, this change provides the same degree of identification for ALPS nodes as the hostname provides for normal clusters. From your perspective, it is immaterial that it also would allow us to support our limited form of multi-cluster support. In and of itself, however, it only provides the same level of identification as is already done for other cluster nodes.
--
Ken


-----Original Message-----
From: Ralph Castain [mailto:r...@lanl.gov]
Sent: Monday, September 22, 2008 2:33 PM
To: Open MPI Developers
Cc: Matney Sr, Kenneth D.
Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600

The issue isn't with adding a string. The question is whether or not
OMPI is to support one job running across multiple clusters. We made a
conscious decision (after lengthy discussions on OMPI core and ORTE
mailing lists, plus several telecons) to not do so - we require that
the job execute on a single cluster, while allowing connect/accept to
occur between jobs on different clusters.

It is difficult to understand why we need a string (or our old "cell
id") to tell us which cluster we are on if we are only following that operating model. From the commit comment, and from what I know of the
system, the only rationale for adding such a designator is to shift
back to the one-mpirun-spanning-multiple-cluster model.

If we are now going to make that change, then it merits the same level of consideration as the last decision to move away from that model. Making that move involves considerably more than just adding a cluster id string. It may only be a string now, but the next step is inevitably to bring back remote launch, killing jobs on all clusters when one cluster has a problem, etc.

Before we go down this path and re-open Pandora's box, we should at
least agree that is what we intend to do...or agree on what hard
constraints we will place on multi-cluster operations. Frankly, I'm
tired of bouncing back-and-forth on even the most basic design
decisions.

Ralph



On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:

What Ken put in is what is needed for the limited multi-cluster capabilities we need, just one additional string. I don't think there is a need for any discussion of such a small change.

Rich


On 9/22/08 1:32 PM, "Ralph Castain" <r...@lanl.gov> wrote:

We really should discuss that as a group first - there is quite a bit
of code required to actually support multi-clusters that has been
removed.

Our operational model that was agreed to quite a while ago is that mpirun can -only- extend over a single "cell". You can connect/accept multiple mpiruns that are sitting on different cells, but you cannot execute a single mpirun across multiple cells.

Please keep this on your own development branch for now. Bringing it into the trunk will require discussion, as this changes the operating model and has significant code consequences when we look at abnormal terminations, comm_spawn, etc.

Thanks
Ralph

On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:

This check-in was in error - I had not realized that the checkout was from the 1.3 branch, so we will fix this and put these into the trunk (1.4). We are going to bring in some limited multi-cluster support - limited is the operative word.

Rich


On 9/22/08 12:50 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

I notice that Ken Matney (the committer) is not on the devel list; I added him explicitly to the CC line.

Ken: please see below.


On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:

Whoa! We made a decision NOT to support multi-cluster apps in OMPI
over a year ago!

Please remove this from 1.3 - we should discuss if/when this would
even be allowed in the trunk.

Thanks
Ralph

On Sep 22, 2008, at 10:35 AM, mat...@osl.iu.edu wrote:

Author: matney
Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
New Revision: 19600
URL: https://svn.open-mpi.org/trac/ompi/changeset/19600

Log:
Added member to orte_node_t to enable multi-cluster jobs in ALPS
scheduled systems (like Cray XT).

Text files modified:
branches/v1.3/orte/runtime/orte_globals.h |     4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

Modified: branches/v1.3/orte/runtime/orte_globals.h
==============================================================================
--- branches/v1.3/orte/runtime/orte_globals.h (original)
+++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
@@ -222,6 +222,10 @@
     /** Username on this node, if specified */
     char *username;
     char *slot_list;
+    /** Clustername (machine name of cluster) on which this node
+        resides.  ALPS scheduled systems need this to enable
+        multi-cluster support.  */
+    char *clustername;
 } orte_node_t;
 ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);



--
Jeff Squyres
Cisco Systems
