Jeff Squyres wrote:
> I think the point is that as a group, we consciously, deliberately,
> and painfully decided not to support multi-cluster. And as a result,
> we ripped out a lot of supporting code. Starting down this path again
> will likely result in a) re-opening all the discussions, and b) re-adding
> a lot of code (or code effectively similar to what was there before).
> Let's not forget that there were many unsolved problems surrounding
> multi-cluster last time, too.
>
> It was also pointed out in Ralph's mails that, at least from the
> descriptions provided, adding the field in orte_node_t does not
> actually solve the problem that ORNL is trying to solve.
>
> If we, as a group, decide to re-add all this stuff, then a) recognize
> that we are flip-flopping *again* on this issue, and b) it will take a
> lot of coding effort to do so. I do think that since this was a group
> decision last time, it should be a group decision this time, too. If
> this does turn out to be as large a sub-project as described, I would
> be opposed to the development occurring on the trunk; hg trees are
> perfect for this kind of stuff.
>
> I personally have no customers who are doing cross-cluster kinds of
> things, so I don't personally care if cross-cluster functionality
> works its way [back] in. But I recognize that OMPI core members are
> investigating it. So the points I'm making are procedural; I have no
> real dog in this fight...
>
>
I agree with Jeff that this is perfect for an hg tree. I also don't
have a dog in this fight, but I have a cat that would rather stay
comfortably sleeping and not have someone step on its tail :-). In
other words, knock yourself out, but please don't destabilize the
trunk. Of course, that raises the question: what happens when the hg
tree is done and working?
--td
> On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:
>
>> Ralph,
>>
>> There is NO need to have this discussion again, it was painful enough
>> last time. From my perspective, I do not understand why you are making
>> so much noise on this one. How a 4-line change in some ALPS-specific
>> files (Cray systems very specific to ORNL) can generate more than 3 A4
>> pages of emails is still beyond my comprehension.
>>
>> If they want to do multi-cluster, and they do not break anything in
>> ORTE/OMPI, and they do not ask other people to do it for them, why
>> try to stop them?
>>
>> george.
>>
>> On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:
>>
>>> There was a very long, drawn-out discussion about this early in 2007.
>>> Rather than rehash all that, I'll try to summarize it here. It may
>>> get confusing - it helped a whole lot to be in a room with a
>>> whiteboard. There were also presentations on the subject - I believe
>>> the slides may still be in the docs repository.
>>>
>>> Because terminology quickly gets confusing, we adopted a slightly
>>> different one for these discussions. We talk about OMPI being a
>>> "single cell" system - i.e., jobs executed via mpirun can only span
>>> nodes that are reachable by that mpirun. In a typical managed
>>> environment, a cell aligns quite well with a "cluster". In an
>>> unmanaged environment where the user provides a hostfile, the cell
>>> will contain all nodes specified in the hostfile.
>>>
>>> We don't filter or abort for non-matching hostnames - if mpirun can
>>> launch on that node, then great. What we don't support is asking
>>> mpirun to remotely execute another mpirun on the frontend of another
>>> cell in order to launch procs on the nodes in -that- cell, nor do we
>>> ask mpirun to in any way manage (or even know about) any procs
>>> running on a remote cell.
>>>
>>> I see what you are saying about the ALPS node name. However, the
>>> field you want to add doesn't have anything to do with
>>> accept/connect. The orte_node_t object is used solely by mpirun to
>>> keep track of the node pool it controls - i.e., the nodes upon which
>>> it is launching jobs. Thus, the mpirun on cluster A will have
>>> "nidNNNN" entries it got from its allocation, and the mpirun on
>>> cluster B will have "nidNNNN" entries it got from its allocation -
>>> but the two mpiruns will never exchange that information, nor will
>>> the mpirun on cluster A ever have a need to know the node entries
>>> for cluster B. Each mpirun launches and manages procs -only- on the
>>> nodes in its own allocation.
>>>
>>> I agree you will have issues when doing the connect/accept modex, as
>>> the nodenames are exchanged and are no longer unique in your
>>> scenario. However, that info stays in the ompi_proc_t - it never
>>> gets communicated to the ORTE layer, as we couldn't care less down
>>> there about the remote procs since they are under the control of a
>>> different mpirun. So if you need to add a cluster id field for this
>>> purpose, it needs to go in ompi_proc_t - not in the orte structures.
>>>
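[A rough sketch of the separation Ralph describes - keeping the cluster id at the MPI layer rather than in ORTE. The struct and field names here are simplified stand-ins, not the actual ompi_proc_t definition:]

```c
#include <string.h>

/* Simplified stand-in for per-peer info kept at the MPI layer.  Peer
 * identity learned during the connect/accept modex stays here; ORTE
 * never needs to see it, since remote procs belong to another mpirun. */
typedef struct {
    const char *proc_hostname;   /* e.g. "nid00042" - not unique across cells */
    const char *proc_clusterid;  /* hypothetical field disambiguating the cell */
} my_proc_t;

/* Two peers share a node only if hostname AND cluster id both match. */
static int same_node(const my_proc_t *a, const my_proc_t *b)
{
    return strcmp(a->proc_hostname, b->proc_hostname) == 0 &&
           strcmp(a->proc_clusterid, b->proc_clusterid) == 0;
}
```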
>>> And for that, you probably need to discuss it with the MPI team, as
>>> changes to ompi_proc_t will likely generate considerable discussion.
>>>
>>> FWIW: this is one reason I warned Galen about the problems in
>>> reviving multi-cluster operations again. We used to deal with
>>> multi-cells in the process name itself, but all that support has
>>> been removed from OMPI.
>>>
>>> Hope that helps
>>> Ralph
>>>
>>> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>>>
>>>> I may be opening a can of worms...
>>>>
>>>> But, what prevents a user from running across clusters in a "normal
>>>> OMPI", i.e., non-ALPS environment? When he puts hosts into his
>>>> hostfile, does it parse and abort/filter non-matching hostnames? The
>>>> problem for ALPS-based systems is that nodes are addressed via NID,PID
>>>> pairs at the Portals level. Thus, these are unique only within a
>>>> cluster. In point of fact, I could rewrite all of the ALPS support to
>>>> identify the nodes by "cluster_id".NID. It would be a bit inefficient
>>>> within a cluster because we would have to extract the NID from this
>>>> syntax as we go down to the Portals layer. It also would lead to a
>>>> larger degree of change within the OMPI ALPS code base. However, I can
>>>> give ALPS-based systems the same feature set as the rest of the world.
>>>> It just is more efficient to use an additional pointer in the
>>>> orte_node_t structure, and it results in a far simpler code structure.
>>>> This makes it easier to maintain.
>>>>
>>>> The only thing that "this change" really does is to identify the
>>>> cluster under which the ALPS allocation is made. If you are
>>>> addressing a node in another cluster (e.g., via accept/connect), the
>>>> clustername/NID pair is unique for ALPS, just as a hostname on a
>>>> conventional cluster node is unique between clusters. If you do a
>>>> gethostname() on a normal cluster node, you are going to get
>>>> mynameNNNNN, or something similar. If you do a gethostname() on an
>>>> ALPS node, you are going to get nidNNNNN; there is no differentiation
>>>> between cluster A and cluster B.
>>>>
>>>> Perhaps my earlier comment was not accurate. In reality, it provides
>>>> the same degree of identification for ALPS nodes as hostname provides
>>>> for normal clusters. From your perspective, it is immaterial that it
>>>> also would allow us to support our limited form of multi-cluster
>>>> support. However, in and of itself, it only provides the same level
>>>> of identification as is done for other cluster nodes.
>>>> --
>>>> Ken
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ralph Castain [mailto:r...@lanl.gov]
>>>> Sent: Monday, September 22, 2008 2:33 PM
>>>> To: Open MPI Developers
>>>> Cc: Matney Sr, Kenneth D.
>>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>>>
>>>> The issue isn't with adding a string. The question is whether or not
>>>> OMPI is to support one job running across multiple clusters. We made
>>>> a conscious decision (after lengthy discussions on the OMPI core and
>>>> ORTE mailing lists, plus several telecons) not to do so - we require
>>>> that the job execute on a single cluster, while allowing
>>>> connect/accept to occur between jobs on different clusters.
>>>>
>>>> It is difficult to understand why we need a string (or our old "cell
>>>> id") to tell us which cluster we are on if we are only following that
>>>> operating model. From the commit comment, and from what I know of the
>>>> system, the only rationale for adding such a designator is to shift
>>>> back to the one-mpirun-spanning-multiple-clusters model.
>>>>
>>>> If we are now going to make that change, then it merits a similar
>>>> level of consideration as the last decision to move away from that
>>>> model. Making that move involves considerably more than just adding
>>>> a cluster id string. You may think that now, but the next step is
>>>> inevitably to bring back remote launch, killing jobs on all clusters
>>>> when one cluster has a problem, etc.
>>>>
>>>> Before we go down this path and re-open Pandora's box, we should at
>>>> least agree that is what we intend to do... or agree on what hard
>>>> constraints we will place on multi-cluster operations. Frankly, I'm
>>>> tired of bouncing back and forth on even the most basic design
>>>> decisions.
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>>>
>>>>> What Ken put in is what is needed for the limited multi-cluster
>>>>> capabilities we need, just one additional string. I don't think
>>>>> there is a need for any discussion of such a small change.
>>>>>
>>>>> Rich
>>>>>
>>>>>
>>>>> On 9/22/08 1:32 PM, "Ralph Castain" <r...@lanl.gov> wrote:
>>>>>
>>>>>> We really should discuss that as a group first - there is quite a
>>>>>> bit of code required to actually support multi-clusters that has
>>>>>> been removed.
>>>>>>
>>>>>> Our operational model that was agreed to quite a while ago is
>>>>>> that mpirun can -only- extend over a single "cell". You can
>>>>>> connect/accept multiple mpiruns that are sitting on different
>>>>>> cells, but you cannot execute a single mpirun across multiple
>>>>>> cells.
>>>>>>
>>>>>> Please keep this on your own development branch for now. Bringing
>>>>>> it into the trunk will require discussion, as this changes the
>>>>>> operating model and has significant code consequences when we look
>>>>>> at abnormal terminations, comm_spawn, etc.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>>>
>>>>>>> This check-in was in error - I had not realized that the checkout
>>>>>>> was from the 1.3 branch, so we will fix this and put these into
>>>>>>> the trunk (1.4). We are going to bring in some limited
>>>>>>> multi-cluster support - limited is the operative word.
>>>>>>>
>>>>>>> Rich
>>>>>>>
>>>>>>>
>>>>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>>>>
>>>>>>>> I notice that Ken Matney (the committer) is not on the devel
>>>>>>>> list; I
>>>>>>>> added him explicitly to the CC line.
>>>>>>>>
>>>>>>>> Ken: please see below.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in
>>>>>>>>> OMPI over a year ago!
>>>>>>>>>
>>>>>>>>> Please remove this from 1.3 - we should discuss if/when this
>>>>>>>>> would even be allowed in the trunk.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> On Sep 22, 2008, at 10:35 AM, mat...@osl.iu.edu wrote:
>>>>>>>>>
>>>>>>>>>> Author: matney
>>>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>> New Revision: 19600
>>>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>>>>
>>>>>>>>>> Log:
>>>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in
>>>>>>>>>> ALPS-scheduled systems (like Cray XT).
>>>>>>>>>>
>>>>>>>>>> Text files modified:
>>>>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>>>
>>>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>>>> ==============================================================================
>>>>>>>>>>
>>>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>>>> /** Username on this node, if specified */
>>>>>>>>>> char *username;
>>>>>>>>>> char *slot_list;
>>>>>>>>>> +    /** Clustername (machine name of cluster) on which this
>>>>>>>>>> +        node resides. ALPS scheduled systems need this to
>>>>>>>>>> +        enable multi-cluster support. */
>>>>>>>>>> +    char *clustername;
>>>>>>>>>> } orte_node_t;
>>>>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> svn mailing list
>>>>>>>>>> s...@open-mpi.org
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel