Jeff Squyres wrote:
> I think the point is that as a group, we consciously, deliberately,
> and painfully decided not to support multi-cluster. And as a result,
> we ripped out a lot of supporting code. Starting down this path again
> will likely result in a) re-opening all the discussions, and b) re-adding
> a lot of code (or code effectively similar to what was there before).
> Let's not forget that there were many unsolved problems surrounding
> multi-cluster last time, too.
>
> It was also pointed out in Ralph's mails that, at least from the
> descriptions provided, adding the field in orte_node_t does not
> actually solve the problem that ORNL is trying to solve.
>
> If we, as a group, decide to re-add all this stuff, then a) recognize
> that we are flip-flopping *again* on this issue, and b) it will take a
> lot of coding effort to do so. I do think that since this was a group
> decision last time, it should be a group decision this time, too. If
> this does turn out to be as large a sub-project as described, I would
> be opposed to the development occurring on the trunk; hg trees are
> perfect for this kind of stuff.
>
> I personally have no customers who are doing cross-cluster kinds of
> things, so I don't personally care if cross-cluster functionality
> works its way [back] in. But I recognize that OMPI core members are
> investigating it. So the points I'm making are procedural; I have no
> real dog in this fight...
>
>
I agree with Jeff that this is perfect for an hg tree. I also don't
have a dog in this fight, but I have a cat that would rather stay
comfortably sleeping and not have someone step on its tail :-). In
other words, knock yourself out, but please don't destabilize the
trunk. Of course, that raises the question: what happens when the hg
tree is done and working?
--td
> On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:
>
>> Ralph,
>>
>> There is NO need to have this discussion again, it was painful enough
>> last time. From my perspective, I do not understand why you are making
>> so much noise on this one. How a 4-line change in some ALPS-specific
>> files (Cray systems very specific to ORNL) can generate more than 3 A4
>> pages of emails is still beyond my comprehension.
>>
>> If they want to do multi-cluster, and they do not break anything in
>> ORTE/OMPI, and they do not ask other people to do it for them, why
>> try to stop them?
>>
>> george.
>>
>> On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:
>>
>>> There was a very long, drawn-out discussion about this early in 2007.
>>> Rather than rehash all that, I'll try to summarize it here. It may
>>> get confusing - it helped a whole lot to be in a room with a
>>> whiteboard. There were also presentations on the subject - I believe
>>> the slides may still be in the docs repository.
>>>
>>> Because terminology quickly gets confusing, we adopted a slightly
>>> different one for these discussions. We talk about OMPI being a
>>> "single cell" system - i.e., jobs executed via mpirun can only span
>>> nodes that are reachable by that mpirun. In a typical managed
>>> environment, a cell aligns quite well with a "cluster". In an
>>> unmanaged environment where the user provides a hostfile, the cell
>>> will contain all nodes specified in the hostfile.
>>>
>>> We don't filter or abort for non-matching hostnames - if mpirun can
>>> launch on that node, then great. What we don't support is asking
>>> mpirun to remotely execute another mpirun on the frontend of another
>>> cell in order to launch procs on the nodes in -that- cell, nor do we
>>> ask mpirun to in any way manage (or even know about) any procs
>>> running on a remote cell.
>>>
>>> I see what you are saying about the ALPS node name. However, the
>>> field you want to add doesn't have anything to do with
>>> accept/connect. The orte_node_t object is used solely by mpirun to
>>> keep track of the node pool it controls - i.e., the nodes upon which
>>> it is launching jobs. Thus, the mpirun on cluster A will have
>>> "nidNNNN" entries it got from its allocation, and the mpirun on
>>> cluster B will have "nidNNNN" entries it got from its allocation -
>>> but the two mpiruns will never exchange that information, nor will
>>> the mpirun on cluster A ever have a need to know the node entries
>>> for cluster B. Each mpirun launches and manages procs -only- on the
>>> nodes in its own allocation.
>>>
>>> I agree you will have issues when doing the connect/accept modex, as
>>> the nodenames are exchanged and are no longer unique in your
>>> scenario. However, that info stays in the ompi_proc_t - it never
>>> gets communicated to the ORTE layer, as we couldn't care less down
>>> there about the remote procs since they are under the control of a
>>> different mpirun. So if you need to add a cluster id field for this
>>> purpose, it needs to go in ompi_proc_t - not in the orte structures.
>>>
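[A rough sketch of the separation Ralph describes - keeping the cluster id at the MPI layer rather than in ORTE. The struct and field names here are simplified stand-ins, not the actual ompi_proc_t definition:]

```c
#include <string.h>

/* Simplified stand-in for per-peer info kept at the MPI layer.  Peer
 * identity learned during the connect/accept modex stays here; ORTE
 * never needs to see it, since remote procs belong to another mpirun. */
typedef struct {
    const char *proc_hostname;   /* e.g. "nid00042" - not unique across cells */
    const char *proc_clusterid;  /* hypothetical field disambiguating the cell */
} my_proc_t;

/* Two peers share a node only if hostname AND cluster id both match. */
static int same_node(const my_proc_t *a, const my_proc_t *b)
{
    return strcmp(a->proc_hostname, b->proc_hostname) == 0 &&
           strcmp(a->proc_clusterid, b->proc_clusterid) == 0;
}
```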
>>> And for that, you probably need to discuss it with the MPI team, as
>>> changes to ompi_proc_t will likely generate considerable discussion.
>>>
>>> FWIW: this is one reason I warned Galen about the problems in
>>> reviving multi-cluster operations again. We used to deal with
>>> multi-cells in the process name itself, but all that support has
>>> been removed from OMPI.
>>>
>>> Hope that helps
>>> Ralph
>>>
>>> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>>>
>>>> I may be opening a can of worms...
>>>>
>>>> But, what prevents a user from running across clusters in a "normal
>>>> OMPI", i.e., non-ALPS environment? When he puts hosts into his
>>>> hostfile, does it parse and abort/filter non-matching hostnames? The
>>>> problem for ALPS-based systems is that nodes are addressed via NID,PID
>>>> pairs at the Portals level. Thus, these are unique only within a
>>>> cluster. In point of fact, I could rewrite all of the ALPS support to
>>>> identify the nodes by "cluster_id".NID. It would be a bit inefficient
>>>> within a cluster because we would have to extract the NID from this
>>>> syntax as we go down to the Portals layer. It also would lead to a
>>>> larger degree of change within the OMPI ALPS code base. However, I can
>>>> give ALPS-based systems the same feature set as the rest of the world.
>>>> It just is more efficient to use an additional pointer in the
>>>> orte_node_t structure, and it results in a far simpler code structure.
>>>> This makes it easier to maintain.
>>>>
>>>> The only thing that "this change" really does is to identify the
>>>> cluster under which the ALPS allocation is made. If you are
>>>> addressing a node in another cluster (e.g., via accept/connect), the
>>>> clustername/NID pair is unique for ALPS, just as a hostname on a
>>>> conventional cluster node is unique between clusters. If you do a
>>>> gethostname() on a normal cluster node, you are going to get
>>>> mynameNNNNN, or something similar. If you do a gethostname() on an
>>>> ALPS node, you are going to get nidNNNNN; there is no differentiation
>>>> between cluster A and cluster B.
>>>>
>>>> Perhaps my earlier comment was not accurate. In reality, it provides
>>>> the same degree of identification for ALPS nodes as hostname provides
>>>> for normal clusters. From your perspective, it is immaterial that it
>>>> also would allow us to support our limited form of multi-cluster
>>>> support. However, in and of itself, it only provides the same level
>>>> of identification as is done for other cluster nodes.
>>>> --
>>>> Ken
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ralph Castain [mailto:r...@lanl.gov]
>>>> Sent: Monday, September 22, 2008 2:33 PM
>>>> To: Open MPI Developers
>>>> Cc: Matney Sr, Kenneth D.
>>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>>>
>>>> The issue isn't with adding a string. The question is whether or not
>>>> OMPI is to support one job running across multiple clusters. We made
>>>> a conscious decision (after lengthy discussions on the OMPI core and
>>>> ORTE mailing lists, plus several telecons) not to do so - we require
>>>> that the job execute on a single cluster, while allowing
>>>> connect/accept to occur between jobs on different clusters.
>>>>
>>>> It is difficult to understand why we need a string (or our old "cell
>>>> id") to tell us which cluster we are on if we are only following that
>>>> operating model. From the commit comment, and from what I know of the
>>>> system, the only rationale for adding such a designator is to shift
>>>> back to the one-mpirun-spanning-multiple-clusters model.
>>>>
>>>> If we are now going to make that change, then it merits a similar
>>>> level of consideration as the last decision to move away from that
>>>> model. Making that move involves considerably more than just adding
>>>> a cluster id string. You may think that now, but the next step is
>>>> inevitably to bring back remote launch, killing jobs on all clusters
>>>> when one cluster has a problem, etc.
>>>>
>>>> Before we go down this path and re-open Pandora's box, we should at
>>>> least agree that is what we intend to do... or agree on what hard
>>>> constraints we will place on multi-cluster operations. Frankly, I'm
>>>> tired of bouncing back and forth on even the most basic design
>>>> decisions.
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>>>
>>>>> What Ken put in is what is needed for the limited multi-cluster
>>>>> capabilities we need, just one additional string. I don't think
>>>>> there is a need for any discussion of such a small change.
>>>>>
>>>>> Rich
>>>>>
>>>>>
>>>>> On 9/22/08 1:32 PM, "Ralph Castain" <r...@lanl.gov> wrote:
>>>>>
>>>>>> We really should discuss that as a group first - there is quite a
>>>>>> bit of code required to actually support multi-clusters that has
>>>>>> been removed.
>>>>>>
>>>>>> Our operational model that was agreed to quite a while ago is
>>>>>> that mpirun can -only- extend over a single "cell". You can
>>>>>> connect/accept multiple mpiruns that are sitting on different
>>>>>> cells, but you cannot execute a single mpirun across multiple
>>>>>> cells.
>>>>>>
>>>>>> Please keep this on your own development branch for now. Bringing
>>>>>> it into the trunk will require discussion, as this changes the
>>>>>> operating model and has significant code consequences when we look
>>>>>> at abnormal terminations, comm_spawn, etc.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>>>
>>>>>>> This check-in was in error - I had not realized that the checkout
>>>>>>> was from the 1.3 branch, so we will fix this and put these into
>>>>>>> the trunk (1.4). We are going to bring in some limited
>>>>>>> multi-cluster support - limited is the operative word.
>>>>>>>
>>>>>>> Rich
>>>>>>>
>>>>>>>
>>>>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>>>>
>>>>>>>> I notice that Ken Matney (the committer) is not on the devel
>>>>>>>> list; I
>>>>>>>> added him explicitly to the CC line.
>>>>>>>>
>>>>>>>> Ken: please see below.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in
>>>>>>>>> OMPI over a year ago!
>>>>>>>>>
>>>>>>>>> Please remove this from 1.3 - we should discuss if/when this
>>>>>>>>> would even be allowed in the trunk.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>> On Sep 22, 2008, at 10:35 AM, mat...@osl.iu.edu wrote:
>>>>>>>>>
>>>>>>>>>> Author: matney
>>>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>> New Revision: 19600
>>>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>>>>
>>>>>>>>>> Log:
>>>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in
>>>>>>>>>> ALPS-scheduled systems (like Cray XT).
>>>>>>>>>>
>>>>>>>>>> Text files modified:
>>>>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>>>
>>>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>>>> ==============================================================================
>>>>>>>>>>
>>>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>>>> /** Username on this node, if specified */
>>>>>>>>>> char *username;
>>>>>>>>>> char *slot_list;
>>>>>>>>>> +    /** Clustername (machine name of cluster) on which this
>>>>>>>>>> +        node resides. ALPS scheduled systems need this to
>>>>>>>>>> +        enable multi-cluster support. */
>>>>>>>>>> +    char *clustername;
>>>>>>>>>> } orte_node_t;
>>>>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> svn mailing list
>>>>>>>>>> s...@open-mpi.org
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel