Let me make the point that adding a data structure is far less destabilizing to the tree than the routine day-to-day changes that go on in the tree.
Rich


On 9/23/08 6:24 AM, "Terry D. Dontje" <terry.don...@sun.com> wrote:

> Jeff Squyres wrote:
>> I think the point is that as a group, we consciously, deliberately, and painfully decided not to support multi-cluster. And as a result, we ripped out a lot of supporting code. Starting down this path again will likely result in a) re-opening all the discussions, b) re-adding a lot of code (or code effectively similar to what was there before). Let's not forget that there were many unsolved problems surrounding multi-cluster last time, too.
>>
>> It was also pointed out in Ralph's mails that, at least from the descriptions provided, adding the field in orte_node_t does not actually solve the problem that ORNL is trying to solve.
>>
>> If we, as a group, decide to re-add all this stuff, then a) recognize that we are flip-flopping *again* on this issue, and b) it will take a lot of coding effort to do so. I do think that since this was a group decision last time, it should be a group decision this time, too. If this does turn out to be as large a sub-project as described, I would be opposed to the development occurring on the trunk; hg trees are perfect for this kind of stuff.
>>
>> I personally have no customers who are doing cross-cluster kinds of things, so I don't personally care if cross-cluster functionality works its way [back] in. But I recognize that OMPI core members are investigating it. So the points I'm making are procedural; I have no real dog in this fight...
>>
> I agree with Jeff that this is perfect for an hg tree. Though I also don't have a dog in this fight, I have a cat that would rather stay comfortably sleeping and not have someone step on its tail :-). In other words, knock yourself out, but please don't destabilize the trunk. Of course, that begs the question of what happens when the hg tree is done and working?
>
> --td
>
>> On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:
>>
>>> Ralph,
>>>
>>> There is NO need to have this discussion again; it was painful enough last time. From my perspective, I do not understand why you are making so much noise on this one. How a 4-line change in some ALPS-specific files (Cray systems very specific to ORNL) can generate more than 3 A4 pages of email is still something beyond my comprehension.
>>>
>>> If they want to do multi-cluster, and they do not break anything in ORTE/OMPI, and they do not ask other people to do it for them, why try to stop them?
>>>
>>>   george.
>>>
>>> On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:
>>>
>>>> There was a very long, drawn-out discussion about this early in 2007. Rather than rehash all that, I'll try to summarize it here. It may get confusing - it helped a whole lot to be in a room with a whiteboard. There were also presentations on the subject - I believe the slides may still be in the docs repository.
>>>>
>>>> Because terminology quickly gets confusing, we adopted a slightly different one for these discussions. We talk about OMPI being a "single cell" system - i.e., jobs executed via mpirun can only span nodes that are reachable by that mpirun. In a typical managed environment, a cell aligns quite well with a "cluster".
>>>> In an unmanaged environment where the user provides a hostfile, the cell will contain all nodes specified in the hostfile.
>>>>
>>>> We don't filter or abort for non-matching hostnames - if mpirun can launch on that node, then great. What we don't support is asking mpirun to remotely execute another mpirun on the frontend of another cell in order to launch procs on the nodes in -that- cell, nor do we ask mpirun to in any way manage (or even know about) any procs running on a remote cell.
>>>>
>>>> I see what you are saying about the ALPS node name. However, the field you want to add doesn't have anything to do with accept/connect. The orte_node_t object is used solely by mpirun to keep track of the node pool it controls - i.e., the nodes upon which it is launching jobs. Thus, the mpirun on cluster A will have "nidNNNN" entries it got from its allocation, and the mpirun on cluster B will have "nidNNNN" entries it got from its allocation - but the two mpiruns will never exchange that information, nor will the mpirun on cluster A ever have a need to know the node entries for cluster B. Each mpirun launches and manages procs -only- on the nodes in its own allocation.
>>>>
>>>> I agree you will have issues when doing the connect/accept modex, as the nodenames are exchanged and are no longer unique in your scenario. However, that info stays in the ompi_proc_t - it never gets communicated to the ORTE layer, as we couldn't care less down there about the remote procs since they are under the control of a different mpirun. So if you need to add a cluster id field for this purpose, it needs to go in ompi_proc_t - not in the orte structures.
>>>>
>>>> And for that, you probably need to discuss it with the MPI team, as changes to ompi_proc_t will likely generate considerable discussion.
>>>>
>>>> FWIW: this is one reason I warned Galen about the problems in reviving multi-cluster operations again. We used to deal with multi-cells in the process name itself, but all that support has been removed from OMPI.
>>>>
>>>> Hope that helps
>>>> Ralph
>>>>
>>>> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>>>>
>>>>> I may be opening a can of worms...
>>>>>
>>>>> But what prevents a user from running across clusters in a "normal OMPI", i.e., non-ALPS, environment? When he puts hosts into his hostfile, does it parse and abort/filter non-matching hostnames? The problem for ALPS-based systems is that nodes are addressed via NID,PID pairs at the portals level. Thus, these are unique only within a cluster. In point of fact, I could rewrite all of the ALPS support to identify the nodes by "cluster_id".NID. It would be a bit inefficient within a cluster because we would have to extract the NID from this syntax as we go down to the portals layer. It also would lead to a larger degree of change within the OMPI ALPS code base. However, I can give ALPS-based systems the same feature set as the rest of the world. It just is more efficient to use an additional pointer in the orte_node_t structure, and it results in a far simpler code structure.
>>>>> This makes it easier to maintain.
>>>>>
>>>>> The only thing that "this change" really does is to identify the cluster under which the ALPS allocation is made. If you are addressing a node in another cluster (e.g., via accept/connect), the clustername/NID pair is unique for ALPS, just as a hostname on a cluster node is unique between clusters. If you do a gethostname() on a normal cluster node, you are going to get mynameNNNNN, or something similar. If you do a gethostname() on an ALPS node, you are going to get nidNNNNN; there is no differentiation between cluster A and cluster B.
>>>>>
>>>>> Perhaps my earlier comment was not accurate. In reality, it provides the same degree of identification for ALPS nodes as hostname provides for normal clusters. From your perspective, it is immaterial that it also would allow us to support our limited form of multi-cluster support. However, in and of itself, it only provides the same level of identification as is done for other cluster nodes.
>>>>> --
>>>>> Ken
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ralph Castain [mailto:r...@lanl.gov]
>>>>> Sent: Monday, September 22, 2008 2:33 PM
>>>>> To: Open MPI Developers
>>>>> Cc: Matney Sr, Kenneth D.
>>>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>>>>
>>>>> The issue isn't with adding a string. The question is whether or not OMPI is to support one job running across multiple clusters. We made a conscious decision (after lengthy discussions on the OMPI core and ORTE mailing lists, plus several telecons) not to do so - we require that the job execute on a single cluster, while allowing connect/accept to occur between jobs on different clusters.
>>>>>
>>>>> It is difficult to understand why we need a string (or our old "cell id") to tell us which cluster we are on if we are only following that operating model. From the commit comment, and from what I know of the system, the only rationale for adding such a designator is to shift back to the one-mpirun-spanning-multiple-cluster model.
>>>>>
>>>>> If we are now going to make that change, then it merits a similar level of consideration as the last decision to move away from that model. Making that move involves considerably more than just adding a cluster id string. You may think that now, but the next step is inevitably to bring back remote launch, killing jobs on all clusters when one cluster has a problem, etc.
>>>>>
>>>>> Before we go down this path and re-open Pandora's box, we should at least agree that is what we intend to do... or agree on what hard constraints we will place on multi-cluster operations. Frankly, I'm tired of bouncing back-and-forth on even the most basic design decisions.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>>>>
>>>>>> What Ken put in is what is needed for the limited multi-cluster capabilities we need, just one additional string.
>>>>>> I don't think there is a need for any discussion of such a small change.
>>>>>>
>>>>>> Rich
>>>>>>
>>>>>>
>>>>>> On 9/22/08 1:32 PM, "Ralph Castain" <r...@lanl.gov> wrote:
>>>>>>
>>>>>>> We really should discuss that as a group first - there is quite a bit of code required to actually support multi-clusters that has been removed.
>>>>>>>
>>>>>>> Our operational model, agreed to quite a while ago, is that mpirun can -only- extend over a single "cell". You can connect/accept multiple mpiruns that are sitting on different cells, but you cannot execute a single mpirun across multiple cells.
>>>>>>>
>>>>>>> Please keep this on your own development branch for now. Bringing it into the trunk will require discussion, as this changes the operating model and has significant code consequences when we look at abnormal terminations, comm_spawn, etc.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>>>>
>>>>>>>> This check-in was in error - I had not realized that the checkout was from the 1.3 branch, so we will fix this and put these into the trunk (1.4). We are going to bring in some limited multi-cluster support - limited is the operative word.
>>>>>>>>
>>>>>>>> Rich
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>>> I notice that Ken Matney (the committer) is not on the devel list; I added him explicitly to the CC line.
>>>>>>>>>
>>>>>>>>> Ken: please see below.
>>>>>>>>>
>>>>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in OMPI over a year ago!
>>>>>>>>>>
>>>>>>>>>> Please remove this from 1.3 - we should discuss if/when this would even be allowed in the trunk.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>> On Sep 22, 2008, at 10:35 AM, mat...@osl.iu.edu wrote:
>>>>>>>>>>
>>>>>>>>>>> Author: matney
>>>>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>>> New Revision: 19600
>>>>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>>>>>
>>>>>>>>>>> Log:
>>>>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in ALPS scheduled systems (like Cray XT).
>>>>>>>>>>>
>>>>>>>>>>> Text files modified:
>>>>>>>>>>>    branches/v1.3/orte/runtime/orte_globals.h |     4 ++++
>>>>>>>>>>>    1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>>>>> =================================================================
>>>>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>>>>>      /** Username on this node, if specified */
>>>>>>>>>>>      char *username;
>>>>>>>>>>>      char *slot_list;
>>>>>>>>>>> +    /** Clustername (machine name of cluster) on which this node
>>>>>>>>>>> +        resides. ALPS scheduled systems need this to enable
>>>>>>>>>>> +        multi-cluster support. */
>>>>>>>>>>> +    char *clustername;
>>>>>>>>>>> } orte_node_t;
>>>>>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
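
To make the change under discussion concrete: below is a minimal, self-contained sketch of a node structure carrying the clustername member from r19600, reduced to the fields visible in the diff above (the real orte_node_t contains many more members). The helper function and the example names ("jaguar", "nid00042") are hypothetical illustrations of Ken's point that a bare nidNNNNN hostname is ambiguous between ALPS clusters; they are not part of the ORTE code base.

    /* Sketch only: abbreviated stand-in for orte_node_t, limited to the
     * members shown in the r19600 diff. Not the real ORTE declaration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *name;         /* what gethostname() reports, e.g. "nid00042" */
        char *username;     /* username on this node, if specified */
        char *slot_list;
        char *clustername;  /* member added in r19600: machine name of the
                               cluster on which this node resides */
    } sketch_node_t;

    /* Hypothetical helper: build a label that stays unique across clusters,
     * since "nid00042" by itself can exist on both cluster A and cluster B. */
    static char *sketch_node_global_name(const sketch_node_t *node)
    {
        size_t len = strlen(node->clustername) + strlen(node->name) + 2;
        char *label = malloc(len);
        if (NULL != label) {
            snprintf(label, len, "%s.%s", node->clustername, node->name);
        }
        return label;   /* caller frees; e.g. "jaguar.nid00042" */
    }

Under the single-cell model Ralph describes, such a label would only matter when node names cross an mpirun boundary, e.g. during connect/accept between jobs on different clusters.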
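
Ken's rejected alternative - encoding the cluster into the node name itself as "cluster_id".NID and stripping the NID back out on the way down to the portals layer - would look roughly like the hypothetical parser below; that per-lookup string handling is the inefficiency he is trading away by carrying a separate pointer in the node structure instead. Function and example names are made up for illustration.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch: split "xt4a.1234" into a cluster id ("xt4a")
     * and a NID (1234). Returns 0 on success, -1 on malformed input. */
    static int sketch_parse_cluster_nid(const char *name,
                                        char **cluster, uint32_t *nid)
    {
        const char *dot = strrchr(name, '.');
        if (NULL == dot || '\0' == dot[1]) {
            return -1;
        }
        size_t len = (size_t)(dot - name);
        *cluster = malloc(len + 1);
        if (NULL == *cluster) {
            return -1;
        }
        memcpy(*cluster, name, len);
        (*cluster)[len] = '\0';
        *nid = (uint32_t)strtoul(dot + 1, NULL, 10);
        return 0;
    }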
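
Ralph's counter-proposal is that, for the connect/accept case, a cluster identifier belongs with the MPI-level process information exchanged in the modex rather than in the ORTE node pool. A simplified stand-in is sketched below; it is not the real ompi_proc_t, and the cluster_id field is purely hypothetical - this is the kind of change he says would need discussion with the MPI team.

    #include <stdbool.h>
    #include <string.h>

    /* Simplified, hypothetical stand-in for an MPI-level proc descriptor. */
    typedef struct {
        char *hostname;    /* "nid00042" - not unique across ALPS clusters */
        char *cluster_id;  /* hypothetical: which cell/cluster owns this proc */
    } sketch_proc_t;

    /* Two remote procs can only be treated as co-located if both the
     * cluster id and the hostname match; the hostname alone is ambiguous. */
    static bool sketch_same_node(const sketch_proc_t *a, const sketch_proc_t *b)
    {
        return 0 == strcmp(a->cluster_id, b->cluster_id) &&
               0 == strcmp(a->hostname, b->hostname);
    }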