Whilst I can see these changes being good in the general case (most clusters are designed with very smart NICs and painfully dumb switches, because that produces the best latencies for many topologies), I would suggest that we can do better on smarter networks.
There is no obvious reason why you could not establish a well-known multicast address/port for out-of-band traffic. A reliable multicast protocol, such as SRM, NORM or FLUTE, could then be used to carry the information between nodes. The advantage of this approach is that it requires the least alteration to the code - a single transmission to the group address as opposed to one transmission to each target - AND would work perfectly well with the new approach described. The drawbacks are that it would have to be switchable, since multicast is truly horrible on dumber devices; that development resources aren't infinite; and that the number of cases where it will actually win is limited.

(It's entirely coincidental that this is a capability that I actually need. Well, almost!)

Jonathan Day

> Message: 1
> Date: Mon, 04 Dec 2006 06:26:26 -0700
> From: Ralph Castain <r...@lanl.gov>
> Subject: [OMPI devel] Major revision to the RML/OOB
> To: Open MPI Core Developers <devel-c...@open-mpi.org>, Open MPI
>     Developers <de...@open-mpi.org>
> Message-ID: <c1997012.b86%...@lanl.gov>
> Content-Type: text/plain; charset="US-ASCII"
>
> Hello all
>
> If you are interested in the ongoing scalability work, or in the RML/OOB in
> ORTE, please read on - otherwise, feel free to hit "delete".
>
> As many of you know, we have been working towards solving several problems
> that affect our ability to operate at large scale. Some of the required
> modifications to the code base have recently been applied to the trunk.
>
> We have known since it was originally written over two years ago that the
> OOB contained some inherent scalability limits. For example, the system
> immediately upon opening obtains contact info for all daemons in the
> universe, opens sockets to them, and sends an initial message to them. It
> then does the same with all the application processes in its job.
>
> As a result, for a 2000 process job running on 500 nodes, each application
> process will immediately open and communicate across 2501 sockets (2000
> procs + 500 daemons [one per node] + the HNP) during the startup phase.
>
> If you really want to imagine some fun, now have that job comm_spawn 500
> processes across the 500 nodes, and *don't* reuse daemons. As each new
> daemon is spawned, every process in the original job (including the original
> daemons) is notified, loads the new contact info for that daemon, opens a
> socket to it, and does an "ack" comm. After all 500 new daemons are running,
> they now launch the 500 new procs, each of which gets the info on 1000
> daemons plus the info for 2000 parents and 500 peers, and immediately opens
> 1000 daemons + 2000 parents + 500 peers + 1 HNP = 3501 sockets!
>
> This was acceptable for small jobs, but causes considerable delay during
> startup for large jobs. A few other OOB operational characteristics further
> exacerbate the problem - I will detail those in a document on the wiki to
> help foster greater understanding.
>
> Jeff Squyres and I are about to begin a major revision of the RML/OOB code
> to resolve these problems. We will be using a staged approach to the effort:
>
> 1. separate the OOB's actions for loading contact info from actually opening
> a socket to a process. Currently, the OOB immediately opens a socket and
> performs an "ack" communication whenever contact info for another process is
> loaded into it. In addition, the OOB immediately subscribes to the job
> segment of the provided process, requesting that this process be alerted to
> *any* change in OOB contact info to any process in that job. These actions
> need to be separated out.
>
> 2. revise the RML/OOB init/open procedure.
> These are currently interwoven in a manner that causes the OOB to execute
> registry operations that are not needed (and actually cause headaches)
> during orte_init. The procedure will be revised so that connections to the
> HNP and to the process' local orted are opened, but all other contact info
> (e.g., for the other procs in the job) is simply loaded into the OOB's
> contact tables, but no sockets opened until first communication.
>
> 3. revise the xcast procedure so that it relays via the daemons and not the
> application processes. For systems that do not use our daemons, alternative
> mechanisms will be developed.
>
> At some point in the future, a fully routable OOB will be developed to
> remove the need for so many sockets on each application process. For now,
> these steps should improve our startup time considerably.
>
> With some luck and (hopefully) not too many conflicting priorities, Jeff and
> I may complete this work by Christmas - more likely, though, is sometime
> early in Jan. We will be working on a tmp branch, but you may see some
> transfer of code to the trunk as we progress.
>
> As always, feel free to comment and/or make suggestions!
> Ralph
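
To make the multicast suggestion at the top of this message a little more concrete, here is a minimal Python sketch of the difference between "one transmission to each target" and "a single transmission to the group address". It is not Open MPI code and not a proposed implementation: the group address and port, and the names make_sender, announce_unicast and announce_multicast, are all made up for illustration. It also uses plain UDP multicast, which is unreliable; the reliability layer that SRM, NORM or FLUTE would provide is not shown.

```python
import socket

# Hypothetical well-known group/port, for illustration only. Nothing in
# Open MPI defines these; 239.0.0.0/8 is the administratively scoped range.
OOB_GROUP = ("239.255.0.1", 50000)


def make_sender(ttl=1):
    """Create a UDP socket suitable for sending to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # A TTL of 1 keeps the datagrams on the local subnet.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def announce_unicast(sock, payload, peers):
    """Send the payload once per target; return the number of transmissions."""
    sent = 0
    for addr in peers:
        sock.sendto(payload, addr)
        sent += 1
    return sent


def announce_multicast(sock, payload, group=OOB_GROUP):
    """Send the payload once to the group address; return the number of
    transmissions (always 1). Plain UDP: no delivery guarantee."""
    sock.sendto(payload, group)
    return 1
```

With a socket from make_sender(), announce_unicast costs one sendto per peer (2501 of them for the 2000-process example quoted above, if every contact were messaged), whereas announce_multicast costs one regardless of group size. Receivers would additionally have to join the group, and the whole path would need to be switchable back to unicast on networks where multicast behaves badly.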