Sounds good Ralph; thanks!

On Jun 12, 2007, at 9:54 AM, Ralph H Castain wrote:

Yo all

I made a major commit to the trunk this morning (r15007) that merits general
notification and some explanation.

                   *** IMPORTANT NOTE ***
One major impact of the commit you *may* notice is that support for several environments is broken. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows - these environments will
not compile at this time. It has been tested on rsh, SLURM, and Bproc.
Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. I will send out a separate note to developers of the borked environments with suggestions on how to fix the problems. The fixes should be relatively minor, mostly involving a small change to a couple of function
calls and the addition of one function call in their respective launch

As many of you have noted, the ORTE launch procedure relies heavily on the orte_rml.xcast function to communicate occasionally large messages to every
process in a job. This procedure has - until now - been a linear
communication that sent the messages directly to every process (what is called "direct" mode below). As many of you have pointed out, this was very inefficient.

This commit repairs that problem, but it comes with a few side effects. You shouldn't notice anything different (except hopefully for faster starts),
but I will note the differences here.

First, orte_rml.xcast has become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that
any message sent via xcast will be delivered to ALL processes in the
specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but
will allow you to send to a specified array of process names.

We also modified orte_rml.xcast so it supports more scalable message routing
methodologies. At the moment, we support three:

(a) direct, which sends the message directly to all recipients. By default, this mode is used whenever we have fewer than 10 daemons. You can adjust that crossover point via the oob_xcast_linear_xover param - set this param to the number of daemons where you want direct to give way to linear. If you set this to something very large, then we will only use direct xcast mode - set it to zero, and we won't use direct at all. Alternatively, you
can force the use of direct at all scales by setting oob_xcast_mode to

(b) linear, which sends the message to the local daemon on each node. The daemon then relays it to its own local procs. Note that the daemons in this mode do not relay the message between themselves, but only to their local procs. As per a prior message, I have set linear to be the default mode on
all jobs involving more than 10 daemons. Again, you can adjust this by
setting a lower bound on where linear kicks in (as described above). You can also set an upper bound where linear gives way to binomial by setting the oob_xcast_binomial_xover param. Alternatively, you can force the use of
linear at all scales by setting oob_xcast_mode to "linear".

(c) binomial, which sends the message via a typical binomial algorithm across all the daemons, each of which then relays to its own local procs. At this time, I have set the default on this mode to be "off" so it will never kick in. If you want to try it out, you will need to either adjust the xover param (as described
above), or set oob_xcast_mode to "binomial".
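
As a rough illustration of the binomial mode in (c), the sketch below computes which daemons a given daemon would relay to in one common binomial-tree layout. This is an assumption for illustration only: the function name binomial_children and the numbering of daemons from 0 (with the sender at 0) are hypothetical, and the actual ORTE implementation may order the tree differently.

```python
from typing import List


def binomial_children(rank: int, num_daemons: int) -> List[int]:
    """Return the daemons that daemon `rank` relays to in a binomial tree.

    Hypothetical layout: daemons are numbered 0..num_daemons-1 and the
    message originates at daemon 0. In round k, every daemon that already
    holds the message (ranks below 2**k) forwards it to rank + 2**k.
    """
    if num_daemons < 1 or not 0 <= rank < num_daemons:
        raise ValueError("rank must satisfy 0 <= rank < num_daemons")

    # Find the first round in which this daemon is a sender, i.e. the
    # smallest power of two strictly greater than its own rank.
    step = 1
    while step <= rank:
        step *= 2

    # From that round on, it sends to rank + step in every later round
    # for as long as the target exists.
    children = []
    while rank + step < num_daemons:
        children.append(rank + step)
        step *= 2
    return children


if __name__ == "__main__":
    for r in range(8):
        print(r, "->", binomial_children(r, 8))
```

With 8 daemons, daemon 0 sends only 3 messages instead of 7, and every other daemon receives the message exactly once before relaying to its own local procs.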

Please note that we *do* use the direct messaging mode whenever there is only one daemon in the system. This is non-negotiable - it is mandated for singletons and for getting mpirun up and running. Besides, if there is only one daemon in the system, every message goes "direct" no matter which mode
you pick, so you shouldn't care. ;-)
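
To make the selection rules above concrete, here is a minimal sketch of how the mode could be chosen from the daemon count. It is not the ORTE code: the function select_xcast_mode and its argument names are hypothetical, binomial being "off" by default is modeled as None, and the behavior exactly at a crossover value (at or above the crossover switches to the next mode) is an assumption.

```python
from typing import Optional

VALID_MODES = ("direct", "linear", "binomial")


def select_xcast_mode(num_daemons: int,
                      linear_xover: int = 10,
                      binomial_xover: Optional[int] = None,
                      forced_mode: Optional[str] = None) -> str:
    """Pick an xcast mode from the daemon count (illustrative only).

    linear_xover   stands in for oob_xcast_linear_xover
    binomial_xover stands in for oob_xcast_binomial_xover (None = off)
    forced_mode    stands in for oob_xcast_mode
    """
    if num_daemons < 1:
        raise ValueError("need at least one daemon")
    if forced_mode is not None and forced_mode not in VALID_MODES:
        raise ValueError("unknown mode: %r" % (forced_mode,))

    # A single daemon always uses direct, whatever else was requested.
    if num_daemons == 1:
        return "direct"

    # An explicitly requested mode applies at all other scales.
    if forced_mode is not None:
        return forced_mode

    # Below the linear crossover, send directly to every recipient.
    if num_daemons < linear_xover:
        return "direct"

    # Assumed boundary: at or above the binomial crossover, use binomial.
    if binomial_xover is not None and num_daemons >= binomial_xover:
        return "binomial"

    return "linear"


if __name__ == "__main__":
    for n in (1, 5, 10, 500):
        print(n, select_xcast_mode(n))
```

Setting linear_xover to zero removes direct for every multi-daemon job, while a very large value keeps direct at all scales, matching the description in (a).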

Also note that the current crossover points are totally arbitrary. I have no data to base those crossovers on, so I simply picked something for now. If those of you with access to larger systems and with some free time could try various values, then we could come up with something more intelligent.
Any data would be most appreciated!

This commit also involved a significant change to the orteds themselves. The requirement that orteds *always* be available to relay messages mandated a
change in the way they come alive. In the past, orteds bootstrapped
themselves in two totally different code paths: bootproxy or VM. This is no longer supported. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, and it also provides
a clean system for supporting message relaying.

Note one major impact of this commit: multiple daemons on a node cannot be
supported any longer! Only a single daemon/node is now allowed. You
shouldn't notice any difference as this was always transparent. However, if you are working in an environment where daemons occupied job slots, you
should see some benefit.

Please let me know of any problems. I did my best to test this, but there will undoubtedly be some problems that crop up, and some code paths that are
borked that I didn't see on any of my available machines or in my



Jeff Squyres
Cisco Systems
