Sounds good Ralph; thanks!
On Jun 12, 2007, at 9:54 AM, Ralph H Castain wrote:
I made a major commit to the trunk this morning (r15007) that merits
notification and some explanation.

*** IMPORTANT NOTE ***
One major impact of the commit you *may* notice is that support for
some environments will be broken. This commit is known to break support
for the following environments: POE, Xgrid, Xcpu, Windows - these will
not compile at this time. It has been tested on rsh, SLURM, and Bproc.
Modifications for TM support have been made but could not be verified
due to machine problems at LANL. Modifications for SGE have been made
but could not be verified. I will send out a separate note to developers
of the affected environments with suggestions on how to fix the
problems. These should be relatively minor, mostly involving a small
change to a couple of function calls and the addition of one function
call in their respective launch code.

As many of you have noted, the ORTE launch procedure relies heavily on
the orte_rml.xcast function to communicate occasionally large messages
to every process in a job. This procedure has - until now - been a
linear communication that sent the messages directly to every process.
As many of you have pointed out, this was a very inefficient approach.

This commit repairs that problem, but it comes with a few side effects.
You shouldn't notice anything different (except hopefully for faster
launches), but I will note the differences here.

First, orte_rml.xcast has become a general broadcast-like messaging
function. Messages can now be sent to any tag on the daemons or
processes. Note that any message sent via xcast will be delivered to ALL
processes in the specified job - you don't get to pick and choose. At a
later date, we will introduce an augmented capability that will use the
daemons as relays but will allow you to send to a specified array of
process names.

We also modified orte_rml.xcast so it supports more scalable messaging
methodologies. At the moment, we support three:

(a) direct, which sends the message directly to all recipients. By
default, this mode is used whenever we have fewer than 10 daemons. You
can adjust the crossover point via the oob_xcast_linear_xover param -
set this param to the number of daemons where you want direct to give
way to linear. If you set this to something very large, then we will
only use direct mode - set it to zero, and we won't use direct at all.
Alternatively, you can force the use of direct at all scales by setting
oob_xcast_mode to "direct".

(b) linear, which sends the message to the local daemon on each node.
Each daemon then relays it to its own local procs. Note that the daemons
in this mode do not relay the message between themselves, but only to
their local procs. As per a prior message, I have set linear to be the
default for all jobs involving more than 10 daemons. Again, you can
adjust this by setting a lower bound on where linear kicks in (as
described above). You can also set an upper bound where linear gives way
to binomial by setting the oob_xcast_binomial_xover param.
Alternatively, you can force the use of linear at all scales by setting
oob_xcast_mode to "linear".

(c) binomial, which sends the message via a binomial algo across the
daemons, each of which then relays to its own local procs. This is a
typical binomial algorithm across the daemons. At this time, I have set
the default on this mode to be "off" so it will never kick in. If you
want to try it out, you will need to either adjust the xover param (as
described above), or set oob_xcast_mode to "binomial".
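
To make the three modes concrete, here is a minimal Python sketch of who
sends to whom in each one. It is an illustration only, not the ORTE
code: the names xcast_hops and binomial_children, the "mpirun" and
"daemonN" labels, and the choice to root the binomial tree at the first
daemon are all assumptions made for this example. The tree is the
textbook binomial broadcast tree, which may differ in detail from what
the commit implements.

```python
from typing import Dict, List, Tuple

Hop = Tuple[str, str]  # (sender, receiver)


def binomial_children(rank: int, size: int) -> List[int]:
    """Children of `rank` in a textbook binomial broadcast tree rooted at 0."""
    mask = 1
    # Stop at the lowest set bit of rank; rank 0 runs past `size`.
    while mask < size and not (rank & mask):
        mask <<= 1
    children = []
    mask >>= 1
    while mask > 0:
        if rank + mask < size:
            children.append(rank + mask)
        mask >>= 1
    return children


def xcast_hops(mode: str, procs_by_daemon: Dict[int, List[str]]) -> List[Hop]:
    """List every (sender, receiver) hop for one broadcast.

    procs_by_daemon maps a daemon index to the names of its local procs.
    Hypothetical model: the originator is always labelled "mpirun".
    """
    daemons = sorted(procs_by_daemon)
    if not daemons:
        return []
    hops: List[Hop] = []
    if mode == "direct":
        # Originator sends to every process itself; daemons are not used.
        for d in daemons:
            hops += [("mpirun", p) for p in procs_by_daemon[d]]
        return hops
    if mode == "linear":
        # Originator sends once per daemon; daemons do not talk to each other.
        hops += [("mpirun", f"daemon{d}") for d in daemons]
    elif mode == "binomial":
        # Assumption: originator hands the message to the first daemon,
        # which acts as the root of the binomial tree.
        hops.append(("mpirun", f"daemon{daemons[0]}"))
        for rank, d in enumerate(daemons):
            for child in binomial_children(rank, len(daemons)):
                hops.append((f"daemon{d}", f"daemon{daemons[child]}"))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # In both relayed modes each daemon delivers to its own local procs.
    for d in daemons:
        hops += [(f"daemon{d}", p) for p in procs_by_daemon[d]]
    return hops


if __name__ == "__main__":
    layout = {d: [f"p{2 * d}", f"p{2 * d + 1}"] for d in range(4)}
    for mode in ("direct", "linear", "binomial"):
        sent = sum(1 for s, _ in xcast_hops(mode, layout) if s == "mpirun")
        print(mode, "messages sent by mpirun:", sent)
```

The point of the comparison is the load on the originator: in this model
it sends one message per process in direct mode, one per daemon in
linear mode, and a single message in binomial mode.
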
Please note that we *do* use the direct messaging mode whenever there is
only one daemon in the system. This is non-negotiable - it is required
for singletons and for getting mpirun up and running. Besides, if there
is only one daemon in the system, every message goes "direct" no matter
what mode you pick, so you shouldn't care. ;-)
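
The selection rules described above can be summarized in a short Python
sketch. This is a model of the description, not the actual
implementation: select_xcast_mode is a hypothetical name, its arguments
merely mirror the oob_xcast_linear_xover, oob_xcast_binomial_xover and
oob_xcast_mode params, and None is used here to stand in for binomial
being "off" (the real default value is not stated). Two further
assumptions: linear takes over at exactly the crossover value, and
binomial wins if both crossovers are met.

```python
from typing import Optional

MODES = ("direct", "linear", "binomial")


def select_xcast_mode(
    num_daemons: int,
    linear_xover: int = 10,
    binomial_xover: Optional[int] = None,
    forced_mode: Optional[str] = None,
) -> str:
    """Pick a mode from the daemon count, as described in the message."""
    if num_daemons < 1:
        raise ValueError("need at least one daemon")
    # A single daemon always means direct, whatever was requested.
    if num_daemons == 1:
        return "direct"
    # Models setting oob_xcast_mode: one mode at all scales.
    if forced_mode is not None:
        if forced_mode not in MODES:
            raise ValueError(f"unknown mode: {forced_mode!r}")
        return forced_mode
    # Binomial is off unless a crossover is supplied (assumption: None = off).
    if binomial_xover is not None and num_daemons >= binomial_xover:
        return "binomial"
    # Direct gives way to linear at the linear crossover.
    if num_daemons >= linear_xover:
        return "linear"
    return "direct"


if __name__ == "__main__":
    for n in (1, 4, 10, 64):
        print(n, "daemons:", select_xcast_mode(n))
    print(64, "daemons, binomial_xover=32:",
          select_xcast_mode(64, binomial_xover=32))
```

Setting linear_xover to zero removes direct mode for every job with more
than one daemon, and setting it very large keeps everything in direct
mode, which matches the behavior described for the param.
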
Also note that the current crossover points were totally arbitrary. I
had no data to base those crossovers on, so I simply picked something.
If those of you with access to larger systems and with some free time
could try various values, then we could come up with something more
reasonable. Any data would be most appreciated!

This commit also involved a significant change to the orteds
themselves. The requirement that orteds *always* be available to relay
messages required a change in the way they come alive. In the past,
orteds bootstrapped themselves in two totally different code paths:
bootproxy or VM. This is no longer supported. Orteds now always behave
like they are part of a virtual machine - they simply launch a job if
mpirun tells them to do so. This is another step towards creating an
"orteboot" functionality, but it also provides a clean system for
supporting message relaying.

Note one major impact of this commit: multiple daemons on a node are
not supported any longer! Only a single daemon/node is now allowed. You
shouldn't notice any difference as this was always transparent. However,
if you are working in an environment where daemons occupied job slots,
you should see some benefit.

Please let me know of any problems. I did my best to test this, but
there will undoubtedly be some problems that crop up, and some code
paths borked that I didn't see on any of my available machines or in my
tests.