Sounds good, Ralph; thanks!
On Jun 12, 2007, at 9:54 AM, Ralph H Castain wrote:
Yo all
I made a major commit to the trunk this morning (r15007) that merits general notification and some explanation.
*** IMPORTANT NOTE ***
One major impact of the commit you *may* notice is that support for several environments will be broken. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, and Windows - these environments will not compile at this time. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. I will send out a separate note to developers of the borked environments with suggestions on how to fix the problems. These should be relatively minor, mostly involving a small change to a couple of function calls and the addition of one function call in their respective launch functions.
As many of you have noted, the ORTE launch procedure relies heavily on the orte_rml.xcast function to communicate occasionally large messages to every process in a job. Until now, this has been a linear communication that sent the messages directly to every process - obviously a very inefficient approach. This commit repairs that problem, but it comes with a few side effects. You shouldn't notice anything different (except, hopefully, faster starts), but I will note the differences here.
First, orte_rml.xcast has become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.
We also modified orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three:
(a) direct, which sends the message directly to all recipients. By default, this mode is used whenever we have fewer than 10 daemons. You can adjust that crossover point via the oob_xcast_linear_xover param - set this param to the number of daemons at which you want direct to give way to linear. Obviously, if you set this to something very large, then we will only use direct xcast mode; set it to zero, and we won't use direct at all. Alternatively, you can force the use of direct at all scales by setting oob_xcast_mode to "direct".
(b) linear, which sends the message to the local daemon on each node. The daemon then relays it to its own local procs. Note that the daemons in this mode do not relay the message between themselves, but only to their local procs. As per a prior message, I have set linear to be the default mode on all jobs involving more than 10 daemons. Again, you can adjust this by setting a lower bound on where linear kicks in (as described above). You can also set an upper bound where linear gives way to binomial by setting the oob_xcast_binomial_xover param. Alternatively, you can force the use of linear at all scales by setting oob_xcast_mode to "linear".
(c) binomial, which sends the message across all the daemons via a typical binomial algorithm, each daemon then relaying it to its own local procs. At this time, I have set the default on this mode to be "off" so it will never kick in. If you want to try it out, you will need to either adjust the xover param (as described above) or set oob_xcast_mode to "binomial".
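To make the three modes concrete, here is a minimal sketch - not the actual ORTE code, just an illustration in Python - of how the routing mode might be chosen from the two crossover params described above (assuming the stated defaults: crossover at 10 daemons, binomial effectively off), plus the fan-out pattern of a binomial broadcast rooted at daemon 0:

```python
# Hypothetical illustration only - names and defaults mirror the params
# described in this mail (oob_xcast_linear_xover, oob_xcast_binomial_xover,
# oob_xcast_mode), not the real ORTE implementation.

LINEAR_XOVER = 10          # oob_xcast_linear_xover: direct -> linear
BINOMIAL_XOVER = 10**9     # oob_xcast_binomial_xover: effectively "off"

def pick_mode(num_daemons, forced=None):
    """Pick the xcast routing mode for a job with num_daemons daemons."""
    if num_daemons == 1:
        return "direct"    # singletons / mpirun startup: always direct
    if forced:
        return forced      # oob_xcast_mode overrides the crossovers
    if num_daemons < LINEAR_XOVER:
        return "direct"
    if num_daemons < BINOMIAL_XOVER:
        return "linear"
    return "binomial"

def binomial_children(rank, size):
    """Daemons this daemon relays to in a binomial broadcast rooted at 0.

    Rank r sends to r + 2**k for every power of two 2**k > r that stays
    within size - the classic binomial tree pattern.
    """
    children, mask = [], 1
    while mask <= rank:    # skip powers of two already "used" to reach us
        mask <<= 1
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children
```

With this fan-out, daemon 0 in an 8-daemon job relays to daemons 1, 2, and 4, each of which relays onward, so a broadcast over 1024 daemons completes in 10 relay steps rather than 1024 direct sends from mpirun.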
Please note that we *do* use the direct messaging mode whenever there is only one daemon in the system. This is non-negotiable - it is mandated for singletons and for getting mpirun up and running. Besides, if there is only one daemon in the system, every message goes "direct" no matter which mode you pick, so you shouldn't care. ;-)
Also note that the current crossover points are totally arbitrary. I have no data to base those crossovers on, so I simply picked something for now. If those of you with access to larger systems and with some free time could try various values, then we could come up with something more intelligent. Any data would be most appreciated!
This commit also involved a significant change to the orteds themselves. The requirement that orteds *always* be available to relay messages mandated a change in the way they come alive. In the past, orteds bootstrapped themselves via two totally different code paths: bootproxy or VM. This is no longer supported. Orteds now always behave as if they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but it also provides a clean system for supporting message relaying.
Note one major impact of this commit: multiple daemons on a node can no longer be supported! Only a single daemon per node is now allowed. You shouldn't notice any difference, as this was always transparent. However, if you are working in an environment where daemons occupied job slots, you should see some benefit.
Please let me know of any problems. I did my best to test this, but there will undoubtedly be some problems that crop up, and some code paths that are borked that I didn't see on any of my available machines or in my configurations.
Ralph
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems