Sounds good, Ralph; thanks!
On Jun 12, 2007, at 9:54 AM, Ralph H Castain wrote:
Yo all
I made a major commit to the trunk this morning (r15007) that merits general notification and some explanation.
*** IMPORTANT NOTE ***
One major impact of the commit you *may* notice is that support for several environments will be broken. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, and Windows - these environments will not compile at this time. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. I will send out a separate note to developers of the borked environments with suggestions on how to fix the problems. These should be relatively minor, mostly involving a small change to a couple of function calls and the addition of one function call in their respective launch functions.
As many of you have noted, the ORTE launch procedure relies heavily on the orte_rml.xcast function to communicate occasionally large messages to every process in a job. Until now, this has been a linear communication that sent the messages directly to every process - obviously a very inefficient approach. This commit repairs that problem, but it comes with a few side effects. You shouldn't notice anything different (except, hopefully, faster starts), but I will note the differences here.
First, orte_rml.xcast has become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.
We also modified orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three:
(a) direct, which sends the message directly to all recipients. By default, this mode is used whenever we have fewer than 10 daemons. You can adjust that crossover point via the oob_xcast_linear_xover param - set this param to the number of daemons at which you want direct to give way to linear. Obviously, if you set this to something very large, then we will only use direct xcast mode; set it to zero, and we won't use direct at all. Alternatively, you can force the use of direct at all scales by setting oob_xcast_mode to "direct".
(b) linear, which sends the message to the local daemon on each node. The daemon then relays it to its own local procs. Note that the daemons in this mode do not relay the message between themselves, but only to their local procs. As per a prior message, I have set linear to be the default mode on all jobs involving more than 10 daemons. Again, you can adjust this by setting a lower bound on where linear kicks in (as described above). You can also set an upper bound where linear gives way to binomial by setting the oob_xcast_binomial_xover param. Alternatively, you can force the use of linear at all scales by setting oob_xcast_mode to "linear".
(c) binomial, which sends the message across all the daemons via a typical binomial algorithm, each daemon then relaying it to its own local procs. At this time, I have set the default on this mode to be "off" so it will never kick in. If you want to try it out, you will need to either adjust the xover param (as described above) or set oob_xcast_mode to "binomial".
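To make the three modes concrete, here is a minimal sketch - not the actual ORTE code, just an illustration in Python - of how the routing mode might be chosen from the two crossover params described above (assuming the stated defaults: crossover at 10 daemons, binomial effectively off), plus the fan-out pattern of a binomial broadcast rooted at daemon 0:

```python
# Hypothetical illustration only - names and defaults mirror the params
# described in this mail (oob_xcast_linear_xover, oob_xcast_binomial_xover,
# oob_xcast_mode), not the real ORTE implementation.

LINEAR_XOVER = 10          # oob_xcast_linear_xover: direct -> linear
BINOMIAL_XOVER = 10**9     # oob_xcast_binomial_xover: effectively "off"

def pick_mode(num_daemons, forced=None):
    """Pick the xcast routing mode for a job with num_daemons daemons."""
    if num_daemons == 1:
        return "direct"    # singletons / mpirun startup: always direct
    if forced:
        return forced      # oob_xcast_mode overrides the crossovers
    if num_daemons < LINEAR_XOVER:
        return "direct"
    if num_daemons < BINOMIAL_XOVER:
        return "linear"
    return "binomial"

def binomial_children(rank, size):
    """Daemons this daemon relays to in a binomial broadcast rooted at 0.

    Rank r sends to r + 2**k for every power of two 2**k > r that stays
    within size - the classic binomial tree pattern.
    """
    children, mask = [], 1
    while mask <= rank:    # skip powers of two already "used" to reach us
        mask <<= 1
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children
```

With this fan-out, daemon 0 in an 8-daemon job relays to daemons 1, 2, and 4, each of which relays onward, so a broadcast over 1024 daemons completes in 10 relay steps rather than 1024 direct sends from mpirun.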
Please note that we *do* use the direct messaging mode whenever there is only one daemon in the system. This is non-negotiable - it is mandated for singletons and for getting mpirun up and running. Besides, if there is only one daemon in the system, every message goes "direct" no matter which mode you pick, so you shouldn't care. ;-)
Also note that the current crossover points are totally arbitrary. I have no data to base those crossovers on, so I simply picked something for now. If those of you with access to larger systems and with some free time could try various values, then we could come up with something more intelligent. Any data would be most appreciated!
This commit also involved a significant change to the orteds themselves. The requirement that orteds *always* be available to relay messages mandated a change in the way they come alive. In the past, orteds bootstrapped themselves via two totally different code paths: bootproxy or VM. This is no longer supported. Orteds now always behave as if they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but it also provides a clean system for supporting message relaying.
Note one major impact of this commit: multiple daemons on a node can no longer be supported! Only a single daemon per node is now allowed. You shouldn't notice any difference, as this was always transparent. However, if you are working in an environment where daemons occupied job slots, you should see some benefit.
Please let me know of any problems. I did my best to test this, but there will undoubtedly be some problems that crop up, and some code paths that are borked that I didn't see on any of my available machines or in my configurations.
Ralph
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems