Great find Ralph!

On Wed, Aug 27, 2008 at 7:39 PM, Ralph Castain <r...@lanl.gov> wrote:
> Hello all
>
> As some of you may remember, I am in the process of rewriting the IOF
> subsystem. While working my way through the revisions, I discovered
> something about the current IOF that significantly impacts scalability.
> Since I know some people retain interest in that area, I thought I would
> pass the observations along.
>
> When an orted fork/exec's an application process, it automatically wires up
> the IOF for that process. In the current system, that entails sending a
> minimum of three messages to mpirun for each process, each message in turn
> generating an "ack" message back to the orted. Thus, during launch, the IOF
> is sending more than 6*nprocs messages across the OOB.
>
> Unfortunately, this is all done outside of our daemon collective system, so
> every message is handled independently on both ends. As you can imagine,
> mpirun gets somewhat deluged for large jobs. With the advent of the
> orte_routed framework, at least these messages don't create new TCP
> connections - but they do force mpirun to deal with a large number of
> inbound messages.
>
> Lest someone think the original authors were "stupid", let me hasten to
> point out that they wrote this system to a clear set of requirements focused
> on creating a generic RTE - i.e., one not tailored to OMPI's specific needs.
> Thus, the system was designed to support capabilities we don't need, and
> couldn't take advantage of any knowledge of the end-state OMPI was trying to
> achieve.
>
> As an example of the impact, on RoadRunner, the current IOF results in the
> transmission of over 72,000 messages between procs and mpirun during startup
> of a petaflop application - just to wireup the IOF.
>
> In the rewrite, I am taking advantage of knowing OMPI's desired final
> configuration to eliminate -all- of these communications. Should improve
> things considerably - hope to have it completed in a week or two, though it
> won't come into the trunk until we release 1.3.
>
> Ralph
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
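
For a rough sense of the numbers in the quoted message, here is a back-of-the-envelope sketch in Python. It assumes exactly three IOF messages per process and exactly one ack per message, which is the stated minimum, so it gives only a lower bound. The helper names are made up for illustration and are not part of ORTE.

```python
# Lower-bound estimate of IOF wire-up traffic, using the figures from the
# quoted message. Real traffic is described as "more than" this.
MSGS_PER_PROC = 3   # assumed: the stated minimum of messages sent to mpirun per process
ACKS_PER_MSG = 1    # assumed: one "ack" returned to the orted for each message


def min_iof_wireup_messages(nprocs: int) -> int:
    """Minimum OOB messages needed to wire up the IOF (hypothetical helper)."""
    if nprocs < 0:
        raise ValueError("nprocs must be non-negative")
    return nprocs * MSGS_PER_PROC * (1 + ACKS_PER_MSG)


def max_procs_for_messages(total_messages: int) -> int:
    """Largest process count whose minimum traffic fits in total_messages."""
    return total_messages // (MSGS_PER_PROC * (1 + ACKS_PER_MSG))
```

Under these assumptions, `min_iof_wireup_messages(12000)` comes out to 72000, so the RoadRunner figure of "over 72,000" would correspond to a job of about 12,000 processes. That process count is my own inference from the 6*nprocs bound and was not stated in the original post.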
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/