Hello all
As some of you may remember, I am in the process of rewriting the IOF
subsystem. While working my way through the revisions, I discovered
something about the current IOF that significantly impacts
scalability. Since I know some people retain interest in that area, I
thought I would pass the observations along.
When an orted fork/exec's an application process, it automatically
wires up the IOF for that process. In the current system, that entails
sending a minimum of three messages to mpirun for each process, each
message in turn generating an "ack" message back to the orted. Thus,
during launch, the IOF sends at least 6*nprocs messages across the OOB.
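To make the arithmetic concrete, here is a trivial sketch (illustrative only, not ORTE code; the job sizes are arbitrary examples I picked) of the lower bound this wireup pattern puts on OOB traffic:

/* Illustrative sketch only -- not ORTE code.  Lower bound on OOB traffic
 * from the current per-proc IOF wireup: three wireup messages per proc
 * (the stated minimum), each one generating an ack back to the orted. */
#include <stdio.h>

static long iof_wireup_msgs(long nprocs)
{
    const long wireup_per_proc = 3;        /* minimum messages to mpirun per proc */
    return 2 * wireup_per_proc * nprocs;   /* each message plus its ack */
}

int main(void)
{
    const long sizes[] = { 1024, 4096, 16384 };   /* example job sizes (assumed) */
    for (int i = 0; i < 3; i++) {
        printf("%6ld procs -> at least %ld OOB messages\n",
               sizes[i], iof_wireup_msgs(sizes[i]));
    }
    return 0;
}

Every one of those messages is an individual OOB send plus a matching ack, which is why the total adds up so quickly.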
Unfortunately, this is all done outside of our daemon collective
system, so every message is handled independently on both ends. As you
can imagine, mpirun gets somewhat deluged for large jobs. With the
advent of the orte_routed framework, at least these messages don't
create new TCP connections - but they do force mpirun to deal with a
large number of inbound messages.
Lest someone think the original authors were "stupid", let me hasten
to point out that they wrote this system to a clear set of
requirements focused on creating a generic RTE - i.e., one not
tailored to OMPI's specific needs. Thus, the system was designed to
support capabilities we don't need, and couldn't take advantage of any
knowledge of the end-state OMPI was trying to achieve.
As an example of the impact, on RoadRunner, the current IOF results in
the transmission of over 72,000 messages between procs and mpirun
during startup of a petaflop application - just to wire up the IOF.
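For perspective, that figure is just the 6-messages-per-proc minimum applied to a job of roughly 12,000 procs (72,000 / 6 = 12,000 - the proc count here is inferred from the message total, not a separately reported number).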
In the rewrite, I am taking advantage of knowing OMPI's desired final
configuration to eliminate -all- of these communications. Should
improve things considerably - hope to have it completed in a week or
two, though it won't come into the trunk until we release 1.3.
Ralph