Hello all

As some of you may remember, I am in the process of rewriting the IOF subsystem. While working my way through the revisions, I discovered something about the current IOF that significantly impacts scalability. Since I know some people retain interest in that area, I thought I would pass the observations along.

When an orted fork/exec's an application process, it automatically wires up the IOF for that process. In the current system, that entails sending a minimum of three messages to mpirun for each process, each message in turn generating an "ack" message back to the orted. Thus, during launch, the IOF is sending at least 6*nprocs messages across the OOB.
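
To put rough numbers on that, here is a back-of-envelope sketch of the traffic (plain C, not ORTE code - the 3-messages-plus-acks count is just the minimum described above):

/* Back-of-envelope estimate of IOF wireup traffic during launch.
 * Not ORTE code - just the arithmetic described above. */
#include <stdio.h>

int main(void)
{
    long nprocs = 1024;                  /* example job size */
    long msgs_per_proc = 3;              /* minimum orted -> mpirun messages per proc */
    long acks_per_proc = msgs_per_proc;  /* each message generates an ack back to the orted */

    long total = nprocs * (msgs_per_proc + acks_per_proc);
    printf("IOF wireup for %ld procs: at least %ld OOB messages\n", nprocs, total);
    return 0;
}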

Unfortunately, this is all done outside of our daemon collective system, so every message is handled independently on both ends. As you can imagine, mpirun gets somewhat deluged for large jobs. With the advent of the orte_routed framework, at least these messages don't create new TCP connections - but they do force mpirun to deal with a large number of inbound messages.
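
Just to illustrate the scale of the difference (again not ORTE code - the rolled-up case is purely hypothetical, shown only to make the point about the daemon collective, not what the rewrite actually does):

/* Toy comparison: per-proc IOF wireup messages vs. a hypothetical
 * per-daemon rollup. Numbers only - not ORTE code. */
#include <stdio.h>

static long per_proc_msgs(long nprocs)
{
    return nprocs * 6;     /* 3 messages + 3 acks per proc, as above */
}

static long per_daemon_msgs(long ndaemons)
{
    return ndaemons * 2;   /* one rolled-up message + one ack per daemon */
}

int main(void)
{
    long nprocs = 4096, ppn = 8;
    long ndaemons = nprocs / ppn;        /* assumes 8 procs per node, for illustration */

    printf("independent messages seen by mpirun: %ld\n", per_proc_msgs(nprocs));
    printf("hypothetical per-daemon rollup:      %ld\n", per_daemon_msgs(ndaemons));
    return 0;
}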

Lest someone think the original authors were "stupid", let me hasten to point out that they wrote this system to a clear set of requirements focused on creating a generic RTE - i.e., one not tailored to OMPI's specific needs. Thus, the system was designed to support capabilities we don't need, and couldn't take advantage of any knowledge of the end-state OMPI was trying to achieve.

As an example of the impact, on RoadRunner, the current IOF results in the transmission of over 72,000 messages between procs and mpirun during startup of a petaflop application - just to wire up the IOF.
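
That figure is consistent with the arithmetic above: a petaflop-class run on that machine is on the order of 12,000 MPI processes, and 6 x 12,000 = 72,000 messages just for the IOF wireup.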

In the rewrite, I am taking advantage of knowing OMPI's desired final configuration to eliminate -all- of these communications. Should improve things considerably - hope to have it completed in a week or two, though it won't come into the trunk until we release 1.3.

Ralph
