Hello all
As some of you may remember, I am in the process of rewriting the IOF
subsystem. While working my way through the revisions, I discovered
something about the current IOF that significantly impacts
scalability. Since I know some people retain interest in that area, I
thought I would pass the observations along.
When an orted fork/exec's an application process, it automatically
wires up the IOF for that process. In the current system, that entails
sending a minimum of three messages to mpirun for each process, each
message in turn generating an "ack" message back to the orted. Thus,
during launch, the IOF sends at least 6*nprocs messages across the OOB.
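To make the arithmetic concrete, here is a trivial sketch (illustrative only, not ORTE code; the job sizes are arbitrary examples I picked) of the lower bound this wireup pattern puts on OOB traffic:

/* Illustrative sketch only -- not ORTE code.  Lower bound on OOB traffic
 * from the current per-proc IOF wireup: three wireup messages per proc
 * (the stated minimum), each one generating an ack back to the orted. */
#include <stdio.h>

static long iof_wireup_msgs(long nprocs)
{
    const long wireup_per_proc = 3;        /* minimum messages to mpirun per proc */
    return 2 * wireup_per_proc * nprocs;   /* each message plus its ack */
}

int main(void)
{
    const long sizes[] = { 1024, 4096, 16384 };   /* example job sizes (assumed) */
    for (int i = 0; i < 3; i++) {
        printf("%6ld procs -> at least %ld OOB messages\n",
               sizes[i], iof_wireup_msgs(sizes[i]));
    }
    return 0;
}

Every one of those messages is an individual OOB send plus a matching ack, which is why the total adds up so quickly.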
Unfortunately, this is all done outside of our daemon collective
system, so every message is handled independently on both ends. As you
can imagine, mpirun gets somewhat deluged for large jobs. With the
advent of the orte_routed framework, at least these messages don't
create new TCP connections - but they do force mpirun to deal with a
large number of inbound messages.
Lest someone think the original authors were "stupid", let me hasten
to point out that they wrote this system to a clear set of
requirements focused on creating a generic RTE - i.e., one not
tailored to OMPI's specific needs. Thus, the system was designed to
support capabilities we don't need, and couldn't take advantage of any
knowledge of the end-state OMPI was trying to achieve.
As an example of the impact, on RoadRunner, the current IOF results in
the transmission of over 72,000 messages between procs and mpirun
during startup of a petaflop application - just to wire up the IOF.
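For perspective, that figure is just the 6-messages-per-proc minimum applied to a job of roughly 12,000 procs (72,000 / 6 = 12,000 - the proc count here is inferred from the message total, not a separately reported number).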
In the rewrite, I am taking advantage of knowing OMPI's desired final
configuration to eliminate -all- of these communications. Should
improve things considerably - hope to have it completed in a week or
two, though it won't come into the trunk until we release 1.3.
Ralph