Hey folks

I've had a few questions lately about how ORCM plans to support fast MPI 
startup, so I figured I'd pass along some notes on the matter. Nothing secret 
about it - these are things we've discussed in the OMPI world a few times and 
have simply adopted/implemented in ORCM. Indeed, the early steps are either 
already in OMPI or will be arriving there soon.

There are a few key elements to the revised wireup plan, some of which are 
implemented in ORCM and some in OMPI:

* more scalable barrier algorithms in the RTE. We took a first step towards 
this with the PMIx modifications, and Mellanox is working on the next phase. I 
have a final follow-on phase that will further reduce the time required for a 
cross-cluster barrier to a very low level; we'll provide numbers as we measure 
them once it's complete.
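
For the curious, here is roughly what that fence looks like from the client 
side - a minimal sketch against a recent PMIx client API, so take the exact 
calls as illustrative rather than as what ORCM ships:

    #include <stdbool.h>
    #include <pmix.h>

    /* Execute a pure barrier (no data collection) across all procs in
     * our namespace.  Returns 0 on success. */
    static int rte_barrier(void)
    {
        pmix_proc_t myproc, wildcard;
        pmix_info_t info;
        bool collect = false;   /* barrier only - skip the modex data exchange */
        pmix_status_t rc;

        if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
            return -1;
        }

        /* rank = wildcard means "every rank in my namespace" */
        PMIX_PROC_LOAD(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

        PMIX_INFO_CONSTRUCT(&info);
        PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
        rc = PMIx_Fence(&wildcard, 1, &info, 1);
        PMIX_INFO_DESTRUCT(&info);

        PMIx_Finalize(NULL, 0);
        return (PMIX_SUCCESS == rc) ? 0 : -1;
    }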

* RM management of connection information. Much of the data we modex around is 
actually "static" - e.g., we send NIC-level info on GIDs and LIDs for 
InfiniBand. These do change upon restart of the respective fabric manager, but 
that is a rare and detectable occurrence. Now that the BTLs are in the OPAL 
layer and thus accessible from outside of MPI, ORCM's daemons are querying the 
BTLs for this non-job-level information and including it in their inventory 
report. Thus, each daemon now has access to all that info at time-zero, and 
there is no longer a need to include it in the modex. The daemon provides the 
table to each local process via shared memory to minimize the memory footprint.
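
Purely as an illustration of the idea, the published table could look something 
like this (hypothetical names and layout - not the actual ORCM structures):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical entry in the daemon-published fabric inventory:
     * one record of static, job-independent NIC info per HCA port. */
    typedef struct {
        uint32_t node_id;    /* node that owns this NIC */
        uint8_t  port;       /* HCA port number */
        uint16_t lid;        /* InfiniBand local identifier */
        uint8_t  gid[16];    /* InfiniBand global identifier */
    } nic_inventory_t;

    /* A process maps the daemon's segment read-only and scans it for a
     * peer's node instead of exchanging this info in the modex.
     * (A real implementation would index rather than scan.) */
    static const nic_inventory_t *
    find_nic(const nic_inventory_t *table, size_t nentries, uint32_t node_id)
    {
        for (size_t i = 0; i < nentries; i++) {
            if (table[i].node_id == node_id) {
                return &table[i];
            }
        }
        return NULL;  /* missing/stale - e.g., after a fabric manager restart */
    }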

* RM assignment of rendezvous endpoints. Most fabrics either use connectionless 
endpoints or have a rendezvous protocol whereby two procs can exchange a 
handshake to dynamically assign endpoints. In either case, 
ORCM's daemons will query the fabric at startup to identify assignable 
endpoints, and then assign them on a node-rank basis to processes as they are 
launched. Procs can then use the provided table to look up the endpoint info 
for any peer in the system and connect to it as desired. The daemon again 
provides this table via shared memory.
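
Again hypothetically, assigning endpoints by node and node rank means a peer's 
endpoint lookup reduces to simple indexing into the shared table (illustrative 
names only):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical pre-assigned rendezvous endpoint: whatever handle the
     * fabric needs in order to reach a given process. */
    typedef struct {
        uint32_t ep_handle;
    } rdv_endpoint_t;

    /* If the daemons assign endpoints by (node, node rank), a process can
     * find any peer's endpoint with plain arithmetic on the shared table -
     * no handshake with the peer is needed before the first message. */
    static const rdv_endpoint_t *
    peer_endpoint(const rdv_endpoint_t *table, size_t max_ppn,
                  size_t peer_node, size_t peer_node_rank)
    {
        return &table[peer_node * max_ppn + peer_node_rank];
    }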

* conversion of BTLs to call modex_recv only on the first message to a peer 
instead of at startup. This is required not only to improve startup times, but 
also to contain the memory footprint as we get to ever-larger scale. The fact is 
that most MPI apps only involve a rank talking to a small subset of its peers. 
Thus, having each rank "grab" all endpoint info for every other process in the 
job at startup wastes both time and memory as most of that data will never be 
used. Some of the BTLs have already been modified for this mode of operation, 
and the PMIx change takes advantage of it when only those BTLs are in use. This 
becomes less of an issue for fabrics where the RM can fully manage endpoints, 
but is a significant enhancement for situations where we cannot do so.
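
The lazy-connect pattern in a nutshell - a heavily simplified sketch with 
made-up helper names, not the actual BTL interface (the real add_procs/modex 
machinery is considerably more involved):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-peer state and helpers, for illustration only. */
    typedef struct { int id; } endpoint_t;

    typedef struct {
        bool       endpoint_ready;
        endpoint_t endpoint;
    } peer_t;

    /* Stand-ins for the real modex lookup and fabric send. */
    extern int modex_recv_endpoint(const peer_t *peer, endpoint_t *ep);
    extern int fabric_send(const endpoint_t *ep, const void *buf, size_t len);

    static int send_to_peer(peer_t *peer, const void *buf, size_t len)
    {
        if (!peer->endpoint_ready) {
            /* First message to this peer: only now fetch its connection
             * info, rather than pre-fetching it for every rank at startup. */
            if (0 != modex_recv_endpoint(peer, &peer->endpoint)) {
                return -1;
            }
            peer->endpoint_ready = true;
        }
        return fabric_send(&peer->endpoint, buf, len);
    }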

* distributed mapping. Right now, mpirun computes the entire process map and 
sends it around to all the daemons, who then extract their list of local 
processes to start. This results in a fairly large launch message. The next 
update, coming in a couple of weeks, will shift that burden to the daemons by 
having mpirun send only the app_context info to the daemons. Each daemon will 
independently compute the map, spawning its own local procs as they are 
computed. This will result in a much smaller launch message and help further 
parallelize the procedure.
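
The reason this works is that a deterministic mapping policy gives every daemon 
the same answer from the same inputs, so each one only needs to pick out its 
own node. A toy by-slot example (the real mappers support far more policies):

    #include <stdio.h>

    /* Print the ranks (and their node ranks) that land on `my_node` when
     * `nprocs` processes are mapped by slot across `nnodes` nodes with
     * `slots` slots each, wrapping if oversubscribed. */
    static void my_local_procs(int nprocs, int nnodes, int slots, int my_node)
    {
        int node_rank = 0;
        for (int rank = 0; rank < nprocs; rank++) {
            int node = (rank / slots) % nnodes;   /* fill a node, then move on */
            if (node == my_node) {
                printf("rank %d -> node %d (node rank %d)\n",
                       rank, my_node, node_rank++);
            }
        }
    }

    int main(void)
    {
        /* e.g., the daemon on node 2 of a 4-node, 8-slot-per-node allocation */
        my_local_procs(32, 4, 8, 2);
        return 0;
    }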

* use of faster transports. ORCM's daemons will utilize the fabric during job 
startup to move messages and barriers around, and then fall back to the OOB 
transport for any communication during job execution in order to minimize 
interference with the application.
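
In other words, the daemons' transport choice is just a function of the job 
phase (hypothetical names, not actual ORCM code):

    typedef enum { JOB_STARTUP, JOB_RUNNING } job_phase_t;
    typedef enum { VIA_FABRIC, VIA_OOB } transport_t;

    static transport_t daemon_transport(job_phase_t phase)
    {
        /* Use the fast fabric while the app isn't using it yet, then get
         * out of its way by dropping back to the out-of-band channel. */
        return (JOB_STARTUP == phase) ? VIA_FABRIC : VIA_OOB;
    }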

It'll take time to get all this integrated with OMPI (all done in plugins so 
other RMs can integrate as required/available), but we hope to have most of 
this available in the next 6-8 months.

HTH - always happy to provide more detail, and/or welcome contributions
Ralph
