Hi folks

I mentioned this very briefly at the Tues telecon, but didn't explain it well 
as there just wasn't adequate time available. With the recent updates of the 
embedded PMIx code, OMPI's mpirun now has the ability to fully support 
pre-launch network resource assignment for processes. This includes endpoints 
as well as network coordinates.

In brief, what happens is:

* at startup, the PMIx network support plugins in mpirun obtain their network 
configuration info. In cases where a fabric manager is present, we communicate 
directly with that FM for the info we need. Where no fabric manager is 
available, an MCA param can point us to a file containing the info, or the 
plugin can get it in whatever way the vendor chooses

* when ORTE launches its daemons, the daemons query their PMIx network support 
plugins for any network inventory info they would like to communicate back to 
mpirun. Each plugin (TCP, whatever) is given an opportunity to contribute to 
that payload. The data is included in the daemon's "phone home" message

* when the inventory arrives at mpirun, ORTE delivers it to the PMIx network 
support plugins for processing. As far as ORTE is concerned, it is an opaque 
"blob" - only the fabric plugin provider knows what is in it and how to process 
it. In the case of TCP (which I wrote), we store information on both the 
available static ports on each node and the available NICs (e.g., subnet they 
are attached to).

* when mpirun is ready to launch, it passes the process map down to the PMIx 
network support plugins (again, every plugin gets to see it) so they can 
assign/allocate network resources to the procs. In the case of TCP, we assign a 
static socket (or multiple sockets if they request it) to each process on each 
node, a prioritized list of the NICs they can use (based on distance), and the 
network coordinates of the NICs. This all gets bundled up into a per-plugin 
"blob" and passed up to mpirun for inclusion in the launch command sent to the 
daemons.

* when a daemon receives the launch command, it passes the "blobs" down to the 
local PMIx network support plugins, which parse the blob as they desire. In the 
case of TCP, we simply store the assignment info in the PMIx datastore for 
retrieval by the procs when they want to communicate with a peer or compute a 
topologically aware collective pattern.
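
To make the TCP assignment step above a bit more concrete, here is a minimal 
sketch of handing out static ports to the local procs on one node. It is only 
an illustration under stated assumptions: the function name 
example_assign_ports, the flat port pool, and the "first come, first served" 
policy are all made up for this example and are not taken from the PMIx TCP 
plugin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of PMIx: give each of `nprocs` local procs
 * `per_proc` static ports taken in order from the node's pool of free ports.
 * Proc p receives out[p * per_proc] .. out[p * per_proc + per_proc - 1].
 * `out` must have room for nprocs * per_proc entries.
 * Returns 0 on success, -1 if the request cannot be satisfied. */
int example_assign_ports(const uint16_t *pool, size_t npool,
                         size_t nprocs, size_t per_proc, uint16_t *out)
{
    assert(pool != NULL && out != NULL);

    /* Written as a division so nprocs * per_proc cannot overflow. */
    if (per_proc == 0 || nprocs > npool / per_proc) {
        return -1;
    }
    for (size_t i = 0; i < nprocs * per_proc; i++) {
        out[i] = pool[i];
    }
    return 0;
}
```

With a pool of five ports, two procs asking for two ports each would succeed, 
while three procs asking for two each would be rejected. A real plugin would 
also have to record the assignment in the "blob" it returns.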

The definition of coordinate values for each NIC is up to the network support 
plugins. The pmix_coord_t struct includes an array of integer coordinates along 
with a value indicating the number of dimensions and a flag indicating whether 
it is a "logical" or "physical" view - this is in keeping with the MPI topology 
WG. Some fabrics are writing plugins that provide that info per the vendor's 
algorithms. In the case of TCP, what I've done is rather simple. I provide an 
x,y,z "logical" coordinate for each NIC where:

* x represents the relative NIC index on the host where the proc is located - 
just a simple counter (e.g., this is the third NIC on the host)

* y represents the switch to which that NIC is attached - i.e., if you have the 
same y-coord as another NIC, you are attached to the same switch

* z represents the subnet - i.e., if you have the same z-coord as another NIC, 
then that NIC is on the same subnet as you
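
As a rough illustration of the x,y,z scheme above, the sketch below models a 
three-dimensional logical coordinate and the two comparisons that the y and z 
values are meant to enable. The type and function names (example_coord_t, 
example_tcp_coord, same_switch, same_subnet) are invented for this example; 
they are not the PMIx definitions, and the real pmix_coord_t layout should be 
taken from the PMIx headers. It also assumes switch and subnet identifiers are 
unique across the whole job.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a coordinate; NOT the real pmix_coord_t. */
typedef enum { EXAMPLE_VIEW_LOGICAL, EXAMPLE_VIEW_PHYSICAL } example_view_t;

typedef struct {
    example_view_t view;  /* logical or physical view */
    size_t dims;          /* number of dimensions in use */
    uint32_t coord[3];    /* [0] = x: NIC index, [1] = y: switch, [2] = z: subnet */
} example_coord_t;

/* Build a logical x,y,z coordinate as described for the TCP example. */
example_coord_t example_tcp_coord(uint32_t nic, uint32_t sw, uint32_t subnet)
{
    example_coord_t c = { EXAMPLE_VIEW_LOGICAL, 3, { nic, sw, subnet } };
    return c;
}

/* Same y-coord: both NICs hang off the same switch. */
bool same_switch(const example_coord_t *a, const example_coord_t *b)
{
    assert(a != NULL && b != NULL);
    assert(a->dims == 3 && b->dims == 3);
    return a->coord[1] == b->coord[1];
}

/* Same z-coord: both NICs sit on the same subnet. */
bool same_subnet(const example_coord_t *a, const example_coord_t *b)
{
    assert(a != NULL && b != NULL);
    assert(a->dims == 3 && b->dims == 3);
    return a->coord[2] == b->coord[2];
}
```

Note that x is only a per-host counter, so two NICs on different hosts can 
carry identical coordinates; the coordinate describes relative location, not 
identity.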

It is totally up to the plugin - the idea is to provide each process with 
information that allows them to know relative location. I'm quite open to 
modifying the TCP one as it was just done as an example for testing the 
infrastructure. You can retrieve coordinate info for any proc using PMIx_Get. 
You can also retrieve the relative communication cost to any proc - the plugin 
will compute it for you based on the coordinates, assuming the plugin supports 
that ability (in the case of my TCP one, it uses the coordinate to compute the 
number of hops because I numbered things to support that algo).
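
The actual hop-count algorithm in the TCP plugin is not spelled out here, so 
the following is only a guess at the general shape of such a computation. It 
assumes a made-up cost model (1 hop through a shared switch, 2 hops between 
switches on one subnet, 3 hops when crossing subnets) and a hypothetical 
function name, example_hop_cost. Treat it as a sketch of how a cost could be 
derived from x,y,z coordinates, not as what the plugin does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of each dimension in the x,y,z coordinate array. */
enum { EX_X_NIC = 0, EX_Y_SWITCH = 1, EX_Z_SUBNET = 2, EX_NDIMS = 3 };

/* Hypothetical relative cost between two NICs (assumed model, NOT PMIx):
 *   different subnet                 -> 3 (traffic must be routed)
 *   same subnet, different switch    -> 2
 *   same subnet, same switch         -> 1
 * The x value (per-host NIC counter) does not affect the cost here. */
unsigned example_hop_cost(const uint32_t a[EX_NDIMS], const uint32_t b[EX_NDIMS])
{
    assert(a != NULL && b != NULL);

    if (a[EX_Z_SUBNET] != b[EX_Z_SUBNET]) {
        return 3;
    }
    if (a[EX_Y_SWITCH] != b[EX_Y_SWITCH]) {
        return 2;
    }
    return 1;
}
```

A proc could use such a cost to order its peers, for example when picking 
neighbors for a topologically aware collective.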

PRRTE already knows how to do all this - there are a few simple changes 
required to sync OMPI. If folks are interested in exploring this further, 
please let me know.
