I'll be writing a series of notes containing thoughts on how to exploit PMIx-provided information, especially covering aspects that might not be obvious (e.g., attributes that might not be widely known). This first note covers the topic of collective optimization.
PMIx provides network-related information that can be used in the construction of collectives - in this case, hierarchical collectives that minimize cross-switch communication. Several pieces of information that can help with the construction of such collectives are provided by PMIx at the time of process execution. These include:

* PMIX_LOCAL_PEERS - the list of local peers (i.e., procs from your nspace) sharing your node. This can be used to aggregate the contribution from participating procs on the node to (for example) the lowest-ranked participant on that node (call this the "node leader"). See the sketch at the end of this note for an example of retrieving this value.

* PMIX_SWITCH_PEERS - the list of peers that share the same switch as the proc specified in the call to PMIx_Get. Multi-NIC environments will return an array of results, each element containing the NIC and the list of peers sharing the switch to which that NIC is connected. This can be used to aggregate the contribution across switches - e.g., by having the lowest-ranked participating proc on each switch take part in an allgather, and then distribute the results to the participating node leaders for final distribution across their nodes.

In the case of non-flat fabrics, further information regarding the topology of the fabric and the location of each proc within that topology is provided to aid in the construction of a collective. These include:

* PMIX_NETWORK_COORDINATE - network coordinate of the specified process in the given view type (e.g., logical vs physical), expressed as a pmix_coord_t struct that contains both the coordinates and the number of dimensions

* PMIX_NETWORK_VIEW - requested view type (e.g., logical vs physical)

* PMIX_NETWORK_DIMS - number of dimensions in the specified network plane/view

In addition, there are some values that can aid in interpreting this info and/or describing it (e.g., in diagnostic output):

* PMIX_NETWORK_PLANE - string ID of a network plane

* PMIX_NETWORK_SWITCH - string ID of a network switch

* PMIX_NETWORK_NIC - string ID of a NIC

* PMIX_NETWORK_SHAPE - number of interfaces (uint32_t) on each dimension of the specified network plane in the requested view

* PMIX_NETWORK_SHAPE_STRING - network shape expressed as a string (e.g., "10x12x2")

Obviously, the availability of this support depends directly on access to the required information. In the case of managed fabrics, this is provided by PMIx plugins that obtain it directly from the respective fabric manager. I am writing the support for Cray's Slingshot fabric, but any managed fabric can be supported should someone wish to do so. Unmanaged fabrics pose a bit of a challenge (e.g., how does one determine who shares your switch?), but I suspect those who understand those environments can devise a solution should they choose to pursue it. Remember, PMIx includes interfaces that allow the daemon-level PMIx servers to collect any information the fabric plugins deem useful, from either the fabric or the local node level, and roll it up for later use - this allows us, for example, to provide the fabric support plugins with information on the locality of NICs on each node, which they then use in assigning network endpoints.

This support will be appearing in PMIx (and thus in OMPI) starting this summer. You can play with it now, if you like - there are a couple of test examples in the PMIx code base (see src/mca/pnet) that provide simulated values being used by our early adopters for development. You are welcome to use those, or to write your own plugin.
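For those who want to experiment before the support lands, here is a minimal sketch of how a proc might pull a couple of these values. It assumes a PMIx v4-style client, that PMIX_LOCAL_PEERS comes back as a comma-delimited string of ranks, and that PMIX_NETWORK_COORDINATE comes back as the pmix_coord_t described above (accessed via val->data.coord). The exact keys, value types, and availability depend on your PMIx version and on whether a fabric plugin has supplied the data, so treat this as illustrative rather than definitive.

/*
 * Sketch: pick a "node leader" from PMIX_LOCAL_PEERS and query this proc's
 * network coordinate. Value types are assumptions as noted above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <pmix.h>

int main(int argc, char **argv)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        exit(1);
    }

    /* PMIX_LOCAL_PEERS is node-level info - ask for it with the wildcard
     * rank, then take the lowest listed rank as the node leader */
    PMIX_PROC_CONSTRUCT(&wildcard);
    (void)strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;
    rc = PMIx_Get(&wildcard, PMIX_LOCAL_PEERS, NULL, 0, &val);
    if (PMIX_SUCCESS == rc && NULL != val && PMIX_STRING == val->type) {
        uint32_t leader = UINT32_MAX;
        char *peers = strdup(val->data.string);
        char *tok = strtok(peers, ",");
        while (NULL != tok) {
            uint32_t r = (uint32_t)strtoul(tok, NULL, 10);
            if (r < leader) {
                leader = r;
            }
            tok = strtok(NULL, ",");
        }
        printf("Rank %u: node leader is rank %u\n", myproc.rank, leader);
        free(peers);
        PMIX_VALUE_RELEASE(val);
    }

    /* PMIX_NETWORK_COORDINATE is only present when a fabric plugin has
     * supplied it - a failed Get here simply means "no topology info" */
    val = NULL;
    rc = PMIx_Get(&myproc, PMIX_NETWORK_COORDINATE, NULL, 0, &val);
    if (PMIX_SUCCESS == rc && NULL != val && PMIX_COORD == val->type) {
        pmix_coord_t *coord = val->data.coord;
        printf("Rank %u: %zu-dimensional coordinate, first dim = %u\n",
               myproc.rank, coord->dims,
               (coord->dims > 0) ? coord->coord[0] : 0);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}

From there, the node leader can aggregate its local contributions, and the coordinate (or the switch peer lists) can feed whatever mapping of procs to switches and planes your collective algorithm chooses to use.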
As always, I'm happy to provide advice/help to those interested in utilizing these capabilities.

Ralph