In my non-expert opinion, OFI already provides the right abstraction for multi-rail situations in the form of domains:
"Domains usually map to a specific local network interface adapter. A domain may either refer to the entire NIC, a port on a multi-port NIC, or a virtual device exposed by a NIC. From the viewpoint of the application, a domain identifies a set of resources that may be used together." ( https://github.com/ofiwg/ofi-guide/blob/master/OFIGuide.md) >From this, MPI libraries and the like would then need to support multiple domains. Jeff On Fri, Jun 2, 2017 at 12:21 PM, Hefty, Sean <[email protected]> wrote: > > Copying libfabric-users mailing list on this message. > > Daniel, would you be able to join an ofiwg call to discuss these in more detail? The calls are every other Tuesday from 9-10 PST, with the next call on Tuesday the 6th. > > - Sean > > > We work with HPC systems that deploy same but multiple network > > adapters (including Intel OmniPath and MLX infiniband adapters) on > > compute nodes. > > > > Over time, we encountered two issues which we believe can be addressed > > by OFI library. > > > > First, a number of MPI implementations assume homogenous SW/HW setup > > on all compute nodes. For example, assume nodes with 2 adapters and 2 > > separate networks. Some MPI implementations assume that network > > adapter A resides on CPU socket 0 on all nodes and connect to network > > 0; and network adapter B resides on CPU socket 1 and connect to > > network 1. Unfortunately that is not always the case. There are > > systems where some nodes use adapter A to connect to network 0 and > > others use adapter B to connect to network 0. Same for network 1, > > where we have mixed (crossed) adapters connected to same network. In > > such cases, MPII and lower layers cannot establish peer to peer > > connection. The best way to solve this is to use the network subnet > > ID to establish connection between pairs. When there are multiple > > networks and subnetwork IDs, mpirun would specify a network ID > > (Platform MPI does this) and then the software can figure out from the > > subnet ID what adapter each node is using to connect to such network. > > Instead of implementing this logic in each MPI, it would be great if > > OFI implements this logic since it is a one stop shop over all network > > devices and providers. > > > > Second, multirail support is a hit and miss across MPI > > implementations. Intel Omnipath PSM2 library actually did a great job > > here by implementing multirail support at the PSM2 level. This means > > all above layers like MPI would get this functionality for free. > > Again, given that many MPI implementation can be built on top of OFI, > > It would be also great if OFI has multirail support. > > > > Thank you > > Daniel Faraj > _______________________________________________ > Libfabric-users mailing list > [email protected] > http://lists.openfabrics.org/mailman/listinfo/libfabric-users -- Jeff Hammond [email protected] http://jeffhammond.github.io/
