Copying libfabric-users mailing list on this message. Daniel, would you be able to join an ofiwg call to discuss these in more detail? The calls are every other Tuesday from 9-10 PST, with the next call on Tuesday the 6th.
- Sean

> We work with HPC systems that deploy multiple network adapters
> (including Intel OmniPath and Mellanox InfiniBand adapters) on
> compute nodes.
>
> Over time, we encountered two issues which we believe can be
> addressed by the OFI library.
>
> First, a number of MPI implementations assume a homogeneous SW/HW
> setup on all compute nodes. For example, assume nodes with 2 adapters
> and 2 separate networks. Some MPI implementations assume that network
> adapter A resides on CPU socket 0 on all nodes and connects to
> network 0, and that network adapter B resides on CPU socket 1 and
> connects to network 1. Unfortunately, that is not always the case.
> There are systems where some nodes use adapter A to connect to
> network 0 and others use adapter B to connect to network 0; likewise
> for network 1, where we have mixed (crossed) adapters connected to
> the same network. In such cases, MPI and the lower layers cannot
> establish a peer-to-peer connection. The best way to solve this is to
> use the network subnet ID to establish the connection between pairs.
> When there are multiple networks and subnet IDs, mpirun would specify
> a network ID (Platform MPI does this), and the software can then
> figure out from the subnet ID which adapter each node is using to
> connect to that network. Instead of implementing this logic in each
> MPI, it would be great if OFI implemented it, since OFI is a one-stop
> shop over all network devices and providers.
>
> Second, multirail support is hit-or-miss across MPI implementations.
> The Intel OmniPath PSM2 library actually did a great job here by
> implementing multirail support at the PSM2 level, which means all the
> layers above, like MPI, get this functionality for free. Again, given
> that many MPI implementations can be built on top of OFI, it would be
> great if OFI also had multirail support.
>
> Thank you
> Daniel Faraj

_______________________________________________
ofiwg mailing list
[email protected]
http://lists.openfabrics.org/mailman/listinfo/ofiwg
