Hi folks,

We have a brand-new Omni-Path cluster here at work, but people are having problems using it on a single node with srun as the job launcher.
The customer wants to use srun to launch jobs, not Open MPI's mpirun. The customer installed 1.10.1, but I can reproduce the problem with v2.x, and I'm sure with master as well, unless I build the ofi MTL. The ofi MTL works; the psm2 MTL doesn't.

I downloaded the psm2 code from GitHub and started hacking. The problem appears to be that, when running on a single node, a single-process job can take a path through psmi_ep_open_device that stores zero into epid. This trips an assert in the __psm2_ep_open_internal function.

Is there a quick and dirty workaround that doesn't involve fixing the psm2 MTL? I could suggest that the sysadmins install libfabric 1.3 and build Open MPI with only the ofi MTL, but perhaps there's another way to get the psm2 MTL to work for single-node jobs? I'd prefer not to ask users to disable the psm2 MTL explicitly for their single-node jobs.

Thanks for any suggestions.

Howard
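
P.S. For reference, the explicit per-job disable mentioned above would look roughly like the sketch below. It relies on Open MPI reading MCA parameters from OMPI_MCA_<name> environment variables and on srun exporting the submitting environment to the launched processes (its default behavior). The application name in the comment is hypothetical, and which component Open MPI falls back to depends on how it was built.

```shell
# Exclude the psm2 MTL for this shell session / job script.
# The leading '^' tells the MCA framework to exclude the listed component.
export OMPI_MCA_mtl='^psm2'

# Alternative (assumes Open MPI was built with the ofi MTL):
# export OMPI_MCA_mtl=ofi

# Then launch as usual, for example:
# srun -n 1 ./my_mpi_app    # my_mpi_app is a hypothetical application name
```

This is exactly the kind of per-user setting that would be nice to avoid, since every single-node job script would need it.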