Hi Howard,

I suspect this is the known issue with using SLURM with OMPI and PSM that 
is discussed here:
https://www.open-mpi.org/community/lists/users/2010/12/15220.php

As of today, ORTE generates the psm_key, so when launching with SLURM the key 
is not generated and it is necessary to set it in the environment.  Here Ralph 
explains the workaround:
https://www.open-mpi.org/community/lists/users/2010/12/15242.php

As you found, an epid of 0 is not a valid value. I am basing my comments on:
https://github.com/01org/opa-psm2/blob/master/psm_ep.c

See the assert at line 832. psmi_ep_open_device() does the following:

                /*
                 * We use a LID of 0 for non-HFI communication.
                 * Since a jobkey is not available from IPS, pull the
                 * first 16 bits from the UUID.
                 */

                *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
                                       (rank >> 3), rank, 0,
                                       PSMI_HFI_TYPE_DEFAULT, rank);

In the particular case you mention below, where there is no HFI (shared 
memory only), the rank is 0 and the passed key is 0, so every field packed 
into the epid is 0 and the epid itself will be 0.

SOLUTION: Set OMPI_MCA_orte_precondition_transports in the environment to a 
value other than 0.
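
As a minimal sketch of that workaround: the key format used here (two 
16-digit hexadecimal values separated by a dash) is my understanding of what 
ORTE normally generates, so please check it against Ralph's message linked 
above. The specific value is an arbitrary placeholder, and ./my_app is a 
hypothetical binary name. Every byte of the placeholder is non-zero so that 
the first 16 bits pulled from the key cannot be 0.

```shell
# Assumed format: <16 hex digits>-<16 hex digits>; the value is an arbitrary
# non-zero placeholder and must be the same for every rank in the job.
export OMPI_MCA_orte_precondition_transports=0123456789abcdef-fedcba9876543210

# Then launch as usual (hypothetical application name):
# srun -n 1 ./my_app
```

Because srun propagates the caller's environment by default, exporting the 
variable once in the job script before the srun line should be enough for 
all ranks to see the same key.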

Thanks,

_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Thursday, April 14, 2016 1:10 PM
To: Open MPI Developers List <de...@open-mpi.org>
Subject: [OMPI devel] psm2 and psm2_ep_open problems

Hi Folks,

So we have this brand-new Omni-Path cluster here at work,
but people are having problems using it on a single node using
srun as the job launcher.

The customer wants to use srun to launch jobs, not the Open MPI
mpirun.

The customer installed 1.10.1, but I can reproduce the
problem with v2.x and I'm sure with master, unless I build the
ofi mtl.  ofi mtl works, psm2 mtl doesn't.

I downloaded the psm2 code from github and started hacking.

What appears to be the problem is that when running on a single
node one can go through a path in psmi_ep_open_device where
for a single process job, the value stored into epid is zero.

This results in an assert failing in the __psm2_ep_open_internal
function.

Is there a quick and dirty workaround that doesn't involve fixing
psm2 MTL?  I could suggest to the sysadmins to install libfabric 1.3
and build the openmpi to only have ofi mtl, but perhaps there's
another way to get psm2 mtl to work for single node jobs?  I'd prefer
to not ask users to disable psm2 mtl explicitly for their single node jobs.

Thanks for suggestions.

Howard


