Actually, it did come across the developer list :-)

Why don’t I resolve this by just ensuring that the key we create is properly filled? It’s a trivial fix in the PMI ess component.
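A minimal sketch of what "properly filled" could mean, assuming a 16-byte key whose first 16 bits are what PSM2 reads (per the psm_ep.c snippet quoted below). The names KEY_BYTES, key_leading16 and fill_key_prefix are hypothetical; this is an illustration, not the actual ess/PMI code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_BYTES 16  /* assumption: 128-bit job key (two 64-bit halves) */

/* The 16 bits the quoted psm2 snippet reads: ((uint16_t *) key)[0]. */
static uint16_t key_leading16(const uint8_t key[KEY_BYTES])
{
    uint16_t v;
    assert(key != NULL);
    memcpy(&v, key, sizeof v);   /* avoids alignment and aliasing problems */
    return v;
}

/*
 * Hypothetical helper: if the leading 16 bits are zero, fold the remaining
 * bytes into them. The result depends only on the key itself, so every
 * process starting from the same key ends up with the same value.
 * Returns 1 if the key was changed, 0 if it was already fine.
 */
static int fill_key_prefix(uint8_t key[KEY_BYTES])
{
    uint8_t even = 0, odd = 0;
    size_t i;

    if (key_leading16(key) != 0)
        return 0;

    for (i = 2; i < KEY_BYTES; i++) {
        if (i & 1)
            odd ^= key[i];
        else
            even ^= key[i];
    }
    key[0] = even;
    key[1] = odd;
    if (key_leading16(key) == 0)
        key[0] = 1;              /* all-zero key: force a nonzero prefix */
    return 1;
}
```

The point of folding rather than randomizing is that all ranks in a job must still agree on the key, so any adjustment has to be a pure function of the key they already share.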

> On Apr 15, 2016, at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> I didn't copy dev on this.
>
> ---------- Forwarded message ----------
> From: Howard Pritchard <hpprit...@gmail.com>
> Date: Thursday, April 14, 2016
> Subject: psm2 and psm2_ep_open problems
> To: Open MPI Developers <de...@open-mpi.org>
>
> Hi Matias
>
> Actually I triaged this further. The Open MPI PMI subsystem is actually doing
> things correctly wrt env variable setting, with or without mpirun. The
> problem has to do with psm2 and the fact that on my cluster right now
> SLURM has only scheduled about 25 jobs. As a result, the unique key the PSM2
> MTL feeds to PSM2 has lots of zeros in the initial part of the key. This
> ends up messing up the epid generated in PSM2. The OFI MTL doesn't have this
> problem because the PSM2 provider has some of these LSBs set in the value it
> passes to PSM2.
>
> I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.
>
> Howard
>
> On Thursday, April 14, 2016, Cabral, Matias A wrote:
> Hi Howard,
>
> I suspect this is the known issue with using SLURM with OMPI and PSM
> that is discussed here:
>
> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>
> As of today, orte generates the psm_key, so when using SLURM this does not
> happen and it is necessary to set it in the environment. Here Ralph explains
> the workaround:
>
> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>
> As you found, an epid of 0 is not a valid value. So, basing my comments on:
>
> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
>
> see the assert at line 832. psmi_ep_open_device() will do:
>
>     /*
>      * We use a LID of 0 for non-HFI communication.
>      * Since a jobkey is not available from IPS, pull the
>      * first 16 bits from the UUID.
>      */
>     *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>                            (rank >> 3), rank, 0,
>                            PSMI_HFI_TYPE_DEFAULT, rank);
>
> In the particular case you mention below, when there is no HFI (shared
> memory), the rank is 0, and the passed key is 0, the epid will be 0.
>
> SOLUTION: Set OMPI_MCA_orte_precondition_transports in the environment to
> a value other than 0.
>
> Thanks,
>
> _MAC
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Thursday, April 14, 2016 1:10 PM
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: [OMPI devel] psm2 and psm2_ep_open problems
>
> Hi Folks,
>
> So we have this brand-new omnipath cluster here at work,
> but people are having problems using it on a single node using
> srun as the job launcher.
>
> The customer wants to use srun to launch jobs, not the Open MPI
> mpirun.
>
> The customer installed 1.10.1, but I can reproduce the
> problem with v2.x and I'm sure with master, unless I build the
> ofi mtl. ofi mtl works, psm2 mtl doesn't.
>
> I downloaded the psm2 code from github and started hacking.
>
> What appears to be the problem is that when running on a single
> node one can go through a path in psmi_ep_open_device where,
> for a single process job, the value stored into epid is zero.
>
> This results in an assert failing in the __psm2_ep_open_internal
> function.
>
> Is there a quick and dirty workaround that doesn't involve fixing
> psm2 MTL? I could suggest to the sysadmins to install libfabric 1.3
> and build the openmpi to only have ofi mtl, but perhaps there's
> another way to get psm2 mtl to work for single node jobs? I'd prefer
> to not ask users to disable psm2 mtl explicitly for their single node jobs.
>
> Thanks for suggestions.
>
> Howard
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18773.php