please point me to the patch.

----------

sent from my smart phone so no good type.

Howard
On Apr 15, 2016 1:04 PM, "Ralph Castain" <r...@open-mpi.org> wrote:

> I have a patch that I think will resolve this problem - would you please
> take a look?
>
> Ralph
>
>
> On Apr 15, 2016, at 7:32 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Actually, it did come across the developer list :-)
>
> Why don’t I resolve this by just ensuring that the key we create is
> properly filled? It’s a trivial fix in the PMI ess component
>
>
> On Apr 15, 2016, at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> I didn't copy dev on this.
>
>
> ---------- Forwarded message ----------
> From: *Howard Pritchard* <hpprit...@gmail.com>
> Date: Thursday, April 14, 2016
> Subject: psm2 and psm2_ep_open problems
> To: Open MPI Developers <de...@open-mpi.org>
>
>
> Hi Matias
>
> Actually I triaged this further. The Open MPI PMI subsystem is actually doing
> things correctly wrt env variable setting, with or without mpirun. The
> problem has to do with psm2 and the fact that on my cluster right now
> SLURM has only scheduled about 25 jobs. As a result, the unique key the
> PSM2 MTL is feeding to PSM2 has lots of zeros in the initial part of the
> key. This ends up messing up the epid generated in PSM2. The OFI MTL doesn't
> have this problem because the PSM2 provider has some of these LSBs set in
> the value it passes to PSM2.
>
> I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.
>
> Howard
>
> On Thursday, April 14, 2016, Cabral, Matias A wrote:
>
>> Hi Howard,
>>
>> I suspect this is the known issue when using SLURM with OMPI and PSM
>> that is discussed here:
>>
>> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>>
>> As of today, orte generates the psm_key, so when using SLURM this does
>> not happen and it is necessary to set it in the environment.
>> Here Ralph explains the workaround:
>>
>> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>>
>> As you found, an epid of 0 is not a valid value. So, basing comments on:
>>
>> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
>>
>> the assert of line 832. psmi_ep_open_device() will do:
>>
>>     /*
>>      * We use a LID of 0 for non-HFI communication.
>>      * Since a jobkey is not available from IPS, pull the
>>      * first 16 bits from the UUID.
>>      */
>>     *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>>                            (rank >> 3), rank, 0,
>>                            PSMI_HFI_TYPE_DEFAULT, rank);
>>
>> In the particular case you mention below, when there is no HFI (shared
>> memory), the rank is 0 and the passed key is 0, so epid will be 0.
>>
>> SOLUTION:
>>
>> Set OMPI_MCA_orte_precondition_transports in the environment to a value
>> different from 0.
>>
>> Thanks,
>>
>> _MAC
>>
>> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of* Howard Pritchard
>> *Sent:* Thursday, April 14, 2016 1:10 PM
>> *To:* Open MPI Developers List <de...@open-mpi.org>
>> *Subject:* [OMPI devel] psm2 and psm2_ep_open problems
>>
>> Hi Folks,
>>
>> So we have this brand-new omnipath cluster here at work,
>> but people are having problems using it on a single node using
>> srun as the job launcher.
>>
>> The customer wants to use srun to launch jobs, not the Open MPI
>> mpirun.
>>
>> The customer installed 1.10.1, but I can reproduce the
>> problem with v2.x and I'm sure with master, unless I build the
>> ofi mtl. ofi mtl works, psm2 mtl doesn't.
>>
>> I downloaded the psm2 code from github and started hacking.
>>
>> What appears to be the problem is that when running on a single
>> node one can go through a path in psmi_ep_open_device where
>> for a single process job, the value stored into epid is zero.
>>
>> This results in an assert failing in the __psm2_ep_open_internal
>> function.
>>
>> Is there a quick and dirty workaround that doesn't involve fixing
>> psm2 MTL? I could suggest to the sysadmins to install libfabric 1.3
>> and build the openmpi to only have ofi mtl, but perhaps there's
>> another way to get psm2 mtl to work for single node jobs? I'd prefer
>> to not ask users to disable psm2 mtl explicitly for their single node
>> jobs.
>>
>> Thanks for suggestions.
>>
>> Howard
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18773.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18776.php
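
For anyone following the thread, here is a minimal sketch of the failure mode described in the quoted messages. It is written against a made-up bit layout: toy_epid_pack, first16 and toy_epid_from_key are illustrative names, the field widths are assumptions and are not the real PSMI_EPID_PACK from psm_ep.c, and the HFI type is assumed to contribute 0. It only shows that when every packed field is zero (first 16 bits of the key are zero, rank is 0) the resulting epid is 0, while any nonzero leading key bits avoid that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical packing, NOT the real PSMI_EPID_PACK: the field positions
 * below are invented for illustration. The only property that matters is
 * that the fields are shifted and OR-ed together, so all-zero inputs give 0.
 */
static uint64_t toy_epid_pack(uint16_t key16, uint32_t rank, uint32_t hfi_type)
{
    return ((uint64_t)key16 << 48)                    /* first 16 bits of the job key */
         | ((uint64_t)((rank >> 3) & 0xff) << 40)     /* stands in for (rank >> 3)    */
         | ((uint64_t)(rank & 0xff) << 32)            /* stands in for rank           */
         | ((uint64_t)(hfi_type & 0x7) << 29)         /* assumed to be 0 here         */
         | ((uint64_t)rank & 0x1fffffff);             /* trailing rank field          */
}

/* Pull the first 16 bits of the key, as the quoted comment describes. */
static uint16_t first16(const uint8_t key[16])
{
    uint16_t v;
    memcpy(&v, key, sizeof v);   /* same bytes as the cast in the excerpt */
    return v;
}

/* Build the toy epid for a given job key and rank (HFI type assumed 0). */
static uint64_t toy_epid_from_key(const uint8_t key[16], uint32_t rank)
{
    assert(key != NULL);
    return toy_epid_pack(first16(key), rank, 0);
}
```

With a key whose first two bytes are zero (for example, one derived from a small job number that only fills the later bytes), toy_epid_from_key(key, 0) returns 0, which corresponds to the invalid epid that trips the assert. Setting any of the first 16 bits, or using a nonzero rank, gives a nonzero result in this sketch.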