Actually, it did come across the developer list :-)

Why don’t I resolve this by just ensuring that the key we create is properly
filled? It’s a trivial fix in the PMI ess component.
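
As a rough sketch only of what "properly filled" could mean, and not the actual PMI ess code: the helper below spreads a small job id over a 128-bit key and then forces the leading 16 bits to be non-zero in either byte order. The names mix64, fill_job_key and key_first16, and the choice of job id plus step id as inputs, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* splitmix64 finalizer: spreads small inputs over all 64 bits. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Hypothetical helper: build a 128-bit job key from a job and step id. */
static void fill_job_key(uint32_t job_id, uint32_t step_id, uint64_t key[2])
{
    assert(key != NULL);
    key[0] = mix64(((uint64_t)job_id << 32) | step_id);
    key[1] = mix64(key[0]);
    /* The first two bytes in memory are the low 16 bits of key[0] on
     * little-endian hosts and the high 16 bits on big-endian hosts,
     * so guard both. */
    if ((key[0] & 0xFFFFULL) == 0)
        key[0] |= 1ULL;
    if ((key[0] >> 48) == 0)
        key[0] |= 1ULL << 48;
}

/* Reads the leading 16 bits the way a (uint16_t *) cast would,
 * but without the unaligned-access risk. */
static uint16_t key_first16(const uint64_t key[2])
{
    uint16_t first16;
    memcpy(&first16, key, sizeof first16);
    return first16;
}
```

With this, even job id 0 produces a key whose first 16 bits are non-zero, which is the property the quoted PSM2 code path below depends on.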


> On Apr 15, 2016, at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> I didn't copy dev on this.
> 
> 
> 
> ---------- Forwarded message ----------
> From: Howard Pritchard <hpprit...@gmail.com>
> Date: Thursday, April 14, 2016
> Subject: psm2 and psm2_ep_open problems
> To: Open MPI Developers <de...@open-mpi.org>
> 
> 
> Hi Matias
> 
> Actually, I triaged this further.  The Open MPI PMI subsystem is doing
> things correctly with respect to environment variable setting, with or
> without mpirun.  The problem lies in PSM2 and the fact that SLURM has so
> far scheduled only about 25 jobs on my cluster.  As a result, the unique
> key that the PSM2 MTL feeds to PSM2 has lots of zeros in its initial part,
> which ends up messing up the epid generated in PSM2.  The OFI MTL doesn't
> have this problem because the PSM2 provider has some of these LSBs set in
> the value it passes to PSM2.
> 
> I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.
> 
> Howard
> 
> On Thursday, April 14, 2016, Cabral, Matias A wrote:
> Hi Howard,
>  
> 
> I suspect this is the known issue with using SLURM with OMPI and PSM that
> is discussed here:
> 
> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>  
> 
> As of today, ORTE generates the psm_key.  When using SLURM this does not
> happen, so it is necessary to set the key in the environment.  Here Ralph
> explains the workaround:
> 
> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>  
> 
> As you found, an epid of 0 is not a valid value.  Basing my comments on:
> 
> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
> 
> and the assert at line 832, psmi_ep_open_device() does the following:
> 
>  
> 
>     /*
>      * We use a LID of 0 for non-HFI communication.
>      * Since a jobkey is not available from IPS, pull the
>      * first 16 bits from the UUID.
>      */
>     *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>                            (rank >> 3), rank, 0,
>                            PSMI_HFI_TYPE_DEFAULT, rank);
> 
> 
> In the particular case you mention below, where there is no HFI (shared
> memory only), the rank is 0, and the passed key is 0, the epid will be 0.
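
To make the arithmetic concrete, here is a minimal sketch of that path. It does not use the real PSMI_EPID_PACK: the field layout in pack_epid_sketch is invented for illustration, and HFI_TYPE_ASSUMED_ZERO assumes the HFI type constant contributes no set bits, as the explanation above implies. Whatever the real layout is, a pack that only shifts and ORs its inputs returns 0 when every input is 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: stands in for PSMI_HFI_TYPE_DEFAULT, taken to be 0 here. */
#define HFI_TYPE_ASSUMED_ZERO 0u

/* Hypothetical stand-in for PSMI_EPID_PACK; shifts are illustrative only. */
static uint64_t pack_epid_sketch(uint16_t lid, uint8_t context,
                                 uint8_t subcontext, uint8_t hfi_unit,
                                 uint8_t hfi_type, uint16_t rank)
{
    return ((uint64_t)lid << 48) |
           ((uint64_t)context << 40) |
           ((uint64_t)subcontext << 32) |
           ((uint64_t)hfi_unit << 24) |
           ((uint64_t)hfi_type << 16) |
           (uint64_t)rank;
}

/* Mirrors the quoted shared-memory path: the first 16 bits of the job
 * key take the place of the LID. */
static uint64_t shm_epid_sketch(const uint8_t key[16], unsigned rank)
{
    uint16_t first16;
    assert(key != NULL);
    memcpy(&first16, key, sizeof first16); /* avoids an unaligned cast */
    return pack_epid_sketch(first16, (uint8_t)(rank >> 3), (uint8_t)rank,
                            0, HFI_TYPE_ASSUMED_ZERO, (uint16_t)rank);
}
```

Calling shm_epid_sketch with an all-zero key and rank 0 returns 0, while any non-zero bit in the first two key bytes, or a non-zero rank, gives a non-zero result.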
> 
>  
> 
> SOLUTION: set OMPI_MCA_orte_precondition_transports in the environment to
> a value other than 0.
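
A small sketch of producing such a value follows. It assumes the variable takes two 16-digit hexadecimal words joined by a dash; that format is my recollection of the workaround linked above and should be checked against that post. format_transport_key is a hypothetical helper, not part of Open MPI.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes "<16 hex digits>-<16 hex digits>" into buf.
 * Returns 0 on success, -1 if the key is all zero or buf is too small.
 * The two-word dashed format is an assumption, not a documented contract. */
static int format_transport_key(uint64_t hi, uint64_t lo,
                                char *buf, size_t len)
{
    assert(buf != NULL);
    if (hi == 0 && lo == 0)
        return -1; /* an all-zero key is the failing case in this thread */
    int n = snprintf(buf, len, "%016" PRIx64 "-%016" PRIx64, hi, lo);
    return (n == 33 && (size_t)n < len) ? 0 : -1;
}
```

The resulting string would then be exported as OMPI_MCA_orte_precondition_transports before calling srun. Note that this only rejects an all-zero key; per the follow-up at the top of this thread, a key whose leading bits are zero may still be a problem.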
> 
>  
> 
> Thanks,
> 
>  
> 
> _MAC
> 
>  
> 
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard
> Pritchard
> Sent: Thursday, April 14, 2016 1:10 PM
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: [OMPI devel] psm2 and psm2_ep_open problems
> 
>  
> 
> Hi Folks,
> 
>  
> 
> So we have this brand-new Omni-Path cluster here at work, but people are
> having problems using it on a single node with srun as the job launcher.
> 
>  
> 
> The customer wants to use srun to launch jobs, not the Open MPI mpirun.
> 
>  
> 
> The customer installed 1.10.1, but I can reproduce the problem with v2.x
> and, I'm sure, with master, unless I build the OFI MTL.  The OFI MTL
> works; the PSM2 MTL doesn't.
> 
>  
> 
> I downloaded the psm2 code from github and started hacking.
> 
>  
> 
> What appears to be the problem is that when running on a single node, one
> can go through a path in psmi_ep_open_device where, for a single-process
> job, the value stored into epid is zero.
> 
>  
> 
> This results in an assert failing in the __psm2_ep_open_internal function.
> 
>  
> 
> Is there a quick and dirty workaround that doesn't involve fixing the PSM2
> MTL?  I could suggest that the sysadmins install libfabric 1.3 and build
> Open MPI with only the OFI MTL, but perhaps there's another way to get the
> PSM2 MTL to work for single-node jobs?  I'd prefer not to ask users to
> disable the PSM2 MTL explicitly for their single-node jobs.
> 
>  
> 
> Thanks for suggestions.
> 
>  
> 
> Howard
> 
>  
> 
>  
> 
>  
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18773.php
