I have a patch that I think will resolve this problem - would you please take a look?

Ralph

Attachment: matias.diff
Description: Binary data


On Apr 15, 2016, at 7:32 AM, Ralph Castain <r...@open-mpi.org> wrote:

Actually, it did come across the developer list :-)

Why don’t I resolve this by just ensuring that the key we create is properly filled? It’s a trivial fix in the PMI ess component.


On Apr 15, 2016, at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

I didn't copy dev on this.



---------- Forwarded message ----------
From: Howard Pritchard <hpprit...@gmail.com>
Date: Thursday, April 14, 2016
Subject: psm2 and psm2_ep_open problems
To: Open MPI Developers <de...@open-mpi.org>


Hi Matias

Actually, I triaged this further.  The Open MPI PMI subsystem is doing things correctly with respect to environment variable setting, with or without mpirun.  The problem has to do with PSM2 and the fact that on my cluster right now SLURM has only scheduled about 25 jobs.  As a result, the unique key the PSM2 MTL feeds to PSM2 has lots of zeros in the initial part of the key, which ends up messing up the epid generated in PSM2.  The OFI MTL doesn't have this problem because the PSM2 provider has some of these LSBs set in the value it passes to PSM2.

I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.

Howard

On Thursday, April 14, 2016, Cabral, Matias A wrote:

Hi Howard,

 

I suspect this is the known issue with using SLURM with OMPI and PSM that is discussed here:

https://www.open-mpi.org/community/lists/users/2010/12/15220.php

 

As of today, ORTE generates the psm_key, so when launching with SLURM this does not happen and it is necessary to set the key in the environment.  Here Ralph explains the workaround:

https://www.open-mpi.org/community/lists/users/2010/12/15242.php

 

As you found, an epid of 0 is not a valid value. So, basing my comments on:

https://github.com/01org/opa-psm2/blob/master/psm_ep.c

see the assert at line 832. psmi_ep_open_device() will do:

 

    /*
     * We use a LID of 0 for non-HFI communication.
     * Since a jobkey is not available from IPS, pull the
     * first 16 bits from the UUID.
     */
    *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
                           (rank >> 3), rank, 0,
                           PSMI_HFI_TYPE_DEFAULT, rank);

In the particular case you mention below, when there is no HFI (shared memory), the rank is 0, and the passed key is 0, the epid will be 0.
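
To make the arithmetic concrete, here is a rough sketch of that case. The function sketch_epid_pack and its bit layout are hypothetical stand-ins, not the real PSMI_EPID_PACK macro, and the HFI type is passed as a parameter because its actual value is not shown in this thread. The only point is that a packing built from shifts and ORs yields 0 when every input field is 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PSMI_EPID_PACK. The field widths and their
 * order are assumptions made for illustration, NOT the real PSM2 layout. */
static uint64_t sketch_epid_pack(uint16_t lid, uint32_t context,
                                 uint32_t subcontext, uint32_t hfi_unit,
                                 uint32_t hfi_type, uint32_t rank)
{
    return ((uint64_t)lid << 48) |
           ((uint64_t)(context & 0xffu) << 40) |
           ((uint64_t)(subcontext & 0xffu) << 32) |
           ((uint64_t)(hfi_unit & 0xfu) << 28) |
           ((uint64_t)(hfi_type & 0xfu) << 24) |
           (uint64_t)(rank & 0xffffffu);
}

/* Same argument shape as the quoted call on the non-HFI (shared memory)
 * path: the "LID" is taken from the first 16 bits of the job key. */
static uint64_t sketch_shm_epid(const uint8_t key[16], uint32_t rank,
                                uint32_t hfi_type)
{
    uint16_t lid;

    assert(key != NULL);
    /* memcpy reads the first two bytes without the aliasing cast. */
    memcpy(&lid, key, sizeof lid);
    return sketch_epid_pack(lid, rank >> 3, rank, 0, hfi_type, rank);
}
```

With an all-zero key, rank 0, and an HFI type of 0 (assumed here), sketch_shm_epid() returns 0. Setting any bit in the first two bytes of the key, or using a non-zero rank, makes the result non-zero under this layout.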

 

SOLUTION: set OMPI_MCA_orte_precondition_transports in the environment to a value other than 0.
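
As a sketch of what such a value could look like: the helper below formats a 128-bit key as two 16-digit hex halves joined by a dash, which is the form I believe the workaround post linked above uses. Treat that format as an assumption and check it against the post. The name sketch_format_key is hypothetical and not part of Open MPI.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Writes "<16 hex digits>-<16 hex digits>" into buf.
 * Returns 0 on success, -1 if the key is all zero or buf is too small.
 * The string format is an assumption; verify it before relying on it. */
static int sketch_format_key(uint64_t hi, uint64_t lo, char *buf, size_t len)
{
    int n;

    if (hi == 0 && lo == 0) {
        return -1;  /* an all-zero key is the case being avoided */
    }
    n = snprintf(buf, len, "%016" PRIx64 "-%016" PRIx64, hi, lo);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

The resulting string would then be exported as OMPI_MCA_orte_precondition_transports before calling srun. Note that this sketch only rejects an all-zero key; it does not guarantee that the particular 16 bits PSM2 reads are non-zero.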

 

Thanks,

 

_MAC

 

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Thursday, April 14, 2016 1:10 PM
To: Open MPI Developers List <de...@open-mpi.org>
Subject: [OMPI devel] psm2 and psm2_ep_open problems

 

Hi Folks,

 

So we have this brand-new Omni-Path cluster here at work, but people are having problems using it on a single node with srun as the job launcher.

 

The customer wants to use srun to launch jobs, not the Open MPI mpirun.

 

The customer installed 1.10.1, but I can reproduce the problem with v2.x and, I'm sure, with master, unless I build the OFI MTL.  The OFI MTL works; the PSM2 MTL doesn't.

 

I downloaded the psm2 code from github and started hacking.

 

What appears to be the problem is that when running on a single node, one can go through a path in psmi_ep_open_device where, for a single-process job, the value stored into epid is zero.

 

This results in an assert failing in the __psm2_ep_open_internal function.

 

Is there a quick and dirty workaround that doesn't involve fixing the PSM2 MTL?  I could suggest to the sysadmins that they install libfabric 1.3 and build Open MPI with only the OFI MTL, but perhaps there's another way to get the PSM2 MTL to work for single-node jobs?  I'd prefer not to ask users to disable the PSM2 MTL explicitly for their single-node jobs.

 

Thanks for suggestions.

 

Howard

 

 

 


_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18773.php

