Please point me to the patch.

----------

Sent from my smartphone, so excuse the typing.

Howard
On Apr 15, 2016 1:04 PM, "Ralph Castain" <r...@open-mpi.org> wrote:

> I have a patch that I think will resolve this problem - would you please
> take a look?
>
> Ralph
>
>
>
> On Apr 15, 2016, at 7:32 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Actually, it did come across the developer list :-)
>
> Why don’t I resolve this by just ensuring that the key we create is
> properly filled? It’s a trivial fix in the PMI ess component.
>
>
> On Apr 15, 2016, at 7:26 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> I didn't copy dev on this.
>
>
>
> ---------- Forwarded message ----------
> From: *Howard Pritchard* <hpprit...@gmail.com>
> Date: Thursday, April 14, 2016
> Subject: psm2 and psm2_ep_open problems
> To: Open MPI Developers <de...@open-mpi.org>
>
>
> Hi Matias
>
> Actually I triaged this further.  The Open MPI PMI subsystem is actually
> doing things correctly with respect to environment variable setting, with
> or without mpirun.  The problem has to do with PSM2 and the fact that on
> my cluster right now SLURM has only scheduled about 25 jobs.  As a result,
> the unique key that the PSM2 MTL feeds to PSM2 has lots of zeros in the
> initial part of the key.  This ends up messing up the epid generated in
> PSM2.  The OFI MTL doesn't have this problem because the PSM2 provider has
> some of these LSBs set in the value it passes to PSM2.
>
> I will open a PR to "fix" the PSM2 MTL to handle this feature of PSM2.
>
> Howard
>
> On Thursday, April 14, 2016, Cabral, Matias A wrote:
>
>> Hi Howard,
>>
>>
>>
>> I suspect this is the known issue with using SLURM with OMPI and PSM
>> that is discussed here:
>>
>> https://www.open-mpi.org/community/lists/users/2010/12/15220.php
>>
>>
>>
>> As of today, ORTE generates the psm_key, so when launching with SLURM this
>> does not happen and it is necessary to set it in the environment.  Here
>> Ralph explains the workaround:
>>
>> https://www.open-mpi.org/community/lists/users/2010/12/15242.php
>>
>>
>>
>> As you found, an epid of 0 is not a valid value. So, basing my comments on:
>>
>> https://github.com/01org/opa-psm2/blob/master/psm_ep.c
>>
>>
>>
>> and the assert at line 832, psmi_ep_open_device() will do:
>>
>>
>>
>>     /*
>>      * We use a LID of 0 for non-HFI communication.
>>      * Since a jobkey is not available from IPS, pull the
>>      * first 16 bits from the UUID.
>>      */
>>     *epid = PSMI_EPID_PACK(((uint16_t *) unique_job_key)[0],
>>                            (rank >> 3), rank, 0,
>>                            PSMI_HFI_TYPE_DEFAULT, rank);
>>
>>
>> In the particular case you mention below, when there is no HFI (shared
>> memory only), the rank is 0, and the passed key is 0, the epid will be 0.
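
The zero-epid case described above can be illustrated with a small sketch. Note the assumptions: toy_epid_pack() is a hypothetical stand-in and does NOT reproduce the real PSMI_EPID_PACK bit layout (it also omits the HFI unit and type arguments); first16() and toy_shm_epid() are likewise names invented for this illustration. The only point is that when the first 16 bits of the key and the rank are both zero, every packed field is zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PSMI_EPID_PACK: NOT the real PSM2 bit layout.
 * It only combines a 16-bit LID field with two rank-derived fields. */
static uint64_t toy_epid_pack(uint16_t lid, uint32_t context,
                              uint32_t subcontext)
{
    return ((uint64_t)lid << 16) |
           ((uint64_t)(context & 0xff) << 8) |
           (uint64_t)(subcontext & 0xff);
}

/* Read the first 16 bits of a 16-byte job key, as the quoted code does
 * with ((uint16_t *) unique_job_key)[0]; memcpy avoids aliasing issues. */
static uint16_t first16(const uint8_t key[16])
{
    uint16_t v;
    memcpy(&v, key, sizeof v);
    return v;
}

/* Toy epid for the shared-memory-only path (assumed shape of the call). */
static uint64_t toy_shm_epid(const uint8_t key[16], uint32_t rank)
{
    return toy_epid_pack(first16(key), rank >> 3, rank);
}
```

With an all-zero key, toy_shm_epid(key, 0) is 0, while setting any bit in the first two bytes of the key, or using a nonzero rank, gives a nonzero result in this toy model.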
>>
>>
>>
>> SOLUTION: set OMPI_MCA_orte_precondition_transports in the environment
>> to a value other than 0.
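
As a rough sanity check for this workaround, a launcher wrapper or test program could verify that the variable is set to something nonzero before starting the job. The sketch below assumes the value consists of two 64-bit hexadecimal halves separated by a dash; that format is an assumption here and should be checked against the Open MPI version in use. The function names transport_key_is_nonzero() and env_transport_key_is_nonzero() are made up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if s parses as "<hex64>-<hex64>" (assumed format) and the
 * two halves are not both zero; returns 0 otherwise. */
static int transport_key_is_nonzero(const char *s)
{
    uint64_t hi = 0, lo = 0;

    if (s == NULL)
        return 0;
    if (sscanf(s, "%" SCNx64 "-%" SCNx64, &hi, &lo) != 2)
        return 0;
    return (hi | lo) != 0;
}

/* Convenience wrapper that checks the variable named in this thread. */
static int env_transport_key_is_nonzero(void)
{
    return transport_key_is_nonzero(
        getenv("OMPI_MCA_orte_precondition_transports"));
}
```

A caller could refuse to launch, or generate a fresh key, when env_transport_key_is_nonzero() returns 0.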
>>
>>
>>
>> Thanks,
>>
>>
>>
>> _MAC
>>
>>
>>
>> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
>> Pritchard
>> *Sent:* Thursday, April 14, 2016 1:10 PM
>> *To:* Open MPI Developers List <de...@open-mpi.org>
>> *Subject:* [OMPI devel] psm2 and psm2_ep_open problems
>>
>>
>>
>> Hi Folks,
>>
>>
>>
>> So we have this brand-new Omni-Path cluster here at work,
>> but people are having problems using it on a single node with
>> srun as the job launcher.
>>
>>
>>
>> The customer wants to use srun to launch jobs, not the Open MPI
>> mpirun.
>>
>>
>>
>> The customer installed 1.10.1, but I can reproduce the
>> problem with v2.x and I'm sure with master, unless I build the
>> ofi mtl.  ofi mtl works, psm2 mtl doesn't.
>>
>>
>>
>> I downloaded the psm2 code from github and started hacking.
>>
>>
>>
>> What appears to be the problem is that when running on a single
>> node one can go through a path in psmi_ep_open_device where,
>> for a single-process job, the value stored into epid is zero.
>>
>>
>>
>> This results in an assert failing in the __psm2_ep_open_internal
>> function.
>>
>>
>>
>> Is there a quick and dirty workaround that doesn't involve fixing the
>> psm2 MTL?  I could suggest to the sysadmins that they install libfabric 1.3
>> and build Open MPI with only the ofi MTL, but perhaps there's
>> another way to get the psm2 MTL to work for single-node jobs?  I'd prefer
>> not to ask users to disable the psm2 MTL explicitly for their single-node
>> jobs.
>>
>>
>>
>> Thanks for suggestions.
>>
>>
>>
>> Howard
>>
>>
>>
>>
>>
>>
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18773.php
>
