Hi Howard,

I’ve been playing with the same version of psm (hfi1-psm-0.7-221.ch6.x86_64) 
but cannot yet reproduce the issue.  Just in case, please share the version of 
the driver you have installed (hfi1-X.XX-XX.x86_64.rpm, modinfo hfi1).

What I can tell so far is that I still suspect this is related to the job_id 
that OMPI uses to generate the unique job key, which psm in turn uses to 
generate the epid. Looking at logfile.busted, I see some entries for 
‘epid 10000’. This can only happen if psm2_ep_open() is called with a unique 
job key of 1 while the PSM2 hfi device is disabled (only shm communication 
expected). In your workaround (hfi enabled), epid generation goes through a 
different path that includes the HFI LID, so it ends with a different number.  
HOWEVER, I hardcoded the above case (to get epid 10000) and I still see 
hello_c running with stock OMPI 1.10.2.

Would you please try forcing a different jobid and share the results?
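One way to do that, sketched below, is to seed the job key explicitly at launch. 
The MCA parameter name `orte_precondition_transports` and its format (two hex 
words joined by a dash) are my assumption from OMPI 1.10-era behavior; please 
verify with `ompi_info` before relying on it.

```shell
# Hypothetical sketch: explicitly seed the unique job key that PSM2
# derives the epid from. Run twice with different values to force
# different jobid-derived keys.
mpirun -np 2 \
  -mca orte_precondition_transports 0123456789abcdef-fedcba9876543210 \
  ./hello_c
```

Comparing the resulting epids between two runs should show whether the key 
actually feeds into the hang.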

Thanks,

_MAC


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Wednesday, April 20, 2016 8:49 AM
To: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] PSM2 Intel folks question

Hi Matias,

Actually, I found the problem.  I kept wondering why the OFI MTL works fine 
but the PSM2 MTL doesn't.  When I cranked up the debugging level, I noticed 
that the OFI MTL doesn't touch the PSM2_DEVICES env variable, so PSM2 tries 
all three "devices" as part of initialization.  The PSM2 MTL, however, sets 
PSM2_DEVICES to exclude hfi.  If I comment out those lines of code in the 
PSM2 MTL, my one-node problem vanishes.

I suspect there's some setup code when "initializing" the hfi device that is 
actually
required even when using the shm device for on-node messages.

Is there by any chance some psm2 device driver parameter setting that might
result in this behavior?

Anyway, I set PSM2_TRACEMASK to 0xFFFF and got a bunch of output that
might be helpful.  I attached the log files to issue 1559.
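For the record, the traces were gathered along these lines (a sketch; the log 
file name is illustrative):

```shell
# Maximum PSM2 tracing; capture stderr, where PSM2 writes its trace.
PSM2_TRACEMASK=0xFFFF mpirun -np 2 ./hello_c 2> psm2_trace.log
```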

For now, I will open a PR with fixes to get the PSM2 MTL working on our
Omni-Path clusters.

I don't think this problem has anything to do with SLURM except for the jobid
manipulation to generate the unique key.

Howard


2016-04-19 17:18 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
Howard,

Regarding PSM2_DEVICES: I went back to the roots and found that shm is the only 
device supporting communication between ranks on the same node. Therefore, the 
error below ("Endpoint could not be reached") would be expected.

Back to the psm2_ep_connect() hang: I cloned the same psm2 as you have from 
github and have hello_c and ring_c running with 80 ranks on a single node using 
the PSM2 MTL. I do not have any SLURM setup on my system.  I will proceed to 
set up SLURM to see whether I can reproduce the issue with it. In the meantime, 
please share any extra details you find relevant.

Thanks,

_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 12:21 PM
To: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] PSM2 Intel folks question

Hi Matias,

My usual favorites are ompi/examples/hello_c.c and ompi/examples/ring_c.c.
If I disable the shared memory device using the PSM2_DEVICES option,
it looks like psm2 is unhappy:


[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08222]  kit001
[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08222]  kit001
 psm2_ep_connect returned 41
[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08221]  kit001
[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08221]  kit001
leaving ompi_mtl_psm2_add_procs nprocs 2

I went back and tried again with the OFI MTL (without PSM2_DEVICES set),
and that works correctly on a single node.
I get the same psm2_ep_connect timeout using mpirun, so it's not a SLURM-
specific problem.

2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
Hi Howard,

A couple more questions to understand the context a little better:

- What type of job are you running?

- Is this also under srun?

For PSM2 you may find more details in the programmer’s guide:
http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf

To disable shared memory (Section 2.7.1):
PSM2_DEVICES="self,hfi"

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 11:04 AM
To: Open MPI Developers List <de...@open-mpi.org>
Subject: [OMPI devel] PSM2 Intel folks question

Hi Folks,

I'm making progress with issue #1559 (patches on the mail list didn't help),
and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
noticing something more troublesome.

If I run on just one node, and I use more than one process, process zero
consistently hangs in psm2_ep_connect.
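For anyone reproducing this, a backtrace of the hung rank shows where it is 
stuck (a generic debugging sketch, not from the original report; substitute the 
real pid):

```shell
# Attach to the stuck process and dump all thread backtraces.
gdb -p <pid> -batch -ex 'thread apply all bt'
```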

I've tried using the psm2 code on github (at SHA e951cf31), but I still see
the same behavior.

The PSM2-related rpms installed on our system are:

infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64

Should we get newer rpms installed?

Is there a way to disable the AMSHM path?  I'm wondering if that
would help, since multi-node jobs seem to run fine.

Thanks for any help,

Howard


_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/04/18783.php


_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/04/18787.php
