Howard,

Regarding PSM2_DEVICES: I went back to the roots and found that shm is the only 
device supporting communication between ranks on the same node. Therefore, the 
error below, “Endpoint could not be reached”, is expected when shm is disabled.
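
To make the point about shm concrete, here is a minimal C sketch that checks 
whether a PSM2_DEVICES value still lists shm. It is only an illustration, not 
PSM2 code: the helper names are hypothetical, the comma-separated format is 
inferred from the example in this thread, and the default of "self,shm,hfi" 
when the variable is unset is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` appears as a whole entry in the
 * comma-separated `list`, 0 otherwise. */
int device_listed(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);

    size_t nlen = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* Compare whole entries only, so "shmx" does not match "shm". */
        if (len == nlen && strncmp(p, name, nlen) == 0)
            return 1;
        if (end == NULL)
            break;
        p = end + 1;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if ranks on the same node could connect
 * with this PSM2_DEVICES value, i.e. if shm is listed. A NULL argument
 * stands for "variable not set"; the default used here is an assumption. */
int psm2_devices_allow_same_node(const char *devices)
{
    if (devices == NULL)
        devices = "self,shm,hfi"; /* assumed default, not taken from this thread */
    return device_listed(devices, "shm");
}
```

Calling psm2_devices_allow_same_node(getenv("PSM2_DEVICES")) before launching 
a single-node job would flag a setting that is bound to produce the connect 
error above.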

Back to the psm2_ep_connect() hang: I cloned the same psm2 that you have from 
GitHub and have hello_c and ring_c running with 80 ranks on a local node using 
the PSM2 MTL. I do not have SLURM set up on my system. I will set up SLURM to 
see if I can reproduce the issue with it. In the meantime, please share any 
extra details you find relevant.

Thanks,

_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 12:21 PM
To: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] PSM2 Intel folks question

Hi Matias,

I'm running my usual favorites, ompi/examples/hello_c.c and ompi/examples/ring_c.c.
If I disable the shared memory device using the PSM2_DEVICES option
it looks like psm2 is unhappy:


[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08222]  kit001
[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08222]  kit001
 psm2_ep_connect returned 41
[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08221]  kit001
[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08221]  kit001
leaving ompi_mtl_psm2_add_procs nprocs 2

I went back and tried again with the OFI MTL (without the PSM2_DEVICES set)
and that works correctly on a single node.
I get this same psm2_ep_connect timeout using mpirun, so it's not a
SLURM-specific problem.

2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
Hi Howard,

A couple more questions to better understand the context:

- What type of job are you running?

- Is this also under srun?

For PSM2 you may find more details in the programmer’s guide:
http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf

To disable shared memory:
Section 2.7.1:
PSM2_DEVICES="self,hfi"

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 11:04 AM
To: Open MPI Developers List <de...@open-mpi.org>
Subject: [OMPI devel] PSM2 Intel folks question

Hi Folks,

I'm making progress with issue #1559 (patches on the mailing list didn't help),
and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
noticing something more troublesome.

If I run on just one node, and I use more than one process, process zero
consistently hangs in psm2_ep_connect.

I've tried using the psm2 code on github - at sha e951cf31, but I still see
the same behavior.

The PSM2 related rpms installed on our system are:

infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64
Should we get newer rpms installed?

Is there a way to disable the AMSHM path?  I'm wondering if that
would help, since multi-node jobs seem to run fine.

Thanks for any help,

Howard


_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
