Howard,

Regarding PSM2_DEVICES: I went back to the roots and found that shm is the only device supporting communication between ranks on the same node. Therefore, the error below, “Endpoint could not be reached”, would be expected.

Back to the psm2_ep_connect() hang: I cloned the same psm2 as you have from GitHub and have hello_c and ring_c running with 80 ranks on a local node using the PSM2 MTL. I do not have any SLURM setup on my system. I will proceed to set up SLURM to see if I can reproduce the issue with it. In the meantime, please share any extra details you find relevant.

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 12:21 PM
To: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] PSM2 Intel folks question

Hi Matias,

My usual favorites in ompi/examples/hello_c.c and ompi/examples/ring_c.c.

If I disable the shared memory device using the PSM2_DEVICES option it looks like psm2 is unhappy:

[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08222] kit001
[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08222] kit001
psm2_ep_connect returned 41
[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08221] kit001
[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08221] kit001
leaving ompi_mtl_psm2_add_procs nprocs 2

I went back and tried again with the OFI MTL (without PSM2_DEVICES set) and that works correctly on a single node.

I get this same psm2_ep_connect timeout using mpirun, so it's not a SLURM-specific problem.

2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

Hi Howard,

A couple more questions to better understand the context:
- What type of job is running?
- Is this also under srun?

For PSM2 you may find more details in the programmer’s guide:
http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf

To disable shared memory:
Section 2.7.1: PSM2_DEVICES="self,fi"

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 11:04 AM
To: Open MPI Developers List <de...@open-mpi.org>
Subject: [OMPI devel] PSM2 Intel folks question

Hi Folks,

I'm making progress with issue #1559 (patches on the mail list didn't help), and I'll open a PR to help the PSM2 MTL work on a single node, but I'm noticing something more troublesome.

If I run on just one node, and I use more than one process, process zero consistently hangs in psm2_ep_connect.

I've tried using the psm2 code on github - at sha e951cf31, but I still see the same behavior.

The PSM2 related rpms installed on our system are:

infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64

Should we get newer rpms installed?

Is there a way to disable the AMSHM path? I'm wondering if that would help since multi-node jobs seem to run fine.

Thanks for any help,

Howard

_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
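
For anyone trying to reproduce the single-node runs discussed in this thread, here is a minimal sketch of how the launch could be assembled. The helper name build_launch and its parameters are hypothetical, not part of Open MPI or PSM2. It assumes Open MPI's mpirun with the PSM2 MTL selected through the cm PML, and uses -x to forward PSM2_DEVICES to the ranks. The device string in the usage lines is copied verbatim from the message above and should be checked against Section 2.7.1 of the programmer's guide before relying on it.

```python
import os
import shlex


def build_launch(executable, nprocs, psm2_devices=None, base_env=None):
    """Return (argv, env) for a single-node Open MPI run over the PSM2 MTL.

    Hypothetical helper for illustration only; nothing is executed here.
    """
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)

    # Select the PSM2 MTL; MTL components are driven by the cm PML.
    argv = ["mpirun", "-np", str(nprocs),
            "--mca", "pml", "cm",
            "--mca", "mtl", "psm2"]

    if psm2_devices is not None:
        # Assumption: the value is whatever the programmer's guide
        # documents for PSM2_DEVICES; it is passed through unchanged.
        env["PSM2_DEVICES"] = psm2_devices
        # -x exports the variable from mpirun's environment to the ranks.
        argv += ["-x", "PSM2_DEVICES"]

    argv.append(executable)
    return argv, env


if __name__ == "__main__":
    # Device string as quoted in the thread; verify before use.
    argv, env = build_launch("./hello_c", 2, psm2_devices="self,fi")
    print("PSM2_DEVICES=" + shlex.quote(env["PSM2_DEVICES"]), shlex.join(argv))
```

The returned pair can be handed to subprocess.run(argv, env=env) on a machine with Open MPI and PSM2 installed. Leaving psm2_devices as None gives the default run, with shared memory still enabled, for comparison.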