Hi Matias,

My usual favorites: ompi/examples/hello_c.c and ompi/examples/ring_c.c.

If I disable the shared memory device using the PSM2_DEVICES option, it looks like psm2 is unhappy:

[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08222]  kit001
[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08222]  kit001
psm2_ep_connect returned 41
[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08221]  kit001
[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08221]  kit001
leaving ompi_mtl_psm2_add_procs nprocs 2

I went back and tried again with the OFI MTL (without PSM2_DEVICES set), and that works correctly on a single node.

I get this same psm2_ep_connect timeout using mpirun, so it's not a SLURM-specific problem.

2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Howard,
>
> Couple more questions to understand a little better the context:
>
> - What type of job running?
> - Is this also under srun?
>
> For PSM2 you may find more details in the programmer’s guide:
>
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
> To disable shared memory:
>
> Section 2.7.1:
>
> PSM2_DEVICES="self,fi"
>
> Thanks,
>
> _MAC
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard Pritchard
> *Sent:* Tuesday, April 19, 2016 11:04 AM
> *To:* Open MPI Developers List <de...@open-mpi.org>
> *Subject:* [OMPI devel] PSM2 Intel folks question
>
> Hi Folks,
>
> I'm making progress with issue #1559 (patches on the mail list didn't help),
> and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
> noticing something more troublesome.
>
> If I run on just one node, and I use more than one process, process zero
> consistently hangs in psm2_ep_connect.
>
> I've tried using the psm2 code on github - at sha e951cf31, but I still see
> the same behavior.
>
> The PSM2 related rpms installed on our system are:
>
> infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
> hfi1-psm-0.7-221.ch6.x86_64
> hfi1-psm-devel-0.7-221.ch6.x86_64
> infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> Should we get newer rpms installed?
>
> Is there a way to disable the AMSHM path? I'm wondering if that
> would help since multi-node jobs seem to run fine.
>
> Thanks for any help,
>
> Howard
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
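
When experimenting with PSM2_DEVICES as discussed above, it can help to confirm that the setting actually reaches each rank's environment. The following is only a minimal sketch of such a check: the helper name devices_include is hypothetical (it is not part of PSM2 or Open MPI), and the device names passed to it (for example "shm" for the shared memory device) are assumptions that should be verified against the PSM2 programmer's guide.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PSM2 or Open MPI API): return 1 if `name`
 * appears as a complete entry in the comma-separated device list `list`,
 * and 0 otherwise. A NULL list (variable not set) yields 0.
 */
static int devices_include(const char *list, const char *name)
{
    size_t n = strlen(name);

    if (list == NULL || n == 0)
        return 0;

    while (*list != '\0') {
        /* Length of the current entry, up to the next comma or the end. */
        size_t len = strcspn(list, ",");

        if (len == n && strncmp(list, name, n) == 0)
            return 1;

        list += len;
        if (*list == ',')
            list++; /* skip the separator */
    }
    return 0;
}
```

A test program such as hello_c.c could call devices_include(getenv("PSM2_DEVICES"), "shm") (with <stdlib.h> included) and print the result per rank, so that a missing or unexpected value shows up before psm2_ep_connect is reached.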