Hi Matias,

Actually, I found the problem. I kept wondering why the OFI MTL works fine while the PSM2 MTL doesn't. When I cranked up the debugging level, I noticed that the OFI MTL doesn't touch the PSM2_DEVICES environment variable, so PSM2 tries all three "devices" as part of initialization. The PSM2 MTL, however, sets PSM2_DEVICES so that hfi is excluded. If I comment out those lines of code in the PSM2 MTL, my one-node problem vanishes.
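For reference, the logic I commented out looks roughly like the sketch below (paraphrased from memory, so the function name, guard, and device list are approximations rather than the literal component source):

    /* Approximate sketch of the PSM2 MTL initialization behavior I
     * commented out; the real ompi_mtl_psm2 source may use different
     * names and conditions. */
    #include <stdlib.h>

    void restrict_psm2_devices_if_single_node(int num_local_ranks,
                                              int num_total_ranks)
    {
        /* If every rank in the job lives on this node, limit PSM2 to the
         * self and shm devices, leaving hfi out.  Skipping this setenv()
         * so that PSM2 initializes all three devices is what makes my
         * one-node hang in psm2_ep_connect go away. */
        if (num_local_ranks == num_total_ranks) {
            setenv("PSM2_DEVICES", "self,shm", 0);
        }
    }

(The 0 passed to setenv() just means an explicit PSM2_DEVICES in the user's environment would still win over the MTL's default.)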
I suspect there is some setup code that runs when "initializing" the hfi device and that is actually required even when only the shm device is used for on-node messages. Is there by any chance some psm2 device-driver parameter setting that might cause this behavior?

Anyway, I set PSM2_TRACEMASK to 0xFFFF and got a bunch of output that might be helpful; I attached the log files to issue #1559. For now, I will open a PR with fixes to get the PSM2 MTL working on our Omni-Path clusters. I don't think this problem has anything to do with SLURM, except for the jobid manipulation used to generate the unique key.

Howard

2016-04-19 17:18 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Howard,
>
> Regarding PSM2_DEVICES: I went back to the roots and found that shm is the
> only device supporting communication between ranks on the same node.
> Therefore, the "Endpoint could not be reached" error below would be expected.
>
> Back to the psm2_ep_connect() hang: I cloned the same psm2 as you have from
> github and have hello_c and ring_c running with 80 ranks on a local node
> using the PSM2 MTL. I do not have any SLURM setup on my system. I will
> proceed to set up SLURM to see if I can reproduce the issue with it. In the
> meantime, please share any extra detail you find relevant.
>
> Thanks,
>
> _MAC
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Tuesday, April 19, 2016 12:21 PM
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] PSM2 Intel folks question
>
> Hi Matias,
>
> My usual favorites: ompi/examples/hello_c.c and ompi/examples/ring_c.c.
> If I disable the shared memory device using the PSM2_DEVICES option,
> it looks like psm2 is unhappy:
>
> [kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
> [kit001.localdomain:08222] kit001
> [kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
> [kit001.localdomain:08222] kit001
> psm2_ep_connect returned 41
> [kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
> [kit001.localdomain:08221] kit001
> [kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
> [kit001.localdomain:08221] kit001
> leaving ompi_mtl_psm2_add_procs nprocs 2
>
> I went back and tried again with the OFI MTL (without PSM2_DEVICES set),
> and that works correctly on a single node.
> I get this same psm2_ep_connect timeout using mpirun, so it's not a
> SLURM-specific problem.
>
> 2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
>
> Hi Howard,
>
> A couple more questions to understand the context a little better:
> - What type of job are you running?
> - Is this also under srun?
>
> For PSM2 you may find more details in the programmer's guide:
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
> To disable shared memory (Section 2.7.1):
> PSM2_DEVICES="self,hfi"
>
> Thanks,
> _MAC
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Tuesday, April 19, 2016 11:04 AM
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: [OMPI devel] PSM2 Intel folks question
>
> Hi Folks,
>
> I'm making progress with issue #1559 (the patches on the mailing list
> didn't help), and I'll open a PR to help the PSM2 MTL work on a single
> node, but I'm noticing something more troublesome.
>
> If I run on just one node and use more than one process, process zero
> consistently hangs in psm2_ep_connect.
>
> I've tried using the psm2 code on github, at sha e951cf31, but I still see
> the same behavior.
>
> The PSM2-related rpms installed on our system are:
>
> infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
> hfi1-psm-0.7-221.ch6.x86_64
> hfi1-psm-devel-0.7-221.ch6.x86_64
> infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> Should we get newer rpms installed?
>
> Is there a way to disable the AMSHM path? I'm wondering if that
> would help, since multi-node jobs seem to run fine.
>
> Thanks for any help,
>
> Howard