Hi Howard, I’ve been playing with the same version of psm (hfi1-psm-0.7-221.ch6.x86_64) but cannot yet reproduce the issue. Just in case, please share the version of the driver you have installed (hfi1-X.XX-XX.x86_64.rpm, modinfo hfi1).
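A minimal sketch for collecting the driver version requested above, assuming `modinfo` is on the PATH and that the hfi1 module reports a `version` field (not every module does). The helper names `parse_modinfo` and `hfi1_driver_version` are hypothetical, not part of any PSM2 or Open MPI tooling.

```python
import subprocess


def parse_modinfo(text):
    """Turn modinfo-style 'key: value' lines into a dict (first value per key wins)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def hfi1_driver_version(module="hfi1"):
    """Return the 'version' field reported by modinfo, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["modinfo", module], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # modinfo is missing, or the module is not installed on this host.
        return None
    return parse_modinfo(result.stdout).get("version")


if __name__ == "__main__":
    print("hfi1 driver version:", hfi1_driver_version())
```

The installed rpm name can be gathered the same way by running `rpm -qa` and filtering for `hfi1`.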
What I can tell so far is that I still suspect this is related to the job_id that OMPI uses to generate the unique job key, which psm in turn uses to generate the epid. Looking at logfile.busted, I see some entries for ‘epid 10000’. This can only happen if psm2_ep_open() is called with a unique job key of 1 and with the PSM2 hfi device disabled (only shm communication expected). In your workaround (hfi enabled), the epid generation goes through a different path that includes the HFI LID, which ends up with a different number. HOWEVER, I hardcoded the above case (to get epid 10000), but I still see hello_c running with stock OMPI 1.10.2. Would you please try forcing a different jobid and share the results?

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Wednesday, April 20, 2016 8:49 AM
To: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] PSM2 Intel folks question

Hi Matias,

Actually I found the problem. I kept wondering why the OFI MTL works fine, but the PSM2 MTL doesn't. When I cranked up the debugging level I noticed that the OFI MTL doesn't mess with the PSM2_DEVICES env variable, so PSM2 tries all three "devices" as part of initialization. However, the PSM2 MTL sets PSM2_DEVICES to not include hfi. If I comment out those lines of code in the PSM2 MTL, my one-node problem vanishes. I suspect there's some setup code when "initializing" the hfi device that is actually required even when using the shm device for on-node messages.

Is there by any chance some psm2 device driver parameter setting that might result in this behavior?

Anyway, I set PSM2_TRACEMASK to 0xFFFF and got a bunch of output that might be helpful. I attached the log files to issue 1559.

For now, I will open a PR with fixes to get the PSM2 MTL working on our omnipath clusters.

I don't think this problem has anything to do with SLURM except for the jobid manipulation to generate the unique key.
Howard

2016-04-19 17:18 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

Howard,

Regarding PSM2_DEVICES, I went back to the roots and found that shm is the only device supporting communication between ranks on the same node. Therefore, the error below (“Endpoint could not be reached”) would be expected.

Back to psm2_ep_connect() hanging: I cloned the same psm2 as you have from github and have hello_c and ring_c running with 80 ranks on a local node using the PSM2 MTL. I do not have any SLURM setup on my system. I will proceed to set up SLURM to see if I can reproduce the issue with it. In the meantime, please share any extra detail you find relevant.

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 12:21 PM
To: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] PSM2 Intel folks question

Hi Matias,

My usual favorites in ompi/examples/hello_c.c and ompi/examples/ring_c.c. If I disable the shared memory device using the PSM2_DEVICES option it looks like psm2 is unhappy:

[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08222] kit001
[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08222] kit001
psm2_ep_connect returned 41
[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08221] kit001
[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08221] kit001
leaving ompi_mtl_psm2_add_procs nprocs 2

I went back and tried again with the OFI MTL (without PSM2_DEVICES set) and that works correctly on a single node. I get this same psm2_ep_connect timeout using mpirun, so it's not a SLURM-specific problem.
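To make the PSM2_DEVICES experiments above easier to repeat, here is a rough sketch in Python. It assumes the three device names are `self`, `shm` and `hfi`, and takes from the thread that `shm` is the only device carrying same-node traffic. The helper names are hypothetical, and the `mpirun` arguments shown are just one plausible way to select the PSM2 MTL.

```python
import os

# Per the thread: shm is the only device that connects ranks on the same node.
LOCAL_DEVICE = "shm"


def parse_devices(value):
    """Split a PSM2_DEVICES string such as 'self,shm,hfi' into a list of names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def can_reach_local_peers(value):
    """True if the device list still allows same-node communication."""
    return LOCAL_DEVICE in parse_devices(value)


def launch_env(devices, base=None):
    """Copy an environment and set PSM2_DEVICES for the child process."""
    env = dict(os.environ if base is None else base)
    env["PSM2_DEVICES"] = devices
    return env


def mpirun_command(nprocs, program):
    """Build an mpirun command line that requests the PSM2 MTL (assumed flags)."""
    return ["mpirun", "-np", str(nprocs),
            "--mca", "pml", "cm", "--mca", "mtl", "psm2", program]


if __name__ == "__main__":
    for devices in ("self,shm,hfi", "self,shm", "self,hfi"):
        print(devices, "-> local peers reachable:", can_reach_local_peers(devices))
```

A run would then look like `subprocess.run(mpirun_command(2, "./hello_c"), env=launch_env("self,shm,hfi"))`, changing only the device string between attempts.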
2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

Hi Howard,

A couple more questions to understand the context a little better:
- What type of job is running?
- Is this also under srun?

For PSM2 you may find more details in the programmer’s guide:
http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf

To disable shared memory: Section 2.7.1: PSM2_DEVICES="self,hfi"

Thanks,
_MAC

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, April 19, 2016 11:04 AM
To: Open MPI Developers List <de...@open-mpi.org>
Subject: [OMPI devel] PSM2 Intel folks question

Hi Folks,

I'm making progress with issue #1559 (patches on the mail list didn't help), and I'll open a PR to help the PSM2 MTL work on a single node, but I'm noticing something more troublesome.

If I run on just one node, and I use more than one process, process zero consistently hangs in psm2_ep_connect. I've tried using the psm2 code on github (at sha e951cf31), but I still see the same behavior.

The PSM2-related rpms installed on our system are:

infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64

Should we get newer rpms installed?

Is there a way to disable the AMSHM path? I'm wondering if that would help since multi-node jobs seem to run fine.
Thanks for any help,

Howard

_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2016/04/18783.php