Hi Matias,

I found the problem.  I kept wondering why the OFI MTL works fine but the
PSM2 MTL doesn't.  When I cranked up the debugging level, I noticed that
the OFI MTL doesn't touch the PSM2_DEVICES environment variable, so PSM2
tries all three "devices" as part of initialization.  The PSM2 MTL,
however, sets PSM2_DEVICES so that it does not include hfi.  If I comment
out those lines of code in the PSM2 MTL, my one-node problem vanishes.
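
For anyone who wants to compare the cases side by side, here is a minimal
sketch of the environments involved.  It is a Python helper for a repro
script, not the MTL code: the name psm2_env is made up, and "self,shm" as
the value the PSM2 MTL writes is an assumption (all I can say for sure is
that hfi is left out).

```python
import os

# Device lists under discussion. "self,shm" is an ASSUMPTION about what the
# PSM2 MTL sets on a single node; the only known fact is that hfi is absent.
ALL_DEVICES = "self,shm,hfi"   # all three devices listed explicitly
WITHOUT_HFI = "self,shm"       # assumed PSM2 MTL single-node value
WITHOUT_SHM = "self,hfi"       # shared memory disabled


def psm2_env(devices=None, base=None):
    """Return a copy of the environment with PSM2_DEVICES set or removed.

    devices=None removes the variable so PSM2 falls back to its own
    default, which is the situation with the OFI MTL.
    """
    env = dict(os.environ if base is None else base)
    if devices is None:
        env.pop("PSM2_DEVICES", None)
    else:
        env["PSM2_DEVICES"] = devices
    return env


if __name__ == "__main__":
    cases = [("untouched (OFI MTL case)", None),
             ("all three devices", ALL_DEVICES),
             ("no hfi (PSM2 MTL case)", WITHOUT_HFI),
             ("no shm", WITHOUT_SHM)]
    for label, devices in cases:
        env = psm2_env(devices, base={})
        print(label, "->", env.get("PSM2_DEVICES", "<unset>"))
```

The returned dictionary can be passed as env= to subprocess.run when
launching the test program, so each case runs with a known setting.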

I suspect there's some setup code when "initializing" the hfi device that
is actually required even when using the shm device for on-node messages.

Is there by any chance some psm2 device driver parameter setting that
might result in this behavior?

Anyway, I set PSM2_TRACEMASK to 0xFFFF and got a bunch of output that
might be helpful.  I attached the log files to issue 1559.
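
In case it helps to reproduce the trace output, this is roughly how such
a run can be put together.  It is only a sketch: traced_mpirun is a
made-up helper, the rank count and ./hello_c are placeholders, and it
relies on Open MPI's mpirun -x option to export the variable to the ranks.

```python
import shlex


def traced_mpirun(ranks, program, tracemask="0xFFFF"):
    """Build an mpirun argv that exports PSM2_TRACEMASK to every rank."""
    return ["mpirun", "-np", str(ranks),
            "-x", f"PSM2_TRACEMASK={tracemask}",
            program]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch also works
    # on a machine without Open MPI installed.
    print(shlex.join(traced_mpirun(2, "./hello_c")))
```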

For now, I will open a PR with fixes to get the PSM2 MTL working on our
Omni-Path clusters.

I don't think this problem has anything to do with SLURM except for the
jobid manipulation to generate the unique key.
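
To illustrate what I mean by the key: PSM2 endpoints are opened with a
16-byte job key, and every rank in the job has to arrive at the same
value.  The sketch below shows one way a key could be derived from a
jobid.  The byte layout and the name job_key are hypothetical, not what
the MTL actually does.

```python
import struct


def job_key(jobid):
    """Pack a job id into a 16-byte key (HYPOTHETICAL layout).

    The first 8 bytes hold the job id as a little-endian unsigned
    integer; the remaining 8 bytes are zero padding.
    """
    return struct.pack("<Q8x", jobid)


if __name__ == "__main__":
    key = job_key(1559)
    print(len(key), key.hex())
```

The point is only that the derivation is deterministic: two ranks that
see the same jobid get the same key, and different jobs get different
keys.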

Howard


2016-04-19 17:18 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Howard,
>
>
>
> Regarding PSM2_DEVICES: I went back to the roots and found that shm is
> the only device supporting communication between ranks on the same node.
> Therefore, the “Endpoint could not be reached” error below would be
> expected.
>
>
>
> Back to psm2_ep_connect() hanging: I cloned the same psm2 as you have
> from GitHub and have hello_c and ring_c running with 80 ranks on a local
> node using the PSM2 MTL. I do not have SLURM set up on my system, so I
> will set it up to see whether I can reproduce the issue with it. In the
> meantime, please share any extra details you find relevant.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 12:21 PM
> *To:* Open MPI Developers <de...@open-mpi.org>
> *Subject:* Re: [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Matias,
>
>
>
> My usual favorites in ompi/examples/hello_c.c and ompi/examples/ring_c.c.
>
> If I disable the shared memory device using the PSM2_DEVICES option
>
> it looks like psm2 is unhappy:
>
>
>
>
>
> [kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be
> reached):
>
> [kit001.localdomain:08222]  kit001
>
> [kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
>
> [kit001.localdomain:08222]  kit001
>
>  psm2_ep_connect returned 41
>
> [kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
>
> [kit001.localdomain:08221]  kit001
>
> [kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be
> reached):
>
> [kit001.localdomain:08221]  kit001
>
> leaving ompi_mtl_psm2_add_procs nprocs 2
>
>
>
> I went back and tried again with the OFI MTL (without the PSM2_DEVICES set)
> and that works correctly on a single node.
>
> I get this same psm2_ep_connect timeout using mpirun, so it's not a
> SLURM-specific problem.
>
>
>
> 2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
>
> Hi Howard,
>
>
>
> A couple more questions to better understand the context:
>
> - What type of job is running?
>
> - Is this also under srun?
>
>
>
> For PSM2 you may find more details in the programmer’s guide:
>
>
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
>
>
> To disable shared memory:
>
> Section 2.7.1:
>
> PSM2_DEVICES="self,hfi"
>
>
>
> Thanks,
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 11:04 AM
> *To:* Open MPI Developers List <de...@open-mpi.org>
> *Subject:* [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Folks,
>
>
>
> I'm making progress with issue #1559 (patches on the mail list didn't
> help),
>
> and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
>
> noticing something more troublesome.
>
>
>
> If I run on just one node, and I use more than one process, process zero
>
> consistently hangs in psm2_ep_connect.
>
>
>
> I've tried using the psm2 code on github - at sha e951cf31, but I still see
>
> the same behavior.
>
>
>
> The PSM2 related rpms installed on our system are:
>
>
>
> infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> hfi1-psm-0.7-221.ch6.x86_64
>
> hfi1-psm-devel-0.7-221.ch6.x86_64
>
> infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> Should we get newer rpms installed?
>
>
>
> Is there a way to disable the AMSHM path?  I'm wondering if that
>
> would help since multi-node jobs seem to run fine.
>
>
>
> Thanks for any help,
>
>
>
> Howard
>
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18787.php
>
