Hi Matias,

I'm running my usual favorites, ompi/examples/hello_c.c and ompi/examples/ring_c.c.
If I disable the shared memory device using the PSM2_DEVICES option,
it looks like psm2 is unhappy:


[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be
reached):

[kit001.localdomain:08222]  kit001

[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):

[kit001.localdomain:08222]  kit001

 psm2_ep_connect returned 41

[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):

[kit001.localdomain:08221]  kit001

[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be
reached):

[kit001.localdomain:08221]  kit001

leaving ompi_mtl_psm2_add_procs nprocs 2


I went back and tried again with the OFI MTL (without PSM2_DEVICES set),
and that works correctly on a single node.

I get this same psm2_ep_connect timeout using mpirun, so it's not a
SLURM-specific problem.
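
For anyone trying to reproduce this, the two single-node configurations compared above could be launched roughly as in the sketch below. These are not the exact command lines used: the process count, the choice of ring_c, the explicit "--mca pml cm" selection, and the device list "self,hfi" (self plus the HFI device, leaving out shm) are all assumptions for illustration. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: process count, binary name, and device list are assumptions.
NP=2            # more than one process on a single node
EXE=./ring_c    # built from the Open MPI examples directory

# PSM2 MTL with the shared memory device left out of PSM2_DEVICES.
# -x exports the variable to the launched processes.
psm2_cmd="mpirun -np $NP --mca pml cm --mca mtl psm2 -x PSM2_DEVICES=self,hfi $EXE"

# OFI MTL, with PSM2_DEVICES not set at all.
ofi_cmd="mpirun -np $NP --mca pml cm --mca mtl ofi $EXE"

echo "$psm2_cmd"
echo "$ofi_cmd"

# Only launch when explicitly requested, e.g. RUN=1 sh compare_mtl.sh
if [ "${RUN:-0}" = "1" ]; then
    $psm2_cmd
    $ofi_cmd
fi
```

Running both back to back on the same node makes it easier to tell whether a failure follows the PSM2 MTL or the launcher.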

2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Howard,
>
>
>
> A couple more questions to better understand the context:
>
> -          What type of job are you running?
>
> -          Is this also under srun?
>
>
>
> For PSM2 you may find more details in the programmer’s guide:
>
>
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
>
>
> To disable shared memory:
>
> Section 2.7.1:
>
> PSM2_DEVICES="self,hfi"
>
>
>
> Thanks,
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 11:04 AM
> *To:* Open MPI Developers List <de...@open-mpi.org>
> *Subject:* [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Folks,
>
>
>
> I'm making progress with issue #1559 (patches on the mail list didn't
> help),
>
> and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
>
> noticing something more troublesome.
>
>
>
> If I run on just one node, and I use more than one process, process zero
>
> consistently hangs in psm2_ep_connect.
>
>
>
> I've tried using the psm2 code on github - at sha e951cf31, but I still see
>
> the same behavior.
>
>
>
> The PSM2 related rpms installed on our system are:
>
>
>
> infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> hfi1-psm-0.7-221.ch6.x86_64
>
> hfi1-psm-devel-0.7-221.ch6.x86_64
>
> infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> Should we get newer RPMs installed?
>
>
>
> Is there a way to disable the AMSHM path?  I'm wondering if that
>
> would help, since multi-node jobs seem to run fine.
>
>
>
> Thanks for any help,
>
>
>
> Howard
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
>
