Hi Matias,

I updated issue 1559 with the requested info.
It might be simpler to just switch over to using the issue
for tracking this conversation?

I don't want to be posting emails with big attachments to this
list.

Thanks,

Howard


2016-04-20 19:21 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:

> Hi Howard,
>
>
>
> I’ve been playing with the same version of psm (
> hfi1-psm-0.7-221.ch6.x86_64) but cannot yet reproduce the issue.  Just in
> case, please share the version of the driver you have installed
> (hfi1-X.XX-XX.x86_64.rpm, modinfo hfi1).
>
>
>
> What I can tell so far is that I still suspect this is related to the
> job_id that OMPI uses to generate the unique job key, which psm in turn
> uses to generate the epid. Looking at logfile.busted, I see some entries
> for ‘epid 10000’. This can only happen if psm2_ep_open() is called with a
> unique job key of 1 and with the PSM2 hfi device disabled (only shm
> communication expected). In your workaround (hfi enabled), the epid
> generation goes through a different path that includes the HFI LID, which
> ends up with a different number.  However, I hardcoded the above case (to
> get epid 10000) and I still see hello_c running with stock OMPI 1.10.2.
>
>
> Would you please try forcing a different jobid and share the results?
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Wednesday, April 20, 2016 8:49 AM
>
> *To:* Open MPI Developers <de...@open-mpi.org>
> *Subject:* Re: [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Matias,
>
>
>
> Actually, I found the problem.  I kept wondering why the OFI MTL works
> fine but the PSM2 MTL doesn't.  When I cranked up the debugging level, I
> noticed that the OFI MTL doesn't mess with the PSM2_DEVICES env variable,
> so PSM2 tries all three "devices" as part of initialization.  The PSM2
> MTL, however, sets PSM2_DEVICES to not include hfi.  If I comment out
> those lines of code in the PSM2 MTL, my one-node problem vanishes.
>
>
>
> I suspect there's some setup code when "initializing" the hfi device that
> is actually required even when using the shm device for on-node messages.
>
>
>
> Is there by any chance some psm2 device driver parameter setting that
> might result in this behavior?
>
>
>
> Anyway, I set PSM2_TRACEMASK to 0xFFFF and got a bunch of output that
> might be helpful.  I attached the log files to issue 1559.
>
>
>
> For now, I will open a PR with fixes to get the PSM2 MTL working on our
> omnipath clusters.
>
>
>
> I don't think this problem has anything to do with SLURM except for the
> jobid manipulation to generate the unique key.
>
>
>
> Howard
>
>
>
>
>
> 2016-04-19 17:18 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
>
> Howard,
>
>
>
> Regarding PSM2_DEVICES: I went back to the roots and found that shm is the
> only device supporting communication between ranks on the same node.
> Therefore, the “Endpoint could not be reached” error below would be expected.
>
>
>
> Back to the psm2_ep_connect() hang: I cloned the same psm2 as you have
> from github and have hello_c and ring_c running with 80 ranks on a local
> node using the PSM2 MTL. I do not have any SLURM setup on my system.  I
> will proceed to set up SLURM to see if I can reproduce the issue with it.
> In the meantime, please share any extra details you find relevant.
>
>
>
> Thanks,
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 12:21 PM
> *To:* Open MPI Developers <de...@open-mpi.org>
> *Subject:* Re: [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Matias,
>
>
>
> My usual favorites in ompi/examples/hello_c.c and ompi/examples/ring_c.c.
>
> If I disable the shared memory device using the PSM2_DEVICES option,
> it looks like psm2 is unhappy:
>
>
>
>
>
> [kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
> [kit001.localdomain:08222]  kit001
> [kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
> [kit001.localdomain:08222]  kit001
> psm2_ep_connect returned 41
> [kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
> [kit001.localdomain:08221]  kit001
> [kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
> [kit001.localdomain:08221]  kit001
> leaving ompi_mtl_psm2_add_procs nprocs 2
>
>
>
> I went back and tried again with the OFI MTL (without PSM2_DEVICES set)
> and that works correctly on a single node.
>
> I get this same psm2_ep_connect timeout using mpirun, so it's not a
> SLURM-specific problem.
>
>
>
> 2016-04-19 12:25 GMT-06:00 Cabral, Matias A <matias.a.cab...@intel.com>:
>
> Hi Howard,
>
>
>
> A couple more questions to better understand the context:
>
> - What type of job are you running?
>
> - Is this also under srun?
>
>
>
> For PSM2 you may find more details in the programmer’s guide:
>
>
> http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf
>
>
>
> To disable shared memory, see Section 2.7.1:
>
> PSM2_DEVICES="self,hfi"
>
>
>
> Thanks,
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Tuesday, April 19, 2016 11:04 AM
> *To:* Open MPI Developers List <de...@open-mpi.org>
> *Subject:* [OMPI devel] PSM2 Intel folks question
>
>
>
> Hi Folks,
>
>
>
> I'm making progress with issue #1559 (patches on the mail list didn't
> help), and I'll open a PR to help the PSM2 MTL work on a single node, but
> I'm noticing something more troublesome.
>
>
>
> If I run on just one node, and I use more than one process, process zero
> consistently hangs in psm2_ep_connect.
>
>
>
> I've tried using the psm2 code on github - at sha e951cf31, but I still see
> the same behavior.
>
>
>
> The PSM2-related rpms installed on our system are:
>
> infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
> hfi1-psm-0.7-221.ch6.x86_64
> hfi1-psm-devel-0.7-221.ch6.x86_64
> infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64
>
> Should we get newer rpms installed?
>
>
>
> Is there a way to disable the AMSHM path?  I'm wondering if that
> would help, since multi-node jobs seem to run fine.
>
>
>
> Thanks for any help,
>
>
>
> Howard
>
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
>
>
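As a rough illustration of the settings discussed in this thread, the
following Python sketch builds an environment that keeps hfi in
PSM2_DEVICES and sets PSM2_TRACEMASK to 0xFFFF, then assembles a plain
mpirun command line for a single-node run. The variable names and values
come from the thread; the helper names (psm2_env, mpirun_cmd), the
"self,shm,hfi" ordering, and the ./hello_c path are assumptions made for
illustration only, not anything taken from Open MPI or PSM2 itself.

```python
import os


def psm2_env(devices=("self", "shm", "hfi"), tracemask=0xFFFF, base=None):
    """Return a copy of the environment with PSM2 debug settings applied.

    Hypothetical helper: only the variable names PSM2_DEVICES and
    PSM2_TRACEMASK are taken from the thread.
    """
    env = dict(os.environ if base is None else base)
    # PSM2_DEVICES is a comma-separated device list; keeping "hfi" in it
    # mirrors the workaround described in the thread.
    env["PSM2_DEVICES"] = ",".join(devices)
    # 0xFFFF is the trace mask value mentioned in the thread.
    env["PSM2_TRACEMASK"] = f"0x{tracemask:X}"
    return env


def mpirun_cmd(nprocs, program):
    """Assemble a plain mpirun command line (hypothetical helper)."""
    return ["mpirun", "-np", str(nprocs), program]


if __name__ == "__main__":
    env = psm2_env()
    cmd = mpirun_cmd(2, "./hello_c")  # assumed path to the example binary
    print("PSM2_DEVICES=" + env["PSM2_DEVICES"])
    print("PSM2_TRACEMASK=" + env["PSM2_TRACEMASK"])
    print(" ".join(cmd))
    # To actually launch the job, one could pass both to subprocess:
    #   subprocess.run(cmd, env=env, check=True)
```

Building the environment separately from the command makes it easy to
compare runs with and without hfi in the device list, for example
psm2_env(devices=("self", "shm")).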
