[OMPI devel] Slurm support in master

2015-09-08 Thread Ralph Castain
Hi folks

I’ve poked around this evening and gotten the Slurm support in master to at 
least build, and mpirun now works correctly under a Slurm job allocation. 
This should all be committed as soon as auto-testing completes:

https://github.com/open-mpi/ompi/pull/877

Howard/Nathan: I believe I fixed mpirun for ALPS too - please check.

Direct launch under Slurm still segfaults, and I’m out of time to chase it down. 
Could someone please take a look? It seems to have something to do with the 
hash table support in the base, but I’m not sure what the problem is.

Thanks
Ralph



Re: [OMPI devel] Cross-job disconnect is broken

2015-09-08 Thread Jeff Squyres (jsquyres)
On Sep 8, 2015, at 4:59 PM, George Bosilca  wrote:
> 
> Why would anyone use connect/accept (or join) between processes on the same 
> job? The only environment where such functionality makes sense is one where 
> disjoint applications (think the computing part and the visualization part) are 
> able to connect together. There are applications that use such a model, but I 
> bet they don't use OMPI.

FWIW, we have a few tests that use this functionality, IIRC.  I think it makes 
it easy to test various code paths (i.e., we don't *actually* have to spawn - 
we could create intercommunicators and/or test the accept/connect and join code 
paths, etc.).
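
To make that concrete, here is a minimal sketch of that kind of test. This is my
own illustration of the standard MPI connect/accept calls, not one of our actual
test programs; the port string is assumed to be handed to the client out of band
(here simply via argv):

/* Illustrative sketch only: build an intercommunicator via connect/accept
 * without spawning.  The port name is passed to the client out of band. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);

    if (argc > 1) {
        /* Client: port name supplied on the command line. */
        snprintf(port, sizeof(port), "%s", argv[1]);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
    } else {
        /* Server: open a port and print it so it can be given to the client. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
        MPI_Close_port(port);
    }

    /* ... exercise the intercommunicator, then tear it down ... */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}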

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Cross-job disconnect is broken

2015-09-08 Thread Ralph Castain
It’s called comm_spawn, which involves the connect/accept code after launch :-)
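
For context, the parent side of that path looks roughly like the sketch below.
This is a hedged illustration of the standard MPI_Comm_spawn call, not our
internals, and "./child" is just a placeholder executable name. The children
come up as a separate job, and the wire-up afterwards goes through the same
connect/accept code being discussed in this thread:

/* Hedged sketch: parent side of a comm_spawn.  After the spawn returns,
 * the parent and the children are joined by an intercommunicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);
    /* ... talk to the children over the intercommunicator ... */
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}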


> On Sep 8, 2015, at 1:59 PM, George Bosilca  wrote:
> 
> Why would anyone use connect/accept (or join) between processes on the same 
> job? The only environment where such functionality makes sense is one where 
> disjoint applications (think the computing part and the visualization part) are 
> able to connect together. There are applications that use such a model, but I 
> bet they don't use OMPI.
> 
>   George.
> 
> 
> On Tue, Sep 8, 2015 at 4:50 PM, Jeff Squyres (jsquyres) wrote:
> On Sep 7, 2015, at 5:07 PM, Ralph Castain wrote:
> >
> > * two jobs started by the same mpirun - supported today by ORTE
> >
> > * two jobs started by different mpiruns - we used to support, but is broken 
> > in grpcomm/barrier
> >
> > * two direct-launched jobs  - never supported
> >
> > * one direct-launched job and one started by mpirun  - never supported
> >
> > Given the lack of use out there, I don’t see a reason to hold the release of the 
> > 2.x series over this issue. Will keep you posted on progress towards a 
> > resolution.
> 
> +1
> 
> That being said, I think it *would* be useful to be able to connect/accept 
> between "two jobs started by different mpiruns."  It's a bit of a 
> chicken-n-egg problem: no one does it because no one supports it, and vice 
> versa.
> 
> I agree it's low in the priority list, but I'd put the "two jobs started by 
> different mpiruns" in the "nice to have" category.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/ 



Re: [OMPI devel] Cross-job disconnect is broken

2015-09-08 Thread George Bosilca
Why would anyone use connect/accept (or join) between processes on the same
job? The only environment where such functionality makes sense is one where
disjoint applications (think the computing part and the visualization part) are
able to connect together. There are applications that use such a model, but
I bet they don't use OMPI.

  George.


On Tue, Sep 8, 2015 at 4:50 PM, Jeff Squyres (jsquyres) wrote:

> On Sep 7, 2015, at 5:07 PM, Ralph Castain  wrote:
> >
> > * two jobs started by the same mpirun - supported today by ORTE
> >
> > * two jobs started by different mpiruns - we used to support, but is
> > broken in grpcomm/barrier
> >
> > * two direct-launched jobs  - never supported
> >
> > * one direct-launched job and one started by mpirun  - never supported
> >
> > Given the lack of use out there, I don’t see a reason to hold the release of the
> > 2.x series over this issue. Will keep you posted on progress towards a
> > resolution.
>
> +1
>
> That being said, I think it *would* be useful to be able to connect/accept
> between "two jobs started by different mpiruns."  It's a bit of a
> chicken-n-egg problem: no one does it because no one supports it, and vice
> versa.
>
> I agree it's low in the priority list, but I'd put the "two jobs started
> by different mpiruns" in the "nice to have" category.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


Re: [OMPI devel] Cross-job disconnect is broken

2015-09-08 Thread Jeff Squyres (jsquyres)
On Sep 7, 2015, at 5:07 PM, Ralph Castain  wrote:
> 
> * two jobs started by the same mpirun - supported today by ORTE
> 
> * two jobs started by different mpiruns - we used to support, but is broken 
> in grpcomm/barrier
> 
> * two direct-launched jobs  - never supported
> 
> * one direct-launched job and one started by mpirun  - never supported
> 
> Given the lack of use out there, I don’t see a reason to hold the release of the 2.x 
> series over this issue. Will keep you posted on progress towards a resolution.

+1

That being said, I think it *would* be useful to be able to connect/accept 
between "two jobs started by different mpiruns."  It's a bit of a chicken-n-egg 
problem: no one does it because no one supports it, and vice versa.

I agree it's low in the priority list, but I'd put the "two jobs started by 
different mpiruns" in the "nice to have" category.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] psm mtl weirdness

2015-09-08 Thread Friedley, Andrew
Hi Howard,

Is this new behavior?

Do you see the error if you set PSM_DEVICES=shm,self?  The PSM MTL should be 
setting this on its own, but maybe something changed.
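
For reference, forcing it by hand would look roughly like the sketch below.
This is only a hedged illustration of the idea, not the actual MTL source;
where and how the MTL calls setenv() may differ:

/* Sketch: restrict PSM to the shared-memory and self devices for a
 * single-node run.  Must happen before the PSM library is initialized. */
#include <stdlib.h>

static void force_single_node_psm_devices(void)
{
    /* Last argument 0: don't overwrite a value the user already exported. */
    setenv("PSM_DEVICES", "shm,self", 0);
}

Equivalently, exporting PSM_DEVICES=shm,self in the environment before launching
should tell us whether the MTL's own setting is what changed.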

Andrew

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Howard Pritchard
Sent: Tuesday, September 8, 2015 10:06 AM
To: Open MPI Developers List
Subject: [OMPI devel] psm mtl weirdness

Hi PSM folks,

I'm noticing some weirdness on master using the psm mtl.
If I run multi-node, I don't see a problem.  If I run using only a
single node, however, and use more than 1 rank, then I get
a timeout in psm_ep_connect.

On ompi-release I also observe this problem, but it seems
to be more sporadic.

I don't think this has anything to do with the pmix work.

I do not have access to a system using psm2, so can't
check to see if the problem also occurs there.

Thanks for any ideas on how to debug this.

Howard



Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-08 Thread Jeff Squyres (jsquyres)
Thanks Adrian; I turned this into https://github.com/open-mpi/ompi/issues/874.

> On Sep 8, 2015, at 9:56 AM, Adrian Reber  wrote:
> 
> Since a few days the MTT runs on my ppc64 systems are failing with:
> 
> [bimini:11716] *** Process received signal ***
> [bimini:11716] Signal: Segmentation fault (11)
> [bimini:11716] Signal code: Address not mapped (1)
> [bimini:11716] Failing at address: (nil)
> [bimini:11716] [ 0] [0x3fffa2bb0448]
> [bimini:11716] [ 1] /lib64/libc.so.6(+0xcb074)[0x3fffa27eb074]
> [bimini:11716] [ 2] /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(opal_pmix_pmix1xx_pmix_value_xfer-0x68758)[0x3fffa2158a10]
> [bimini:11716] [ 3] /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(OPAL_PMIX_PMIX1XX_PMIx_Put-0x48338)[0x3fffa2179f70]
> [bimini:11716] [ 4] /home/adrian/mtt-scratch/installs/GubX/install/lib/openmpi/mca_pmix_pmix1xx.so(pmix1_put-0x27efc)[0x3fffa21d858c]
> 
> I don't think I see this kind of error on any of the other MTT setups,
> so it might be ppc64-related. Just wanted to point it out.
> 
>   Adrian


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] psm mtl weirdness

2015-09-08 Thread Howard Pritchard
Hi PSM folks,

I'm noticing some weirdness on master using the psm mtl.
If I run multi-node, I don't see a problem.  If I run using only a
single node, however, and use more than 1 rank, then I get
a timeout in psm_ep_connect.

On ompi-release I also observe this problem, but it seems
to be more sporadic.

I don't think this has anything to do with the pmix work.

I do not have access to a system using psm2, so can't
check to see if the problem also occurs there.

Thanks for any ideas on how to debug this.

Howard


[OMPI devel] MTT failures since the last few days on ppc64

2015-09-08 Thread Adrian Reber
Since a few days the MTT runs on my ppc64 systems are failing with:

[bimini:11716] *** Process received signal ***
[bimini:11716] Signal: Segmentation fault (11)
[bimini:11716] Signal code: Address not mapped (1)
[bimini:11716] Failing at address: (nil)
[bimini:11716] [ 0] [0x3fffa2bb0448]
[bimini:11716] [ 1] /lib64/libc.so.6(+0xcb074)[0x3fffa27eb074]
[bimini:11716] [ 2] /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(opal_pmix_pmix1xx_pmix_value_xfer-0x68758)[0x3fffa2158a10]
[bimini:11716] [ 3] /home/adrian/mtt-scratch/installs/GubX/install/lib/libpmix.so.0(OPAL_PMIX_PMIX1XX_PMIx_Put-0x48338)[0x3fffa2179f70]
[bimini:11716] [ 4] /home/adrian/mtt-scratch/installs/GubX/install/lib/openmpi/mca_pmix_pmix1xx.so(pmix1_put-0x27efc)[0x3fffa21d858c]

I don't think I see this kind of error on any of the other MTT setups,
so it might be ppc64-related. Just wanted to point it out.

Adrian