Re: [OMPI devel] fixing a bug in 1.8 that's not in master

2014-10-27 Thread Ralph Castain
Just create a topic branch from v1.8 in a local clone of ompi-release, make the 
change there, and then file a PR on the ompi-release repo.

Obviously, if it is a bug solely confined to v1.8, you can’t put it in master 
first :-)
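
For reference, the sequence would look roughly like this (the repo URL, fork
remote, and branch name are only placeholders, not the actual fix):

$ git clone https://github.com/open-mpi/ompi-release.git
$ cd ompi-release
$ git checkout -b topic/v1.8-alps-build-fix origin/v1.8
$ # ... make the fix and commit it ...
$ git push <your-fork> topic/v1.8-alps-build-fix

then open the PR against the v1.8 branch of ompi-release on GitHub.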


> On Oct 27, 2014, at 3:22 PM, Howard Pritchard  wrote:
> 
> Hi Folks,
> 
> A cut-and-paste error seems to have happened in
> plm_alps_modules.c in 1.8, which causes a compile failure
> when building for Cray. So right now, there's no building
> ompi 1.8 for Cray systems.
> 
> The problem is not present in master.
> 
> For these kinds of problems, are we supposed to bypass
> all the "has to be in master, need commit, etc." stuff described in 
> 
> https://github.com/open-mpi/ompi/wiki/SubmittingPullRequests 
> 
> 
> and just go straight to pushing to a fork of ompi-release, etc.
> as per the rest of the instructions on submitting pull requests?
> 
> Just want to make sure I'm doing the right thing here.
> 
> Howard
> 
> 
> 



[OMPI devel] fixing a bug in 1.8 that's not in master

2014-10-27 Thread Howard Pritchard
Hi Folks,

A cut-and-paste error seems to have happened in
plm_alps_modules.c in 1.8, which causes a compile failure
when building for Cray. So right now, there's no building
ompi 1.8 for Cray systems.

The problem is not present in master.

For these kinds of problems, are we supposed to bypass
all the "has to be in master, need commit, etc." stuff described in

https://github.com/open-mpi/ompi/wiki/SubmittingPullRequests

and just go straight to pushing to a fork of ompi-release, etc.
as per the rest of the instructions on submitting pull requests?

Just want to make sure I'm doing the right thing here.

Howard


Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Friedley, Andrew
Hi Adrian,

I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this with 
one 8-core node):

$ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t "environment"
(Rank:0) tst_test_array[0]:Status
(Rank:0) tst_test_array[1]:Request_Null
(Rank:0) tst_test_array[2]:Type_dup
(Rank:0) tst_test_array[3]:Get_version
Number of failed tests:0

Works with various np from 8 to 32.  Your original case:

$ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"

Runs for a while and eventually hits send cancellation errors.

Any chance you could try updating your infinipath libraries?

Andrew

> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> Reber
> Sent: Monday, October 27, 2014 9:11 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> 
> This is a simpler test setup:
> 
> On 8 core machines this works:
> 
> $ mpirun  -np 8  mpi_test_suite -t "environment"
> [...]
> Number of failed tests:0
> 
> Using 9 or more cores it fails:
> 
> $ mpirun  -np 9  mpi_test_suite -t "environment"
> 
> mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4
> SP=7fff06431a70.  Backtrace:
> /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107fa9a4]
> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107eb172]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b6d0f93376a]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b6d0f963d42]
> mpi_test_suite[0x46cd00]
> mpi_test_suite[0x44434c]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6d10047d5d]
> mpi_test_suite[0x4058e9]
> ---
> Primary job  terminated normally, but 1 process returned a non-zero exit
> code.. Per user-direction, the job has been aborted.
> ---
> 
> mpi_test_suite:11212 terminated with signal 11 at PC=2b2c27d0d9a4
> SP=75020430.  Backtrace:
> /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b2c27d0d9a4]
> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b2c27cfe172]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b2c26f81384]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b2c26e4676a]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b2c26e76d42]
> mpi_test_suite[0x46cd00]
> mpi_test_suite[0x44434c]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b2c2755ad5d]
> mpi_test_suite[0x4058e9]
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
> 
>   Process name: [[47415,1],0]
>   Exit code:1
> --
> 
> 
> 
> On Mon, Oct 27, 2014 at 08:27:17AM -0700, Ralph Castain wrote:
> > I’m afraid I can’t quite decipher from all this what actually fails. Of course,
> > PSM doesn’t support dynamic operations like comm_spawn or connect_accept, so if
> > you are running those tests that just won’t work. Is that the heart of the
> > problem here?
> >
> >
> > > On Oct 27, 2014, at 1:40 AM, Adrian Reber  wrote:
> > >
> > > Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
> > > I am getting the same errors also on trunk from my newly set up MTT.
> > > Before trying to debug this I just wanted to make sure this is not a
> > > configuration error. I have following PSM packages installed:
> > >
> > > infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
> > > infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
> > > infinipath-3.1.1-363.1140_rhel6_qlc.x86_64
> > >
> > > with 1.6.5 I do not see PSM errors and the test suite fails much later:
> > >
> > > P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm
> > > Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY (28/29)
> > > P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm
> > > Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB (29/29)
> > > n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
> > > P2P tests Many-to-one with Isend and Cancellation (22/48), comm
> > > MPI_COMM_WORLD (1/13), type MPI_CHAR (1/29)
> > > n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
> > > n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
> > > n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
> > > n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
> > > n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
> > > n050304:8.0.Cannot 

[OMPI devel] Interesting warning in openib...

2014-10-27 Thread Ralph Castain
Saw this in CentOS7 using gcc 4.8.2:

btl_openib_component.c: In function 'init_one_device':
btl_openib_component.c:2019:54: warning: comparison between 'enum ' 
and 'mca_base_var_source_t' [-Wenum-compare]
 else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==

Ralph



Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Ralph Castain
Andrew@Intel is looking into it - he has some PSM patches coming that may 
resolve this already.


> On Oct 27, 2014, at 9:10 AM, Adrian Reber  wrote:
> 
> This is a simpler test setup:
> 
> On 8 core machines this works:
> 
> $ mpirun  -np 8  mpi_test_suite -t "environment"
> [...]
> Number of failed tests:0
> 
> Using 9 or more cores it fails:
> 
> $ mpirun  -np 9  mpi_test_suite -t "environment"
> 
> mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4 
> SP=7fff06431a70.  Backtrace:
> /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107fa9a4]
> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107eb172]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b6d0f93376a]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b6d0f963d42]
> mpi_test_suite[0x46cd00]
> mpi_test_suite[0x44434c]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6d10047d5d]
> mpi_test_suite[0x4058e9]
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> 
> mpi_test_suite:11212 terminated with signal 11 at PC=2b2c27d0d9a4 
> SP=75020430.  Backtrace:
> /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b2c27d0d9a4]
> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b2c27cfe172]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b2c26f81384]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b2c26e4676a]
> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b2c26e76d42]
> mpi_test_suite[0x46cd00]
> mpi_test_suite[0x44434c]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b2c2755ad5d]
> mpi_test_suite[0x4058e9]
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>  Process name: [[47415,1],0]
>  Exit code:1
> --
> 
> 
> 
> On Mon, Oct 27, 2014 at 08:27:17AM -0700, Ralph Castain wrote:
>> I’m afraid I can’t quite decipher from all this what actually fails. Of 
>> course, PSM doesn’t support dynamic operations like comm_spawn or 
>> connect_accept, so if you are running those tests that just won’t work. Is 
>> that the heart of the problem here?
>> 
>> 
>>> On Oct 27, 2014, at 1:40 AM, Adrian Reber  wrote:
>>> 
>>> Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
>>> I am getting the same errors also on trunk from my newly set up MTT.
>>> Before trying to debug this I just wanted to make sure this is not a
>>> configuration error. I have following PSM packages installed:
>>> 
>>> infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
>>> infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
>>> infinipath-3.1.1-363.1140_rhel6_qlc.x86_64
>>> 
>>> with 1.6.5 I do not see PSM errors and the test suite fails much later:
>>> 
>>> P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
>>> Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY 
>>> (28/29)
>>> P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
>>> Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB 
>>> (29/29)
>>> n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
>>> P2P tests Many-to-one with Isend and Cancellation (22/48), comm 
>>> MPI_COMM_WORLD (1/13), type MPI_CHAR (1/29)
>>> n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
>>> n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
>>> n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
>>> n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
>>> n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
>>> n050304:8.0.Cannot cancel send requests (req=0x2ae8ac16cf80)
>>> n050302:3.0.Cannot cancel send requests (req=0x2ab81dcb4d88)
>>> n050303:4.0.Cannot cancel send requests (req=0x2b9ef4ef8f80)
>>> n050303:2.0.Cannot cancel send requests (req=0x2ab0f03f9f80)
>>> n050302:9.0.Cannot cancel send requests (req=0x2b214f9ebed8)
>>> n050301:2.0.Cannot cancel send requests (req=0x2b31302d4f80)
>>> n050302:4.0.Cannot cancel send requests (req=0x2b0581bd3f80)
>>> n050301:8.0.Cannot cancel send requests (req=0x2ae53776bf80)
>>> n050303:6.0.Cannot cancel send requests (req=0x2b13eeb78f80)
>>> n050304:7.0.Cannot cancel send requests (req=0x2b4e99715f80)
>>> n050304:9.0.Cannot cancel send requests (req=0x2b10429c2f80)
>>> n050304:3.0.Cannot cancel send 

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
This is a simpler test setup:

On 8 core machines this works:

$ mpirun  -np 8  mpi_test_suite -t "environment"
[...]
Number of failed tests:0

Using 9 or more cores it fails:

$ mpirun  -np 9  mpi_test_suite -t "environment"

mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4 
SP=7fff06431a70.  Backtrace:
/usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107fa9a4]
/usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107eb172]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b6d0f93376a]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b6d0f963d42]
mpi_test_suite[0x46cd00]
mpi_test_suite[0x44434c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6d10047d5d]
mpi_test_suite[0x4058e9]
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---

mpi_test_suite:11212 terminated with signal 11 at PC=2b2c27d0d9a4 
SP=75020430.  Backtrace:
/usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b2c27d0d9a4]
/usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b2c27cfe172]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b2c26f81384]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(ompi_comm_get_rprocs+0x2fa)[0x2b2c26e4676a]
/opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-4.9/lib/libmpi.so.1(MPI_Intercomm_create+0x332)[0x2b2c26e76d42]
mpi_test_suite[0x46cd00]
mpi_test_suite[0x44434c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b2c2755ad5d]
mpi_test_suite[0x4058e9]
--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[47415,1],0]
  Exit code:1
--



On Mon, Oct 27, 2014 at 08:27:17AM -0700, Ralph Castain wrote:
> I’m afraid I can’t quite decipher from all this what actually fails. Of 
> course, PSM doesn’t support dynamic operations like comm_spawn or 
> connect_accept, so if you are running those tests that just won’t work. Is 
> that the heart of the problem here?
> 
> 
> > On Oct 27, 2014, at 1:40 AM, Adrian Reber  wrote:
> > 
> > Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
> > I am getting the same errors also on trunk from my newly set up MTT.
> > Before trying to debug this I just wanted to make sure this is not a
> > configuration error. I have following PSM packages installed:
> > 
> > infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
> > infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
> > infinipath-3.1.1-363.1140_rhel6_qlc.x86_64
> > 
> > with 1.6.5 I do not see PSM errors and the test suite fails much later:
> > 
> > P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
> > Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY 
> > (28/29)
> > P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
> > Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB 
> > (29/29)
> > n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
> > P2P tests Many-to-one with Isend and Cancellation (22/48), comm 
> > MPI_COMM_WORLD (1/13), type MPI_CHAR (1/29)
> > n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
> > n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
> > n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
> > n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
> > n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
> > n050304:8.0.Cannot cancel send requests (req=0x2ae8ac16cf80)
> > n050302:3.0.Cannot cancel send requests (req=0x2ab81dcb4d88)
> > n050303:4.0.Cannot cancel send requests (req=0x2b9ef4ef8f80)
> > n050303:2.0.Cannot cancel send requests (req=0x2ab0f03f9f80)
> > n050302:9.0.Cannot cancel send requests (req=0x2b214f9ebed8)
> > n050301:2.0.Cannot cancel send requests (req=0x2b31302d4f80)
> > n050302:4.0.Cannot cancel send requests (req=0x2b0581bd3f80)
> > n050301:8.0.Cannot cancel send requests (req=0x2ae53776bf80)
> > n050303:6.0.Cannot cancel send requests (req=0x2b13eeb78f80)
> > n050304:7.0.Cannot cancel send requests (req=0x2b4e99715f80)
> > n050304:9.0.Cannot cancel send requests (req=0x2b10429c2f80)
> > n050304:3.0.Cannot cancel send requests (req=0x2b9196f5fe30)
> > n050304:6.0.Cannot cancel send requests (req=0x2b30d6c69ed8)
> > n050301:9.0.Cannot cancel send requests (req=0x2b93c9e04f80)
> > n050303:9.0.Cannot cancel send requests (req=0x2ab4d6ce0f80)
> > n050301:5.0.Cannot cancel send requests 

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Ralph Castain
I’m afraid I can’t quite decipher from all this what actually fails. Of course, 
PSM doesn’t support dynamic operations like comm_spawn or connect_accept, so if 
you are running those tests that just won’t work. Is that the heart of the 
problem here?


> On Oct 27, 2014, at 1:40 AM, Adrian Reber  wrote:
> 
> Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
> I am getting the same errors also on trunk from my newly set up MTT.
> Before trying to debug this I just wanted to make sure this is not a
> configuration error. I have following PSM packages installed:
> 
> infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
> infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
> infinipath-3.1.1-363.1140_rhel6_qlc.x86_64
> 
> with 1.6.5 I do not see PSM errors and the test suite fails much later:
> 
> P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
> Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY 
> (28/29)
> P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm 
> Intracomm merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB 
> (29/29)
> n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
> P2P tests Many-to-one with Isend and Cancellation (22/48), comm 
> MPI_COMM_WORLD (1/13), type MPI_CHAR (1/29)
> n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
> n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
> n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
> n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
> n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
> n050304:8.0.Cannot cancel send requests (req=0x2ae8ac16cf80)
> n050302:3.0.Cannot cancel send requests (req=0x2ab81dcb4d88)
> n050303:4.0.Cannot cancel send requests (req=0x2b9ef4ef8f80)
> n050303:2.0.Cannot cancel send requests (req=0x2ab0f03f9f80)
> n050302:9.0.Cannot cancel send requests (req=0x2b214f9ebed8)
> n050301:2.0.Cannot cancel send requests (req=0x2b31302d4f80)
> n050302:4.0.Cannot cancel send requests (req=0x2b0581bd3f80)
> n050301:8.0.Cannot cancel send requests (req=0x2ae53776bf80)
> n050303:6.0.Cannot cancel send requests (req=0x2b13eeb78f80)
> n050304:7.0.Cannot cancel send requests (req=0x2b4e99715f80)
> n050304:9.0.Cannot cancel send requests (req=0x2b10429c2f80)
> n050304:3.0.Cannot cancel send requests (req=0x2b9196f5fe30)
> n050304:6.0.Cannot cancel send requests (req=0x2b30d6c69ed8)
> n050301:9.0.Cannot cancel send requests (req=0x2b93c9e04f80)
> n050303:9.0.Cannot cancel send requests (req=0x2ab4d6ce0f80)
> n050301:5.0.Cannot cancel send requests (req=0x2b6ad851ef80)
> n050303:3.0.Cannot cancel send requests (req=0x2b8ef52a0f80)
> n050301:3.0.Cannot cancel send requests (req=0x2b277a4aff80)
> n050303:7.0.Cannot cancel send requests (req=0x2ba570fa9f80)
> n050301:7.0.Cannot cancel send requests (req=0x2ba707dfbf80)
> n050302:2.0.Cannot cancel send requests (req=0x2b90f2e51e30)
> n050303:5.0.Cannot cancel send requests (req=0x2b1250ba8f80)
> n050302:8.0.Cannot cancel send requests (req=0x2b22e0129ed8)
> n050303:8.0.Cannot cancel send requests (req=0x2b6609792f80)
> n050302:6.0.Cannot cancel send requests (req=0x2b2b6081af80)
> n050302:5.0.Cannot cancel send requests (req=0x2ab24f6f1f80)
> --
> mpirun has exited due to process rank 14 with PID 4496 on
> node n050303 exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --
> [adrian@n050304 mpi_test_suite]$
> 
> and this are my PSM errors with 1.8.3:
> 
> [adrian@n050304 mpi_test_suite]$ mpirun  -np 32  mpi_test_suite -t 
> "All,^io,^one-sided"
> 
> mpi_test_suite:8904 terminated with signal 11 at PC=2b08466239a4 
> SP=703c6e30.  Backtrace:
> 
> mpi_test_suite:16905 terminated with signal 11 at PC=2ae4cad209a4 
> SP=7fffceefa730.  Backtrace:
> 
> mpi_test_suite:3171 terminated with signal 11 at PC=2b57daafe9a4 
> SP=7fff5c4b3af0.  Backtrace:
> 
> mpi_test_suite:16906 terminated with signal 11 at PC=2b4c9fa019a4 
> SP=7fffe916c330.  Backtrace:
> 
> mpi_test_suite:3172 terminated with signal 11 at PC=2b6dde92e9a4 
> SP=7fff04cf1730.  Backtrace:
> 
> mpi_test_suite:16907 terminated with signal 11 at PC=2ad6eb8589a4 
> SP=7fffc30d02f0.  Backtrace:
> 
> mpi_test_suite:3173 

Re: [OMPI devel] errno and reentrance

2014-10-27 Thread Gilles Gouaillardet
Thanks Paul,

so the simplest way is to force -D_REENTRANT on Solaris; I will do that!

Cheers,

Gilles
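
Until that change lands, a configure-time workaround should amount to passing the
flag explicitly (a sketch only; the eventual fix may live in
config/opal_config_pthreads.m4 instead):

$ ./configure CPPFLAGS="-D_REENTRANT" ...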

On 2014/10/27 19:36, Paul Hargrove wrote:
> Gilles,
>
> I responded too quickly, not thinking that this test is pretty quick and
> doesn't require that I try sparc, ppc, ia64, etc.
> So my results:
>
> Solaris-{10,11}:
>   With "cc" I agree with your findings (need -D_REENTRANT for correct
> behavior).
>   With gcc either "-pthread" or "-D_REENTRANT" gave correct behavior
>
> NetBSD-5:
>   Got "KO: error 4 (0)" no matter what I tried
>
> Linux,  FreeBSD-{9,10}, NetBSD-6, OpenBSD-5:
>   Using "-pthread" or "-lpthread" was necessary to link, and sufficient for
> correct results.
>
> MacOSX-10.{5,6,7,8}:
>   No compiler options were required for 'cc' (which has been gcc, llvm-gcc
> and clang through those OS revs)
>
> Though I have access, I did not try compute nodes on BG/Q or Cray X{E,K,C}.
> Let me know if any of those are of significant concern.
>
> I no longer have AIX or IRIX access.
>
> -Paul
>
>
> On Mon, Oct 27, 2014 at 2:48 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>>  Thanks Paul !
>>
>> Gilles
>>
>> On 2014/10/27 18:47, Paul Hargrove wrote:
>>
>> On Mon, Oct 27, 2014 at 2:42 AM, Gilles Gouaillardet 
>>  wrote:
>> [...]
>>
>>
>>  Paul, since you have access to many platforms, could you please run this
>> test with and without -D_REENTRANT / -D_THREAD_SAFE
>> and tell me where the program produces incorrect behaviour (output is
>> KO...) without the flag ?
>>
>> Thanks in advance,
>>
>> Gilles
>>
>>
>>  Gilles,
>>
>> I have a lot of things due between now and the SC14 conference.
>> I've added this test to my to-do list, but cannot be sure of how soon I'll
>> be able to get results back to you.
>>
>> Feel free to remind me off-list,
>> -Paul
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>



Re: [OMPI devel] errno and reentrance

2014-10-27 Thread Paul Hargrove
Gilles,

I responded too quickly, not thinking that this test is pretty quick and
doesn't require that I try sparc, ppc, ia64, etc.
So my results:

Solaris-{10,11}:
  With "cc" I agree with your findings (need -D_REENTRANT for correct
behavior).
  With gcc either "-pthread" or "-D_REENTRANT" gave correct behavior

NetBSD-5:
  Got "KO: error 4 (0)" no matter what I tried

Linux,  FreeBSD-{9,10}, NetBSD-6, OpenBSD-5:
  Using "-pthread" or "-lpthread" was necessary to link, and sufficient for
correct results.

MacOSX-10.{5,6,7,8}:
  No compiler options were required for 'cc' (which has been gcc, llvm-gcc
and clang through those OS revs)

Though I have access, I did not try compute nodes on BG/Q or Cray X{E,K,C}.
Let me know if any of those are of significant concern.

I no longer have AIX or IRIX access.

-Paul


On Mon, Oct 27, 2014 at 2:48 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Thanks Paul !
>
> Gilles
>
> On 2014/10/27 18:47, Paul Hargrove wrote:
>
> On Mon, Oct 27, 2014 at 2:42 AM, Gilles Gouaillardet 
>  wrote:
> [...]
>
>
>  Paul, since you have access to many platforms, could you please run this
> test with and without -D_REENTRANT / -D_THREAD_SAFE
> and tell me where the program produces incorrect behaviour (output is
> KO...) without the flag ?
>
> Thanks in advance,
>
> Gilles
>
>
>  Gilles,
>
> I have a lot of things due between now and the SC14 conference.
> I've added this test to my to-do list, but cannot be sure of how soon I'll
> be able to get results back to you.
>
> Feel free to remind me off-list,
> -Paul
>
>
>
>
>
>
>
>
>
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] errno and reentrance

2014-10-27 Thread Gilles Gouaillardet
Thanks Paul !

Gilles

On 2014/10/27 18:47, Paul Hargrove wrote:
> On Mon, Oct 27, 2014 at 2:42 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
> [...]
>
>> Paul, since you have access to many platforms, could you please run this
>> test with and without -D_REENTRANT / -D_THREAD_SAFE
>> and tell me where the program produces incorrect behaviour (output is
>> KO...) without the flag ?
>>
>> Thanks in advance,
>>
>> Gilles
>>
> Gilles,
>
> I have a lot of things due between now and the SC14 conference.
> I've added this test to my to-do list, but cannot be sure of how soon I'll
> be able to get results back to you.
>
> Feel free to remind me off-list,
> -Paul
>
>
>
>
>



Re: [OMPI devel] errno and reentrance

2014-10-27 Thread Paul Hargrove
On Mon, Oct 27, 2014 at 2:42 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
[...]

> Paul, since you have access to many platforms, could you please run this
> test with and without -D_REENTRANT / -D_THREAD_SAFE
> and tell me where the program produces incorrect behaviour (output is
> KO...) without the flag ?
>
> Thanks in advance,
>
> Gilles
>

Gilles,

I have a lot of things due between now and the SC14 conference.
I've added this test to my to-do list, but cannot be sure of how soon I'll
be able to get results back to you.

Feel free to remind me off-list,
-Paul



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] errno and reentrance

2014-10-27 Thread Gilles Gouaillardet
Folks,

While investigating an issue started at
http://www.open-mpi.org/community/lists/users/2014/10/25562.php
I found that it is mandatory to compile with -D_REENTRANT on Solaris (10
and 11); otherwise errno is not per-thread, and the pmix thread
silently misinterprets EAGAIN or EWOULDBLOCK, which leads to random
behaviour that generally terminates the application.

This is a bug / unexpected side effect introduced by me in commit
b1c4daa9567c7647318b9b673698c2251264f22e

On a RedHat 6-like server, this is not necessary.

On AIX and/or FreeBSD, it might be necessary to compile with
-D_THREAD_SAFE in order to get correct behaviour.

I wrote the simple attached program in order to check the correct
behavior with/without -D_REENTRANT or -D_THREAD_SAFE.

One option is to add an automatic test for this in
config/opal_config_pthreads.m4; another option is to hardcode the flag
for the affected OSes.

Paul, since you have access to many platforms, could you please run this
test with and without -D_REENTRANT / -D_THREAD_SAFE
and tell me where the program produces incorrect behaviour (output is
KO...) without the flag ?

Thanks in advance,

Gilles
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdio.h>

static void * fn (void * arg) {
    /* if errno is shared with the main thread, the value 1 set in main()
     * is visible here and we bail out */
    if (errno == 1) {
        return (void *)-1;
    }
    read(0, NULL, 0);
    if (errno != 0) {
        return (void *)-2;
    }
    errno = 2;
    return NULL;
}

int main (int argc, char *argv[]) {
    pthread_t t;
    void *s = NULL;
    errno = 1;
    if (pthread_create(&t, NULL, fn, NULL) < 0) {
        perror ("pthread_create ");
        return 1;
    }
    if (pthread_join(t, &s) < 0) {
        perror ("pthread_join ");
        return 2;
    }
    if (NULL != s) {
        fprintf(stderr, "KO: error 3 (%ld)\n", (long)s);
        return 3;
    } else if (2 == errno) {
        /* the errno = 2 set inside the thread leaked into the main thread */
        fprintf(stderr, "KO: error 4 (%ld)\n", (long)s);
        return 4;
    } else {
        fprintf(stderr, "OK\n");
        return 0;
    }
}
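
For reference, the compile-and-run sweep being asked for above would look something
like this (the source file name is just a placeholder for the attachment, and the
exact link flag differs per platform, e.g. -pthread vs -lpthread):

$ cc -o errno-test errno-test.c -lpthread                 # without the flag
$ ./errno-test
$ cc -D_REENTRANT -o errno-test errno-test.c -lpthread    # with -D_REENTRANT
$ ./errno-test
$ cc -D_THREAD_SAFE -o errno-test errno-test.c -lpthread  # with -D_THREAD_SAFE
$ ./errno-test

A platform needs the flag if the unflagged run prints "KO: ..." while the flagged
run prints "OK".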


[OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
I am getting the same errors also on trunk from my newly set up MTT.
Before trying to debug this I just wanted to make sure this is not a
configuration error. I have following PSM packages installed:

infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
infinipath-3.1.1-363.1140_rhel6_qlc.x86_64

with 1.6.5 I do not see PSM errors and the test suite fails much later:

P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm Intracomm 
merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_ARRAY (28/29)
P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm Intracomm 
merged of the Halved Intercomm (13/13), type MPI_TYPE_MIX_LB_UB (29/29)
n050304:5.0.Cannot cancel send requests (req=0x2ad8ba881f80)
P2P tests Many-to-one with Isend and Cancellation (22/48), comm MPI_COMM_WORLD 
(1/13), type MPI_CHAR (1/29)
n050304:2.0.Cannot cancel send requests (req=0x2b25143fbd88)
n050302:7.0.Cannot cancel send requests (req=0x2b4d95eb0f80)
n050301:4.0.Cannot cancel send requests (req=0x2adf03e14f80)
n050304:4.0.Cannot cancel send requests (req=0x2ad877257ed8)
n050301:6.0.Cannot cancel send requests (req=0x2ba47634af80)
n050304:8.0.Cannot cancel send requests (req=0x2ae8ac16cf80)
n050302:3.0.Cannot cancel send requests (req=0x2ab81dcb4d88)
n050303:4.0.Cannot cancel send requests (req=0x2b9ef4ef8f80)
n050303:2.0.Cannot cancel send requests (req=0x2ab0f03f9f80)
n050302:9.0.Cannot cancel send requests (req=0x2b214f9ebed8)
n050301:2.0.Cannot cancel send requests (req=0x2b31302d4f80)
n050302:4.0.Cannot cancel send requests (req=0x2b0581bd3f80)
n050301:8.0.Cannot cancel send requests (req=0x2ae53776bf80)
n050303:6.0.Cannot cancel send requests (req=0x2b13eeb78f80)
n050304:7.0.Cannot cancel send requests (req=0x2b4e99715f80)
n050304:9.0.Cannot cancel send requests (req=0x2b10429c2f80)
n050304:3.0.Cannot cancel send requests (req=0x2b9196f5fe30)
n050304:6.0.Cannot cancel send requests (req=0x2b30d6c69ed8)
n050301:9.0.Cannot cancel send requests (req=0x2b93c9e04f80)
n050303:9.0.Cannot cancel send requests (req=0x2ab4d6ce0f80)
n050301:5.0.Cannot cancel send requests (req=0x2b6ad851ef80)
n050303:3.0.Cannot cancel send requests (req=0x2b8ef52a0f80)
n050301:3.0.Cannot cancel send requests (req=0x2b277a4aff80)
n050303:7.0.Cannot cancel send requests (req=0x2ba570fa9f80)
n050301:7.0.Cannot cancel send requests (req=0x2ba707dfbf80)
n050302:2.0.Cannot cancel send requests (req=0x2b90f2e51e30)
n050303:5.0.Cannot cancel send requests (req=0x2b1250ba8f80)
n050302:8.0.Cannot cancel send requests (req=0x2b22e0129ed8)
n050303:8.0.Cannot cancel send requests (req=0x2b6609792f80)
n050302:6.0.Cannot cancel send requests (req=0x2b2b6081af80)
n050302:5.0.Cannot cancel send requests (req=0x2ab24f6f1f80)
--
mpirun has exited due to process rank 14 with PID 4496 on
node n050303 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[adrian@n050304 mpi_test_suite]$

and this are my PSM errors with 1.8.3:

[adrian@n050304 mpi_test_suite]$ mpirun  -np 32  mpi_test_suite -t 
"All,^io,^one-sided"

mpi_test_suite:8904 terminated with signal 11 at PC=2b08466239a4 
SP=703c6e30.  Backtrace:

mpi_test_suite:16905 terminated with signal 11 at PC=2ae4cad209a4 
SP=7fffceefa730.  Backtrace:

mpi_test_suite:3171 terminated with signal 11 at PC=2b57daafe9a4 
SP=7fff5c4b3af0.  Backtrace:

mpi_test_suite:16906 terminated with signal 11 at PC=2b4c9fa019a4 
SP=7fffe916c330.  Backtrace:

mpi_test_suite:3172 terminated with signal 11 at PC=2b6dde92e9a4 
SP=7fff04cf1730.  Backtrace:

mpi_test_suite:16907 terminated with signal 11 at PC=2ad6eb8589a4 
SP=7fffc30d02f0.  Backtrace:

mpi_test_suite:3173 terminated with signal 11 at PC=2b2e4aec89a4 
SP=7fffa054e230.  Backtrace:

mpi_test_suite:16908 terminated with signal 11 at PC=2b4e6e5589a4 
SP=7fff68c7a1f0.  Backtrace:

mpi_test_suite:3174 terminated with signal 11 at PC=2b7049b279a4 
SP=7fff99a49f70.  Backtrace:

mpi_test_suite:16909 terminated with signal 11 at PC=2b252219d9a4 
SP=7fff72a0c6b0.  Backtrace:

mpi_test_suite:3175 terminated with signal 11 at PC=2ac8d5caf9a4 
SP=7fff6d7a63f0.  Backtrace:

mpi_test_suite:16910 terminated with signal 11 at