[OMPI devel] Bcol/mcol violations

2014-02-06 Thread Ralph Castain
As many of you will have noticed, I have been struggling most of the evening 
with breakage on the trunk. This was initiated by adding .ompi_ignore to the 
coll/ml component, but the root cause of the problem is a blatant disregard for 
OMPI design rules in the bcol framework. Component-level headers from the 
coll/ml area have been included in multiple places throughout the bcol 
framework, making it impossible to separate these two areas.

Unfortunately, this problem has now been propagated to the 1.7 branch. As 
release manager, I'm afraid that places me in a difficult position, and I'm 
going to have to insist that this either be fixed immediately (i.e., within the 
next 24 hours) or that I rescind/delete that area from the 1.7 branch and release 
an immediate 1.7.5 (with attendant apologies to the community for the screwup). 
We will then proceed with our intended plan, minus the bcol code.

I'd appreciate someone letting me know if this problem (a) can even be fixed, 
given the degree of cross-connection I see in the bcol code, and (b) if it can, 
then by when.

Thanks
Ralph





Re: [OMPI devel] singleton appears to be broken

2014-02-06 Thread Ralph Castain
Interesting - does it happen in finalize, or in the middle of execution?


On Feb 6, 2014, at 5:57 PM, George Bosilca  wrote:

> Out of 150 runs I could reproduce it once. When it failed I got exactly the 
> same assert:
> 
> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) 
> == ((opal_object_t *) (recv))->obj_magic_id' failed.
> 
> A quick look at the code indicates it is in a rather obscure execution path, 
> when one cancels a pending receive. The assert indicates that the receive 
> object was already destroyed (somewhere else) when it got removed from the 
> orte_rml_base.posted_recvs queue.
> 
> George.
> 
> 
> On Feb 7, 2014, at 02:22 , George Bosilca  wrote:
> 
>> A rather long configure line:
>> 
>> ./configure --enable-picky --enable-debug --enable-coverage 
>> --disable-heterogeneous --enable-visibility --enable-contrib-no-build=vt 
>> --enable-mpirun-prefix-by-default --disable-mpi-cxx --with-cma 
>> --enable-static 
>> --enable-mca-no-build=plm-tm,ess-tm,ras-tm,plm-tm,ras-slurm,ess-slurm,plm-slurm,btl-sctp
>> 
>> And hello_world.c from ompi-tests compiled using: 
>> mpicc -g --coverage hello.c -o hello
>> 
>> George.
>> 
>> 
>> On Feb 7, 2014, at 01:11 , Ralph Castain  wrote:
>> 
>>> Oh, should have noted: that's on both trunk and 1.7.4
>>> 
>>> On Feb 6, 2014, at 4:10 PM, Ralph Castain  wrote:
>>> 
 Works for me on Mac and Linux/Centos6.2 as well
 
 
 On Feb 6, 2014, at 4:00 PM, Jeff Squyres (jsquyres)  
 wrote:
 
> I'm unable to replicate on Linux/RHEL/64 bit with a trunk build.  How did 
> you configure?  Here's my configure:
> 
> ./configure --prefix=/home/jsquyres/bogus --disable-vt 
> --enable-mpirun-prefix-by-default --disable-mpi-fortran
> 
> Does this happen with every run?
> 
> 
> On Feb 6, 2014, at 6:53 PM, George Bosilca  wrote:
> 
>> A singleton hello_world assert with the following output:
>> 
>> Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the list 
>> 0x7f2cd9161ae0
>> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
>> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 
>> 0xdeafbeedULL) == ((opal_object_t *) (recv))->obj_magic_id' failed.
>> [dancer:00698] *** Process received signal ***
>> [dancer:00698] Signal: Aborted (6)
>> [dancer:00698] Signal code:  (-6)
>> [dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
>> [dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
>> [dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
>> [dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
>> [dancer:00698] [ 4] 
>> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
>> [dancer:00698] [ 5] 
>> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
>> [dancer:00698] [ 6] 
>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
>> [dancer:00698] [ 7] 
>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
>> [dancer:00698] [ 8] 
>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
>> [dancer:00698] [ 9] 
>> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
>> [dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
>> [dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
>> [dancer:00698] *** End of error message ***
>> 
>> The same executable run via mpirun with a single process succeeds. This 
>> is with trunk; I did not try with the release.
>> 
>> George.



Re: [OMPI devel] singleton appears to be broken

2014-02-06 Thread George Bosilca
Out of 150 runs I could reproduce it once. When it failed I got exactly the 
same assert:

hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (recv))->obj_magic_id' failed.

A quick look at the code indicates it is in a rather obscure execution path, 
when one cancels a pending receive. The assert indicates that the receive object 
was already destroyed (somewhere else) when it got removed from the 
orte_rml_base.posted_recvs queue.

George.
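
Background on the assert itself: in debug builds, OPAL objects carry a 64-bit
magic number that construction sets and destruction clears, so any later use of
a destroyed object trips exactly this assertion. A standalone sketch of the
pattern (names and fields are illustrative, not the actual opal_object_t
internals):

#include <assert.h>
#include <stdint.h>

#define OBJ_MAGIC  ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

typedef struct {
    uint64_t obj_magic_id;   /* debug-only canary, as in opal_object_t */
    /* ... payload ... */
} object_t;

static void obj_construct(object_t *o) { o->obj_magic_id = OBJ_MAGIC; }
static void obj_destruct (object_t *o) { o->obj_magic_id = 0; /* scrambled */ }

static void list_remove(object_t *o)
{
    /* fires if o was already destructed elsewhere, which is the failure
       George describes for the posted recv */
    assert(OBJ_MAGIC == o->obj_magic_id);
}

int main(void)
{
    object_t o;
    obj_construct(&o);
    obj_destruct(&o);   /* destroyed "somewhere else" first ... */
    list_remove(&o);    /* ... so this aborts with the same kind of assert */
    return 0;
}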


On Feb 7, 2014, at 02:22 , George Bosilca  wrote:

> A rather long configure line:
> 
> ./configure --enable-picky --enable-debug --enable-coverage 
> --disable-heterogeneous --enable-visibility --enable-contrib-no-build=vt 
> --enable-mpirun-prefix-by-default --disable-mpi-cxx --with-cma --enable-static 
> --enable-mca-no-build=plm-tm,ess-tm,ras-tm,plm-tm,ras-slurm,ess-slurm,plm-slurm,btl-sctp
> 
> And hello_world.c from ompi-tests compiled using: 
>  mpicc -g --coverage hello.c -o hello
> 
>  George.
> 
> 
> On Feb 7, 2014, at 01:11 , Ralph Castain  wrote:
> 
>> Oh, should have noted: that's on both trunk and 1.7.4
>> 
>> On Feb 6, 2014, at 4:10 PM, Ralph Castain  wrote:
>> 
>>> Works for me on Mac and Linux/Centos6.2 as well
>>> 
>>> 
>>> On Feb 6, 2014, at 4:00 PM, Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
 I'm unable to replicate on Linux/RHEL/64 bit with a trunk build.  How did 
 you configure?  Here's my configure:
 
 ./configure --prefix=/home/jsquyres/bogus --disable-vt 
 --enable-mpirun-prefix-by-default --disable-mpi-fortran
 
 Does this happen with every run?
 
 
 On Feb 6, 2014, at 6:53 PM, George Bosilca  wrote:
 
> A singleton hello_world assert with the following output:
> 
> Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the list 
> 0x7f2cd9161ae0
> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 
> 0xdeafbeedULL) == ((opal_object_t *) (recv))->obj_magic_id' failed.
> [dancer:00698] *** Process received signal ***
> [dancer:00698] Signal: Aborted (6)
> [dancer:00698] Signal code:  (-6)
> [dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
> [dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
> [dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
> [dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
> [dancer:00698] [ 4] 
> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
> [dancer:00698] [ 5] 
> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
> [dancer:00698] [ 6] 
> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
> [dancer:00698] [ 7] 
> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
> [dancer:00698] [ 8] 
> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
> [dancer:00698] [ 9] 
> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
> [dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
> [dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
> [dancer:00698] *** End of error message ***
> 
> The same executable run via mpirun with a single process succeeds. This is 
> with trunk; I did not try with the release.
> 
> George.



Re: [OMPI devel] singleton appears to be broken

2014-02-06 Thread George Bosilca
A rather long configure line:

./configure --enable-picky --enable-debug --enable-coverage --disable-heterogeneous 
--enable-visibility --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default 
--disable-mpi-cxx --with-cma --enable-static 
--enable-mca-no-build=plm-tm,ess-tm,ras-tm,plm-tm,ras-slurm,ess-slurm,plm-slurm,btl-sctp

And hello_world.c from ompi-tests compiled using: 
  mpicc -g --coverage hello.c -o hello

  George.


On Feb 7, 2014, at 01:11 , Ralph Castain  wrote:

> Oh, should have noted: that's on both trunk and 1.7.4
> 
> On Feb 6, 2014, at 4:10 PM, Ralph Castain  wrote:
> 
>> Works for me on Mac and Linux/Centos6.2 as well
>> 
>> 
>> On Feb 6, 2014, at 4:00 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>>> I'm unable to replicate on Linux/RHEL/64 bit with a trunk build.  How did 
>>> you configure?  Here's my configure:
>>> 
>>> ./configure --prefix=/home/jsquyres/bogus --disable-vt 
>>> --enable-mpirun-prefix-by-default --disable-mpi-fortran
>>> 
>>> Does this happen with every run?
>>> 
>>> 
>>> On Feb 6, 2014, at 6:53 PM, George Bosilca  wrote:
>>> 
 A singleton hello_world assert with the following output:
 
 Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the list 
 0x7f2cd9161ae0
 hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
 orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 
 0xdeafbeedULL) == ((opal_object_t *) (recv))->obj_magic_id' failed.
 [dancer:00698] *** Process received signal ***
 [dancer:00698] Signal: Aborted (6)
 [dancer:00698] Signal code:  (-6)
 [dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
 [dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
 [dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
 [dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
 [dancer:00698] [ 4] 
 /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
 [dancer:00698] [ 5] 
 /home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
 [dancer:00698] [ 6] 
 /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
 [dancer:00698] [ 7] 
 /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
 [dancer:00698] [ 8] 
 /home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
 [dancer:00698] [ 9] 
 /home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
 [dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
 [dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
 [dancer:00698] *** End of error message ***
 
 The same executable run via mpirun with a single process succeeds. This is 
 with trunk; I did not try with the release.
 
 George.



Re: [OMPI devel] singleton appears to be broken

2014-02-06 Thread Ralph Castain
Works for me on Mac and Linux/Centos6.2 as well


On Feb 6, 2014, at 4:00 PM, Jeff Squyres (jsquyres)  wrote:

> I'm unable to replicate on Linux/RHEL/64 bit with a trunk build.  How did you 
> configure?  Here's my configure:
> 
> ./configure --prefix=/home/jsquyres/bogus --disable-vt 
> --enable-mpirun-prefix-by-default --disable-mpi-fortran
> 
> Does this happen with every run?
> 
> 
> On Feb 6, 2014, at 6:53 PM, George Bosilca  wrote:
> 
>> A singleton hello_world assert with the following output:
>> 
>> Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the list 
>> 0x7f2cd9161ae0
>> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
>> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) 
>> == ((opal_object_t *) (recv))->obj_magic_id' failed.
>> [dancer:00698] *** Process received signal ***
>> [dancer:00698] Signal: Aborted (6)
>> [dancer:00698] Signal code:  (-6)
>> [dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
>> [dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
>> [dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
>> [dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
>> [dancer:00698] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
>> [dancer:00698] [ 5] 
>> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
>> [dancer:00698] [ 6] 
>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
>> [dancer:00698] [ 7] 
>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
>> [dancer:00698] [ 8] 
>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
>> [dancer:00698] [ 9] 
>> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
>> [dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
>> [dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
>> [dancer:00698] *** End of error message ***
>> 
>> The same executable run via mpirun with a single process succeeds. This is 
>> with trunk; I did not try with the release.
>> 
>> George.



Re: [OMPI devel] singleton appears to be broken

2014-02-06 Thread Jeff Squyres (jsquyres)
I'm unable to replicate on Linux/RHEL/64 bit with a trunk build.  How did you 
configure?  Here's my configure:

./configure --prefix=/home/jsquyres/bogus --disable-vt 
--enable-mpirun-prefix-by-default --disable-mpi-fortran

Does this happen with every run?


On Feb 6, 2014, at 6:53 PM, George Bosilca  wrote:

> A singleton hello_world assert with the following output:
> 
> Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the list 
> 0x7f2cd9161ae0
> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) 
> == ((opal_object_t *) (recv))->obj_magic_id' failed.
> [dancer:00698] *** Process received signal ***
> [dancer:00698] Signal: Aborted (6)
> [dancer:00698] Signal code:  (-6)
> [dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
> [dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
> [dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
> [dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
> [dancer:00698] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
> [dancer:00698] [ 5] 
> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
> [dancer:00698] [ 6] 
> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
> [dancer:00698] [ 7] 
> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
> [dancer:00698] [ 8] 
> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
> [dancer:00698] [ 9] 
> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
> [dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
> [dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
> [dancer:00698] *** End of error message ***
> 
> The same executable run via mpirun with a single process succeeds. This is 
> with trunk; I did not try with the release.
> 
> George.
-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] singleton appears to be broken

2014-02-06 Thread George Bosilca
A singleton hello_world assert with the following output:

Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the list 
0x7f2cd9161ae0
hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (recv))->obj_magic_id' failed.
[dancer:00698] *** Process received signal ***
[dancer:00698] Signal: Aborted (6)
[dancer:00698] Signal code:  (-6)
[dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
[dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
[dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
[dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
[dancer:00698] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
[dancer:00698] [ 5] 
/home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
[dancer:00698] [ 6] 
/home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
[dancer:00698] [ 7] 
/home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
[dancer:00698] [ 8] 
/home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
[dancer:00698] [ 9] 
/home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
[dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
[dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
[dancer:00698] *** End of error message ***

The same executable run via mpirun with a single process succeeds. This is with 
trunk; I did not try with the release.

George.
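
For reference, hello_world here is the usual minimal MPI program; a stand-in
sketch of that kind of test (the actual hello.c in ompi-tests may differ),
compiled with "mpicc -g hello.c -o hello" and run either directly
("./hello", the singleton case) or via "mpirun -np 1 ./hello":

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);  /* 0 of 1 as a singleton */
    MPI_Finalize();
    return 0;
}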

Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Ralph Castain

On Feb 6, 2014, at 2:16 PM, Adrian Reber  wrote:

> Josh explained to me a few days ago that, after a checkpoint has been
> received, TCP should no longer be used, so as not to lose any messages. The
> communication happens over named pipes, and therefore (I think) OOB
> ft_event() is used to quiet everything besides the pipes. This all seems
> to work, but I was just confused because the ft_event() functions
> in oob/tcp and oob/ud do not seem to contain any functionality.
> 
> So do I try to fix the ft_event() function in oob/base/ to call the
> registered ft_event() function (which does nothing), or do I just remove
> the call to orte_oob.ft_event()?

Sounds like you'll need to tell the OOB components to stop processing messages, 
so that will require that you insert an event into the system. You have to 
account for two things:

(a) the OOB base and OOB components are operating on the orte_event_base, but

(b) each OOB component can have multiple active modules (one per NIC) that are 
operating on their own event base/thread.

So you have to start by pushing an event that calls the OOB base, which then 
loops across the components calling their ft_event interface. Each component 
would then have to create an event for each active module, inserting that event 
into the module's event base/thread. When activated, each module would have to 
shut down its message engine, and activate another event to notify its component 
that all is quiet.

Once a component finds out that all its modules are quiet, it would then have 
to activate an event to the OOB base. Once the OOB base sees all components 
report quiet, then it would have to activate an event to take you to the next 
step in your process.

In other words, you need to turn the quieting process into its own set of 
states and run it through the state machine. This is the only way to guarantee 
that you'll keep things orderly, and is the major change needed in the C/R 
procedure as it flows thru ORTE. You can't just progress thru a set of function 
calls as you'll inevitably run into a roadblock requiring that you wait for an 
event-driven process to complete.

HTH
Ralph
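
A minimal sketch of one stage of the cascade Ralph describes, using plain
libevent 2.x (the event library ORTE embeds). Everything here is illustrative
rather than actual ORTE code: the point is that each module reports "quiet" by
activating an event in its component's event base, and the component only
advances once every report has drained:

#include <event2/event.h>
#include <stdio.h>

struct component_ctx {
    struct event_base *comp_base;   /* the component's own base/thread */
    int modules_pending;            /* modules that have not yet quieted */
};

/* Runs in the component's base each time a module reports quiet, so the
   counter needs no locking. */
static void module_report_cb(evutil_socket_t fd, short what, void *arg)
{
    struct component_ctx *ctx = arg;
    if (0 == --ctx->modules_pending) {
        printf("all modules quiet; activate the event toward the OOB base\n");
        event_base_loopbreak(ctx->comp_base);
    }
}

int main(void)
{
    struct component_ctx ctx = { event_base_new(), 2 };
    /* one report event per module; a real module would activate its own
       event once its message engine has drained */
    struct event *m1 = event_new(ctx.comp_base, -1, 0, module_report_cb, &ctx);
    struct event *m2 = event_new(ctx.comp_base, -1, 0, module_report_cb, &ctx);
    event_active(m1, 0, 0);
    event_active(m2, 0, 0);
    event_base_dispatch(ctx.comp_base);  /* runs both callbacks, then breaks */
    event_free(m1); event_free(m2);
    event_base_free(ctx.comp_base);
    return 0;
}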

> 
> On Thu, Feb 06, 2014 at 10:49:25AM -0800, Ralph Castain wrote:
>> The only reason I can think of for an OOB ft-event would be to tell the OOB 
>> to stop sending any messages. You would need to push that into the event 
>> library and use a callback event to let you know when it was done.
>> 
>> Of course, once you did that, the OOB would no longer be available to, for 
>> example, tell the local daemon that the app is ready for checkpoint :-)
>> 
>> Afraid I'll have to defer to Josh H for any further guidance.
>> 
>> 
>> On Feb 6, 2014, at 8:15 AM, Adrian Reber  wrote:
>> 
>>> When I initially made the C/R code compile again, I made the following
>>> change:
>>> 
>>> diff --git a/orte/mca/rml/oob/rml_oob_component.c 
>>> b/orte/mca/rml/oob/rml_oob_component.c
>>> index f0b22fc..90ed086 100644
>>> --- a/orte/mca/rml/oob/rml_oob_component.c
>>> +++ b/orte/mca/rml/oob/rml_oob_component.c
>>> @@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
>>>;
>>>}
>>> 
>>> -if( ORTE_SUCCESS != 
>>> -(ret = orte_oob.ft_event(state)) ) {
>>> +if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
>>>ORTE_ERROR_LOG(ret);
>>>exit_status = ret;
>>>goto cleanup;
>>> 
>>> 
>>> 
>>> This is, of course, wrong. Now the function calls itself in a loop until
>>> it crashes. Looking at orte/mca/oob there is still a ft_event()
>>> function, but it is disabled using "#if 0". Looking at other functions
>>> it seems I would need to create something like
>>> 
>>> #define ORTE_OOB_FT_EVENT(m)
>>> 
>>> Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
>>> in some places but it never seems to have any real functionality. Is
>>> ft_event() actually needed there?
>>> 
>>> Adrian



Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
Josh explained to me a few days ago that, after a checkpoint has been
received, TCP should no longer be used, so as not to lose any messages. The
communication happens over named pipes, and therefore (I think) OOB
ft_event() is used to quiet everything besides the pipes. This all seems
to work, but I was just confused because the ft_event() functions
in oob/tcp and oob/ud do not seem to contain any functionality.

So do I try to fix the ft_event() function in oob/base/ to call the
registered ft_event() function (which does nothing), or do I just remove
the call to orte_oob.ft_event()?
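
For concreteness, a generic sketch of the base-level delegation being asked
about, with hypothetical names rather than the real oob/base interfaces: the
base walks the active modules and dispatches through each module's function
table, tolerating hooks that are absent or do nothing:

#include <stddef.h>

typedef struct {
    int (*ft_event)(int state);     /* may be NULL if unimplemented */
} oob_module_t;

static int oob_base_ft_event(oob_module_t **modules, size_t n, int state)
{
    for (size_t i = 0; i < n; ++i) {
        if (NULL != modules[i]->ft_event) {
            int rc = modules[i]->ft_event(state);
            if (0 != rc) {
                return rc;          /* propagate the first failure */
            }
        }
    }
    return 0;                       /* success even if every hook was a no-op */
}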

On Thu, Feb 06, 2014 at 10:49:25AM -0800, Ralph Castain wrote:
> The only reason I can think of for an OOB ft-event would be to tell the OOB 
> to stop sending any messages. You would need to push that into the event 
> library and use a callback event to let you know when it was done.
> 
> Of course, once you did that, the OOB would no longer be available to, for 
> example, tell the local daemon that the app is ready for checkpoint :-)
> 
> Afraid I'll have to defer to Josh H for any further guidance.
> 
> 
> On Feb 6, 2014, at 8:15 AM, Adrian Reber  wrote:
> 
> > When I initially made the C/R code compile again, I made the following
> > change:
> > 
> > diff --git a/orte/mca/rml/oob/rml_oob_component.c 
> > b/orte/mca/rml/oob/rml_oob_component.c
> > index f0b22fc..90ed086 100644
> > --- a/orte/mca/rml/oob/rml_oob_component.c
> > +++ b/orte/mca/rml/oob/rml_oob_component.c
> > @@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
> > ;
> > }
> > 
> > -if( ORTE_SUCCESS != 
> > -(ret = orte_oob.ft_event(state)) ) {
> > +if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
> > ORTE_ERROR_LOG(ret);
> > exit_status = ret;
> > goto cleanup;
> > 
> > 
> > 
> > This is, of course, wrong. Now the function calls itself in a loop until
> > it crashes. Looking at orte/mca/oob there is still a ft_event()
> > function, but it is disabled using "#if 0". Looking at other functions
> > it seems I would need to create something like
> > 
> > #define ORTE_OOB_FT_EVENT(m)
> > 
> > Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
> > in some places but it never seems to have any real functionality. Is
> > ft_event() actually needed there?
> > 
> > Adrian


Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Ralph Castain
The only reason I can think of for an OOB ft-event would be to tell the OOB to 
stop sending any messages. You would need to push that into the event library 
and use a callback event to let you know when it was done.

Of course, once you did that, the OOB would no longer be available to, for 
example, tell the local daemon that the app is ready for checkpoint :-)

Afraid I'll have to defer to Josh H for any further guidance.


On Feb 6, 2014, at 8:15 AM, Adrian Reber  wrote:

> When I initially made the C/R code compile again, I made the following
> change:
> 
> diff --git a/orte/mca/rml/oob/rml_oob_component.c 
> b/orte/mca/rml/oob/rml_oob_component.c
> index f0b22fc..90ed086 100644
> --- a/orte/mca/rml/oob/rml_oob_component.c
> +++ b/orte/mca/rml/oob/rml_oob_component.c
> @@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
> ;
> }
> 
> -if( ORTE_SUCCESS != 
> -(ret = orte_oob.ft_event(state)) ) {
> +if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
> ORTE_ERROR_LOG(ret);
> exit_status = ret;
> goto cleanup;
> 
> 
> 
> This is, of course, wrong. Now the function calls itself in a loop until
> it crashes. Looking at orte/mca/oob there is still a ft_event()
> function, but it is disabled using "#if 0". Looking at other functions
> it seems I would need to create something like
> 
> #define ORTE_OOB_FT_EVENT(m)
> 
> Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
> in some places but it never seems to have any real functionality. Is
> ft_event() actually needed there?
> 
>   Adrian



Re: [OMPI devel] mpirun oddity w/ PBS on an SGI UV

2014-02-06 Thread Ralph Castain
crud - sorry about that! This old man can't even remember his own param names... sigh

Thanks for checking it
Ralph

On Feb 6, 2014, at 9:47 AM, Paul Hargrove  wrote:

> Ralph,
> 
> It worked on my second try, when I spelled it "ras_tm_smp" :-)
> 
> Thanks,
> -Paul
> 
> 
> 
> On Wed, Feb 5, 2014 at 11:59 AM, Paul Hargrove  wrote:
> Ralph,
> 
> I will try to build tonight's trunk tarball and then test a run tomorrow.
> Please ping me if I don't post my results by Thu evening (PST).
> 
> -Paul
> 
> 
> On Wed, Feb 5, 2014 at 7:52 AM, Ralph Castain  wrote:
> I added this to the trunk in r30568 - a new MCA param "ras_tm_smp_mode" will 
> tell us to use the PBS_PPN envar to get the number of slots allocated per 
> node. We then just use the PBS_Nodefile to read the names of the nodes, which 
> I expect will be one for each partition.
> 
> Let me know if this solves the problem - I scheduled it for 1.7.5
> 
> Thanks!
> Ralph
> 
> On Jan 31, 2014, at 4:33 PM, Ralph Castain  wrote:
> 
>> No worries about PBS itself - better to allow you to just run this way. Easy 
>> to add a switch for this purpose.
>> 
>> For now, just add --oversubscribe to the command line
>> 
>> On Jan 31, 2014, at 3:32 PM, Paul Hargrove  wrote:
>> 
>>> Ralph,
>>> 
>>> The mods may have been done by the staff at PSC rather than by SGI.
>>> Note the "_psc" suffix:
>>> $ which pbsnodes
>>> /usr/local/packages/torque/2.3.13_psc/bin/pbsnodes
>>> 
>>> Their sources appear to be available in the f/s too.
>>> Using "tar -d" to compare that to the pristine torque-2.3.13 tarball show 
>>> the following files were modified:
>>> torque-2.3.13/src/resmom/job_func.c
>>> torque-2.3.13/src/resmom/mom_main.c
>>> torque-2.3.13/src/resmom/requests.c
>>> torque-2.3.13/src/resmom/linux/mom_mach.h
>>> torque-2.3.13/src/resmom/linux/mom_mach.c
>>> torque-2.3.13/src/resmom/linux/cpuset.c
>>> torque-2.3.13/src/resmom/start_exec.c
>>> torque-2.3.13/src/scheduler.tcl/pbs_sched.c
>>> torque-2.3.13/src/cmds/qalter.c
>>> torque-2.3.13/src/cmds/qsub.c
>>> torque-2.3.13/src/cmds/qstat.c
>>> torque-2.3.13/src/server/resc_def_all.c
>>> torque-2.3.13/src/server/req_quejob.c
>>> torque-2.3.13/torque.spec
>>> 
>>> I'll provide what assistance I can in testing.
>>> That includes providing (off-list) the actual diffs of PSC's torque against 
>>> the tarball, if desired.
>>> 
>>> In the meantime, since -npernode didn't work, what is the right way to say:
>>>   "I have 1 slot but I want to overcommit and run 16 mpi ranks".
>>> 
>>> -Paul
>>> 
>>> 
>>> On Fri, Jan 31, 2014 at 3:20 PM, Ralph Castain  wrote:
>>> 
>>> On Jan 31, 2014, at 3:13 PM, Paul Hargrove  wrote:
>>> 
 Ralph,
 
 As I said this is NOT a cluster - it is a 4k-core shared memory machine.
>>> 
>>> I understood - that wasn't the nature of my question
>>> 
 TORQUE is allocating cpus (time-shared mode, IIRC), not nodes.
 So, there is always exactly one line in $PBS_NODESFILE.
>>> 
>>> Interesting - because that isn't the standard way Torque behaves. It is 
>>> supposed to put one line/slot in the nodefile, each line containing the 
>>> name of the node. Clearly, SGI has reconfigured Torque to do something 
>>> different.
>>> 
 
 The system runs as 2 partitions of 2k-cores each.
 So, the contents of $PBS_NODESFILE have exactly 2 possible values, each 1 
 line.
 
 The values of PBS_PPN and PBS_NCPUS both reflect the size of the 
 allocation.
 
 At a minimum, shouldn't Open MPI be multiplying the lines in 
 $PBS_NODESFILE by the value of $PBS_PPN?
>>> 
>>> No, as above, that isn't the way Torque generally behaves. It would appear 
>>> that we need a "switch" here to handle SGI's modifications. Should be 
>>> doable - just haven't had anyone using an SGI machine before :-)
>>> 
 
 Additionally, when I try "mpirun -npernode 16 ./ring_c" I am still told 
 there are not enough slots.
 Shouldn't that be working with 1 line in $PBS_NODESFILE?
 
 -Paul
 
 
 
 
 On Fri, Jan 31, 2014 at 2:47 PM, Ralph Castain  wrote:
 We read the nodes from the PBS_NODEFILE, Paul - can you pass that along?
 
 On Jan 31, 2014, at 2:33 PM, Paul Hargrove  wrote:
 
> I am trying to test the trunk on an SGI UV (to validate Nathan's port of 
> btl:vader to SGI's variant of xpmem).
> 
> At configure time, PBS's TM support was correctly located.
> 
> My PBS batch script includes
>   #PBS -l ncpus=16
> because that is what this installation requires (not nodes, mppnodes, or 
> anything like that).
> One is allocating cpus on a large shared-memory machine, not a set of 
> nodes in a cluster.
> 
> However, this appears to be causing mpirun to think I have just 1 slot:
> 
> + mpirun -np 2 ./ring_c
> 

Re: [OMPI devel] mpirun oddity w/ PBS on an SGI UV

2014-02-06 Thread Paul Hargrove
Ralph,

It worked on my second try, when I spelled it "ras_tm_smp" :-)

Thanks,
-Paul
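
For context, a rough standalone sketch of the slot-counting difference this
thread settled on (hypothetical code, not the actual ras_tm source): classic
Torque writes one hostname line per slot into $PBS_NODEFILE, whereas on this
SGI/PSC setup each line names a whole partition and $PBS_PPN supplies the
per-node slot count:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *path = getenv("PBS_NODEFILE");
    const char *ppn  = getenv("PBS_PPN");
    /* stand-in for the new ras_tm_smp MCA param: with it, PBS_PPN is the
       slot count; without it, each nodefile line counts as one slot */
    int slots_per_line = (NULL != ppn) ? atoi(ppn) : 1;
    char line[256];
    int total = 0;

    FILE *f = fopen(path ? path : "/dev/null", "r");
    if (NULL == f) return 1;
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';
        printf("node %s: %d slot(s)\n", line, slots_per_line);
        total += slots_per_line;
    }
    fclose(f);
    printf("total slots: %d\n", total);
    return 0;
}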



On Wed, Feb 5, 2014 at 11:59 AM, Paul Hargrove  wrote:

> Ralph,
>
> I will try to build tonight's trunk tarball and then test a run tomorrow.
> Please ping me if I don't post my results by Thu evening (PST).
>
> -Paul
>
>
> On Wed, Feb 5, 2014 at 7:52 AM, Ralph Castain  wrote:
>
>> I added this to the trunk in r30568 - a new MCA param "ras_tm_smp_mode"
>> will tell us to use the PBS_PPN envar to get the number of slots allocated
>> per node. We then just use the PBS_Nodefile to read the names of the nodes,
>> which I expect will be one for each partition.
>>
>> Let me know if this solves the problem - I scheduled it for 1.7.5
>>
>> Thanks!
>> Ralph
>>
>> On Jan 31, 2014, at 4:33 PM, Ralph Castain  wrote:
>>
>> No worries about PBS itself - better to allow you to just run this way.
>> Easy to add a switch for this purpose.
>>
>> For now, just add --oversubscribe to the command line
>>
>> On Jan 31, 2014, at 3:32 PM, Paul Hargrove  wrote:
>>
>> Ralph,
>>
>> The mods may have been done by the staff at PSC rather than by SGI.
>> Note the "_psc" suffix:
>> $ which pbsnodes
>> /usr/local/packages/torque/2.3.13_psc/bin/pbsnodes
>>
>> Their sources appear to be available in the f/s too.
>> Using "tar -d" to compare that to the pristine torque-2.3.13 tarball show
>> the following files were modified:
>> torque-2.3.13/src/resmom/job_func.c
>> torque-2.3.13/src/resmom/mom_main.c
>> torque-2.3.13/src/resmom/requests.c
>> torque-2.3.13/src/resmom/linux/mom_mach.h
>> torque-2.3.13/src/resmom/linux/mom_mach.c
>> torque-2.3.13/src/resmom/linux/cpuset.c
>> torque-2.3.13/src/resmom/start_exec.c
>> torque-2.3.13/src/scheduler.tcl/pbs_sched.c
>> torque-2.3.13/src/cmds/qalter.c
>> torque-2.3.13/src/cmds/qsub.c
>> torque-2.3.13/src/cmds/qstat.c
>> torque-2.3.13/src/server/resc_def_all.c
>> torque-2.3.13/src/server/req_quejob.c
>> torque-2.3.13/torque.spec
>>
>> I'll provide what assistance I can in testing.
>> That includes providing (off-list) the actual diffs of PSC's torque
>> against the tarball, if desired.
>>
>> In the meantime, since -npernode didn't work, what is the right way to
>> say:
>>"I have 1 slot but I want to overcommit and run 16 mpi ranks".
>>
>> -Paul
>>
>>
>> On Fri, Jan 31, 2014 at 3:20 PM, Ralph Castain  wrote:
>>
>>>
>>> On Jan 31, 2014, at 3:13 PM, Paul Hargrove  wrote:
>>>
>>> Ralph,
>>>
>>> As I said this is NOT a cluster - it is a 4k-core shared memory machine.
>>>
>>>
>>> I understood - that wasn't the nature of my question
>>>
>>> TORQUE is allocating cpus (time-shared mode, IIRC), not nodes.
>>> So, there is always exactly one line in $PBS_NODESFILE.
>>>
>>>
>>> Interesting - because that isn't the standard way Torque behaves. It is
>>> supposed to put one line/slot in the nodefile, each line containing the
>>> name of the node. Clearly, SGI has reconfigured Torque to do something
>>> different.
>>>
>>>
>>> The system runs as 2 partitions of 2k-cores each.
>>> So, the contents of $PBS_NODESFILE have exactly 2 possible values, each 1
>>> line.
>>>
>>> The values of PBS_PPN and PBS_NCPUS both reflect the size of the
>>> allocation.
>>>
>>> At a minimum, shouldn't Open MPI be multiplying the lines in
>>> $PBS_NODESFILE by the value of $PBS_PPN?
>>>
>>>
>>> No, as above, that isn't the way Torque generally behaves. It would
>>> appear that we need a "switch" here to handle SGI's modifications. Should
>>> be doable - just haven't had anyone using an SGI machine before :-)
>>>
>>>
>>> Additionally, when I try "mpirun -npernode 16 ./ring_c" I am still told
>>> there are not enough slots.
>>> Shouldn't that be working with 1 line in $PBS_NODESFILE?
>>>
>>> -Paul
>>>
>>>
>>>
>>>
>>> On Fri, Jan 31, 2014 at 2:47 PM, Ralph Castain  wrote:
>>>
 We read the nodes from the PBS_NODEFILE, Paul - can you pass that along?

 On Jan 31, 2014, at 2:33 PM, Paul Hargrove  wrote:

 I am trying to test the trunk on an SGI UV (to validate Nathan's port
 of btl:vader to SGI's variant of xpmem).

 At configure time, PBS's TM support was correctly located.

 My PBS batch script includes
   #PBS -l ncpus=16
 because that is what this installation requires (not nodes, mppnodes,
 or anything like that).
 One is allocating cpus on a large shared-memory machine, not a set of
 nodes in a cluster.

 However, this appears to be causing mpirun to think I have just 1 slot:

 + mpirun -np 2 ./ring_c

 --
 There are not enough slots available in the system to satisfy the 2
 slots
 that were requested by the application:
   ./ring_c

 Either request fewer slots for your application, or make more slots
 available

Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

2014-02-06 Thread Ralph Castain
Kewl - I'll add it in the next wave. Meantime, we can revert this one

Thanks!
Ralph

On Feb 6, 2014, at 9:18 AM, Joshua Ladd  wrote:

> It's been CMRed, but scheduled for 1.7.5
>  
> https://svn.open-mpi.org/trac/ompi/ticket/4185
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Mike Dubman
> Sent: Thursday, February 06, 2014 12:17 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime
>  
> It seems that similar code is not in the v1.7 tree.
>  
> 
> On Thu, Feb 6, 2014 at 2:40 PM, George Bosilca  wrote:
> This commit is unnecessary. The call to del_procs is already there, a few 
> lines above your own patch. It was introduced on Jan 26 2014 with commit 
> https://svn.open-mpi.org/trac/ompi/changeset/30430.
> 
>   George.
> 
> 
> 
> On Feb 6, 2014, at 09:38 , svn-commit-mai...@open-mpi.org wrote:
> 
> > Author: miked (Mike Dubman)
> > Date: 2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)
> > New Revision: 30571
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/30571
> >
> > Log:
> > OMPI: add call to del_procs
> >
> > fixed by AlexM, reviewed by miked
> > cmr=v1.7.5:reviewer=ompi-rm1.7
> >
> > Text files modified:
> >   trunk/ompi/runtime/ompi_mpi_finalize.c |15 +++
> >   1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > Modified: trunk/ompi/runtime/ompi_mpi_finalize.c
> > ==
> > --- trunk/ompi/runtime/ompi_mpi_finalize.c    Wed Feb  5 17:49:26 2014    (r30570)
> > +++ trunk/ompi/runtime/ompi_mpi_finalize.c    2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)    (r30571)
> > @@ -94,6 +94,9 @@
> > opal_list_item_t *item;
> > struct timeval ompistart, ompistop;
> > ompi_rte_collective_t *coll;
> > +ompi_proc_t** procs;
> > +size_t nprocs;
> > +
> >
> > /* Be a bit social if an erroneous program calls MPI_FINALIZE in
> >two different threads, otherwise we may deadlock in
> > @@ -150,6 +153,18 @@
> >MPI lifetime, to get better latency when not using TCP */
> > opal_progress_event_users_increment();
> >
> > +
> > +if (NULL == (procs = ompi_proc_world(&nprocs))) {
> > +return OMPI_ERROR;
> > +}
> > +
> > +if (OMPI_SUCCESS != (ret = MCA_PML_CALL(del_procs(procs, nprocs)))) {
> > +free(procs);
> > +return ret;
> > +}
> > +free(procs);
> > +
> > +
> > /* check to see if we want timing information */
> > if (ompi_enable_timing != 0 && 0 == OMPI_PROC_MY_NAME->vpid) {
> > gettimeofday(&ompistart, NULL);



Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

2014-02-06 Thread Joshua Ladd
It's been CMRed, but scheduled for 1.7.5

https://svn.open-mpi.org/trac/ompi/ticket/4185

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Mike Dubman
Sent: Thursday, February 06, 2014 12:17 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

It seems that similar code is not in the v1.7 tree.

On Thu, Feb 6, 2014 at 2:40 PM, George Bosilca wrote:
This commit is unnecessary. The call to del_procs is already there, a few lines 
above your own patch. It was introduced on Jan 26 2014 with commit 
https://svn.open-mpi.org/trac/ompi/changeset/30430.

  George.



On Feb 6, 2014, at 09:38 , svn-commit-mai...@open-mpi.org wrote:

> Author: miked (Mike Dubman)
> Date: 2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)
> New Revision: 30571
> URL: https://svn.open-mpi.org/trac/ompi/changeset/30571
>
> Log:
> OMPI: add call to del_procs
>
> fixed by AlexM, reviewed by miked
> cmr=v1.7.5:reviewer=ompi-rm1.7
>
> Text files modified:
>   trunk/ompi/runtime/ompi_mpi_finalize.c |15 +++
>   1 files changed, 15 insertions(+), 0 deletions(-)
>
> Modified: trunk/ompi/runtime/ompi_mpi_finalize.c
> ==
> --- trunk/ompi/runtime/ompi_mpi_finalize.c    Wed Feb  5 17:49:26 2014    (r30570)
> +++ trunk/ompi/runtime/ompi_mpi_finalize.c    2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)    (r30571)
> @@ -94,6 +94,9 @@
> opal_list_item_t *item;
> struct timeval ompistart, ompistop;
> ompi_rte_collective_t *coll;
> +ompi_proc_t** procs;
> +size_t nprocs;
> +
>
> /* Be a bit social if an erroneous program calls MPI_FINALIZE in
>two different threads, otherwise we may deadlock in
> @@ -150,6 +153,18 @@
>MPI lifetime, to get better latency when not using TCP */
> opal_progress_event_users_increment();
>
> +
> +if (NULL == (procs = ompi_proc_world(&nprocs))) {
> +return OMPI_ERROR;
> +}
> +
> +if (OMPI_SUCCESS != (ret = MCA_PML_CALL(del_procs(procs, nprocs)))) {
> +free(procs);
> +return ret;
> +}
> +free(procs);
> +
> +
> /* check to see if we want timing information */
> if (ompi_enable_timing != 0 && 0 == OMPI_PROC_MY_NAME->vpid) {
> gettimeofday(&ompistart, NULL);



Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

2014-02-06 Thread Ralph Castain
Okay, so let's revert this commit and instead CMR over the one that includes 
the required code.

On Feb 6, 2014, at 9:16 AM, Mike Dubman  wrote:

> It seems that similar code is not in the v1.7 tree.
> 
> 
> On Thu, Feb 6, 2014 at 2:40 PM, George Bosilca  wrote:
> This commit is unnecessary. The call to del_procs is already there, a few 
> lines above your own patch. It was introduced on Jan 26 2014 with commit 
> https://svn.open-mpi.org/trac/ompi/changeset/30430.
> 
>   George.
> 
> 
> 
> On Feb 6, 2014, at 09:38 , svn-commit-mai...@open-mpi.org wrote:
> 
> > Author: miked (Mike Dubman)
> > Date: 2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)
> > New Revision: 30571
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/30571
> >
> > Log:
> > OMPI: add call to del_procs
> >
> > fixed by AlexM, reviewed by miked
> > cmr=v1.7.5:reviewer=ompi-rm1.7
> >
> > Text files modified:
> >   trunk/ompi/runtime/ompi_mpi_finalize.c |15 +++
> >   1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > Modified: trunk/ompi/runtime/ompi_mpi_finalize.c
> > ==
> > --- trunk/ompi/runtime/ompi_mpi_finalize.c    Wed Feb  5 17:49:26 2014    (r30570)
> > +++ trunk/ompi/runtime/ompi_mpi_finalize.c    2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)    (r30571)
> > @@ -94,6 +94,9 @@
> > opal_list_item_t *item;
> > struct timeval ompistart, ompistop;
> > ompi_rte_collective_t *coll;
> > +ompi_proc_t** procs;
> > +size_t nprocs;
> > +
> >
> > /* Be a bit social if an erroneous program calls MPI_FINALIZE in
> >two different threads, otherwise we may deadlock in
> > @@ -150,6 +153,18 @@
> >MPI lifetime, to get better latency when not using TCP */
> > opal_progress_event_users_increment();
> >
> > +
> > +if (NULL == (procs = ompi_proc_world(&nprocs))) {
> > +return OMPI_ERROR;
> > +}
> > +
> > +if (OMPI_SUCCESS != (ret = MCA_PML_CALL(del_procs(procs, nprocs)))) {
> > +free(procs);
> > +return ret;
> > +}
> > +free(procs);
> > +
> > +
> > /* check to see if we want timing information */
> > if (ompi_enable_timing != 0 && 0 == OMPI_PROC_MY_NAME->vpid) {
> > gettimeofday(&ompistart, NULL);



Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

2014-02-06 Thread Mike Dubman
It seems that similar code is not in the v1.7 tree.


On Thu, Feb 6, 2014 at 2:40 PM, George Bosilca  wrote:

> This commit is unnecessary. The call to del_procs is already there, a few
> lines above your own patch. It was introduced on Jan 26 2014 with commit
> https://svn.open-mpi.org/trac/ompi/changeset/30430.
>
>   George.
>
>
>
> On Feb 6, 2014, at 09:38 , svn-commit-mai...@open-mpi.org wrote:
>
> > Author: miked (Mike Dubman)
> > Date: 2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)
> > New Revision: 30571
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/30571
> >
> > Log:
> > OMPI: add call to del_procs
> >
> > fixed by AlexM, reviewed by miked
> > cmr=v1.7.5:reviewer=ompi-rm1.7
> >
> > Text files modified:
> >   trunk/ompi/runtime/ompi_mpi_finalize.c |15 +++
> >   1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > Modified: trunk/ompi/runtime/ompi_mpi_finalize.c
> >
> ==
> > --- trunk/ompi/runtime/ompi_mpi_finalize.c    Wed Feb  5 17:49:26 2014    (r30570)
> > +++ trunk/ompi/runtime/ompi_mpi_finalize.c    2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)    (r30571)
> > @@ -94,6 +94,9 @@
> > opal_list_item_t *item;
> > struct timeval ompistart, ompistop;
> > ompi_rte_collective_t *coll;
> > +ompi_proc_t** procs;
> > +size_t nprocs;
> > +
> >
> > /* Be a bit social if an erroneous program calls MPI_FINALIZE in
> >two different threads, otherwise we may deadlock in
> > @@ -150,6 +153,18 @@
> >MPI lifetime, to get better latency when not using TCP */
> > opal_progress_event_users_increment();
> >
> > +
> > +if (NULL == (procs = ompi_proc_world(&nprocs))) {
> > +return OMPI_ERROR;
> > +}
> > +
> > +if (OMPI_SUCCESS != (ret = MCA_PML_CALL(del_procs(procs, nprocs)))) {
> > +free(procs);
> > +return ret;
> > +}
> > +free(procs);
> > +
> > +
> > /* check to see if we want timing information */
> > if (ompi_enable_timing != 0 && 0 == OMPI_PROC_MY_NAME->vpid) {
> > gettimeofday(&ompistart, NULL);


Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

2014-02-06 Thread Mike Dubman
Thanks
we ported it from the internal 1.7.x tree, where I think it is not present.
we will check it


On Thu, Feb 6, 2014 at 2:40 PM, George Bosilca  wrote:

> This commit is unnecessary. The call to del_procs is already there, a few
> lines above your own patch. It was introduced on Jan 26 2014 with commit
> https://svn.open-mpi.org/trac/ompi/changeset/30430.
>
>   George.
>
>
>
> On Feb 6, 2014, at 09:38 , svn-commit-mai...@open-mpi.org wrote:
>
> > Author: miked (Mike Dubman)
> > Date: 2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)
> > New Revision: 30571
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/30571
> >
> > Log:
> > OMPI: add call to del_procs
> >
> > fixed by AlexM, reviewed by miked
> > cmr=v1.7.5:reviewer=ompi-rm1.7
> >
> > Text files modified:
> >   trunk/ompi/runtime/ompi_mpi_finalize.c |15 +++
> >   1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > Modified: trunk/ompi/runtime/ompi_mpi_finalize.c
> >
> ==
> > --- trunk/ompi/runtime/ompi_mpi_finalize.c    Wed Feb  5 17:49:26 2014    (r30570)
> > +++ trunk/ompi/runtime/ompi_mpi_finalize.c    2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)    (r30571)
> > @@ -94,6 +94,9 @@
> > opal_list_item_t *item;
> > struct timeval ompistart, ompistop;
> > ompi_rte_collective_t *coll;
> > +ompi_proc_t** procs;
> > +size_t nprocs;
> > +
> >
> > /* Be a bit social if an erroneous program calls MPI_FINALIZE in
> >two different threads, otherwise we may deadlock in
> > @@ -150,6 +153,18 @@
> >MPI lifetime, to get better latency when not using TCP */
> > opal_progress_event_users_increment();
> >
> > +
> > +if (NULL == (procs = ompi_proc_world(&nprocs))) {
> > +return OMPI_ERROR;
> > +}
> > +
> > +if (OMPI_SUCCESS != (ret = MCA_PML_CALL(del_procs(procs, nprocs)))) {
> > +free(procs);
> > +return ret;
> > +}
> > +free(procs);
> > +
> > +
> > /* check to see if we want timing information */
> > if (ompi_enable_timing != 0 && 0 == OMPI_PROC_MY_NAME->vpid) {
> > gettimeofday(&ompistart, NULL);


[OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
When I initially made the C/R code compile again, I made the following
change:

diff --git a/orte/mca/rml/oob/rml_oob_component.c 
b/orte/mca/rml/oob/rml_oob_component.c
index f0b22fc..90ed086 100644
--- a/orte/mca/rml/oob/rml_oob_component.c
+++ b/orte/mca/rml/oob/rml_oob_component.c
@@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
 ;
 }

-if( ORTE_SUCCESS != 
-(ret = orte_oob.ft_event(state)) ) {
+if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
 ORTE_ERROR_LOG(ret);
 exit_status = ret;
 goto cleanup;



This is, of course, wrong. Now the function calls itself in a loop until
it crashes. Looking at orte/mca/oob there is still a ft_event()
function, but it is disabled using "#if 0". Looking at other functions
it seems I would need to create something like

#define ORTE_OOB_FT_EVENT(m)

Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
in some places but it never seems to have any real functionality. Is
ft_event() actually needed there?

Adrian
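
The failure mode of the patch above, in miniature (illustrative names only):
the original line dispatched downward through the framework's function
pointer, orte_oob.ft_event, while the patched line calls the enclosing
function itself and therefore recurses until the stack overflows:

static int selected_oob_ft_event(int state) { (void)state; return 0; }

int rml_oob_ft_event(int state)
{
    /* the patch effectively did:
     *     return rml_oob_ft_event(state);    <- unbounded self-recursion
     * whereas the original delegated to the selected OOB component: */
    return selected_oob_ft_event(state);      /* stand-in for orte_oob.ft_event */
}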


Re: [OMPI devel] [OMPI svn] svn:open-mpi r30571 - trunk/ompi/runtime

2014-02-06 Thread George Bosilca
This commit is unnecessary. The call to del_procs is already there, a few lines 
above your own patch. It was introduced on Jan 26 2014 with commit 
https://svn.open-mpi.org/trac/ompi/changeset/30430.

  George.



On Feb 6, 2014, at 09:38 , svn-commit-mai...@open-mpi.org wrote:

> Author: miked (Mike Dubman)
> Date: 2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)
> New Revision: 30571
> URL: https://svn.open-mpi.org/trac/ompi/changeset/30571
> 
> Log:
> OMPI: add call to del_procs
> 
> fixed by AlexM, reviewed by miked
> cmr=v1.7.5:reviewer=ompi-rm1.7
> 
> Text files modified: 
>   trunk/ompi/runtime/ompi_mpi_finalize.c |15 +++  
>
>   1 files changed, 15 insertions(+), 0 deletions(-)
> 
> Modified: trunk/ompi/runtime/ompi_mpi_finalize.c
> ==
> --- trunk/ompi/runtime/ompi_mpi_finalize.c    Wed Feb  5 17:49:26 2014    (r30570)
> +++ trunk/ompi/runtime/ompi_mpi_finalize.c    2014-02-06 03:38:32 EST (Thu, 06 Feb 2014)    (r30571)
> @@ -94,6 +94,9 @@
> opal_list_item_t *item;
> struct timeval ompistart, ompistop;
> ompi_rte_collective_t *coll;
> +ompi_proc_t** procs;
> +size_t nprocs;
> +
> 
> /* Be a bit social if an erroneous program calls MPI_FINALIZE in
>two different threads, otherwise we may deadlock in
> @@ -150,6 +153,18 @@
>MPI lifetime, to get better latency when not using TCP */
> opal_progress_event_users_increment();
> 
> +
> +if (NULL == (procs = ompi_proc_world(&nprocs))) {
> +return OMPI_ERROR;
> +}
> +
> +if (OMPI_SUCCESS != (ret = MCA_PML_CALL(del_procs(procs, nprocs)))) {
> +free(procs);
> +return ret;
> +}
> +free(procs);
> +
> +
> /* check to see if we want timing information */
> if (ompi_enable_timing != 0 && 0 == OMPI_PROC_MY_NAME->vpid) {
> gettimeofday(&ompistart, NULL);
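
As reconstructed above, the committed pattern is: fetch the world proc array
(ompi_proc_world also returns the count through its pointer argument), hand it
to the PML's del_procs, then free the array. A schematic standalone rendering
with stub types, not the real OMPI interfaces:

#include <stdlib.h>

typedef struct { int id; } proc_t;

/* stand-in for ompi_proc_world(): returns a malloc'd array plus its size */
static proc_t **proc_world(size_t *nprocs)
{
    *nprocs = 2;
    return calloc(*nprocs, sizeof(proc_t *));
}

/* stand-in for MCA_PML_CALL(del_procs(...)) */
static int pml_del_procs(proc_t **procs, size_t n) { (void)procs; (void)n; return 0; }

int main(void)
{
    size_t nprocs;
    proc_t **procs;
    int ret;

    if (NULL == (procs = proc_world(&nprocs))) {
        return 1;                  /* OMPI_ERROR in the real code */
    }
    if (0 != (ret = pml_del_procs(procs, nprocs))) {
        free(procs);
        return ret;
    }
    free(procs);                   /* the caller owns the array, not the procs */
    return 0;
}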