Re: [OMPI devel] Open MPI (not quite) on Cray XC30

2013-01-22 Thread Ralph Castain
I went ahead and removed the duplicate code, so this should work now. The 
problem is that we re-factored the ompi_info/orte-info code, but didn't 
complete the job - specifically, the orte-info tool didn't get updated. It's 
about to get revamped yet again when the ompi-rte branch gets committed to the 
trunk, so I'd rather not do any more with it now.

Hopefully, this will be the minimum required.
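
For readers following the link error quoted below, here is a minimal sketch of the usual cure for a "multiple definition" failure: declare the symbol in a shared header and define it in exactly one object. All names here (demo_info_type, demo_info_show_version, demo_tool_reports_orte) are hypothetical stand-ins, not the actual orte-info sources, and the three "files" are collapsed into one translation unit purely for illustration.

```c
#include <assert.h>
#include <string.h>

/* ---- would live in a shared header (hypothetical demo_info.h) ---- */
extern const char *demo_info_type;          /* declaration only */
const char *demo_info_show_version(void);   /* prototype only */

/* ---- would live in exactly ONE library source file ---- */
const char *demo_info_type = "orte";        /* the single definition */

const char *demo_info_show_version(void)
{
    /* The library owns both the constant and the function body. */
    assert(demo_info_type != NULL);
    return demo_info_type;
}

/* ---- the tool includes the header and calls the library copy,
 *      instead of carrying its own duplicate definition ---- */
int demo_tool_reports_orte(void)
{
    return strcmp(demo_info_show_version(), "orte") == 0;
}
```

With a static library, a second definition of either symbol in the tool's own objects would reproduce the kind of "first defined here" error shown in the quoted log.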


On Jan 22, 2013, at 4:20 PM, Paul Hargrove  wrote:

> I am using the openmpi-1.9a1r27886 tarball and I still see an error for one 
> of the two duplicate symbols:
> 
>   CCLD orte-info
> ../../../orte/.libs/libopen-rte.a(orte_info_support.o): In function 
> `orte_info_show_orte_version':
> ../../orte/runtime/orte_info_support.c:(.text+0xe10): multiple definition of 
> `orte_info_show_orte_version'
> version.o:../../../../orte/tools/orte-info/version.c:(.text+0x2370): first 
> defined here
> 
> -Paul
> 
> 
> On Fri, Jan 18, 2013 at 3:52 AM, George Bosilca  wrote:
> Luckily for us all the definitions contain the same constant (orte). r27864 
> should fix this.
> 
>   George.
> 
> 
> On Jan 18, 2013, at 06:21 , Paul Hargrove  wrote:
> 
>> My employer has a nice new Cray XC30 (aka Cascade), and I thought I'd give 
>> Open MPI a quick test.
>> 
>> Given that it is INTENDED to be API-compatible with the XE series, I began 
>> configuring with
>> CC=cc CXX=CC FC=ftn --with-platform=lanl/cray_xe6/optimized-nopanasas
>> However, since this is Intel h/w, I commented-out the following 2 lines in 
>> the platform file:
>> with_wrapper_cflags="-march=amdfam10"
>> CFLAGS=-march=amdfam10
>> 
>> I am using PrgEnv-gnu/5.0.15, though PrgEnv-intel is the default on our 
>> system
>> 
>> As far as I know, use of 1.6.x is out - no ugni at all, right?
>> So, I didn't even try.
>> 
>> I gave openmpi-1.7rc6 a try, but the ALPS headers and libs have moved (as 
>> mentioned in ompi-trunk/config/orte_check_alps.m4).
>> Perhaps one should CMR the updated-for-CLE-5 configure logic to the 1.7 
>> branch?
>> 
>> Next, I tried a trunk nightly tarball: openmpi-1.9a1r27862.tar.bz2
>> As I mentioned above, the trunk has the right logic for locating ALPS.
>> However, it looks like there is some untested code, protected by "#if 
>> WANT_CRAY_PMI2_EXT", that needs work:
>> 
>> make[2]: Entering directory 
>> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte/mca/db/pmi'
>>   CC   db_pmi_component.lo
>>   CC   db_pmi.lo
>> ../../../../../orte/mca/db/pmi/db_pmi.c: In function 'store':
>> ../../../../../orte/mca/db/pmi/db_pmi.c:202: error: 'ptr' undeclared (first 
>> use in this function)
>> ../../../../../orte/mca/db/pmi/db_pmi.c:202: error: (Each undeclared 
>> identifier is reported only once
>> ../../../../../orte/mca/db/pmi/db_pmi.c:202: error: for each function it 
>> appears in.)
>> make[2]: *** [db_pmi.lo] Error 1
>> make[2]: Leaving directory 
>> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte/mca/db/pmi'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte'
>> make: *** [all-recursive] Error 1
>> 
>> I added the missing "char *ptr" declaration a few lines before its first 
>> use, and resumed the build.
>> This time the build terminated at
>> 
>> make[2]: Entering directory 
>> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/opal/tools/wrappers'
>>   CC   opal_wrapper.o
>>   CCLD opal_wrapper
>> /usr/bin/ld: attempted static link of dynamic object 
>> `../../../opal/.libs/libopen-pal.so'
>> collect2: error: ld returned 1 exit status
>> 
>> So I went back to the platform file and changed
>>     enable_shared=yes
>> to
>>     enable_shared=no
>> No big deal there - I had to make the same change for our XE6.
>> 
>> And so I started back at configure (after a "make distclean", to be safe), 
>> and here is the next error:
>> 
>> Making all in tools/orte-info
>> make[2]: Entering directory 
>> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte/tools/orte-info'
>>   CCLD orte-info
>> ../../../orte/.libs/libopen-rte.a(orte_info_support.o): In function 
>> `orte_info_show_orte_version':
>> orte_info_support.c:(.text+0xd70): multiple definition of 
>> `orte_info_show_orte_version'
>> version.o:version.c:(.text+0x4b0): first defined here
>> ../../../orte/.libs/libopen-rte.a(orte_info_support.o):(.data+0x0): multiple 
>> definition of `orte_info_type_orte'
>> orte-info.o:(.data+0x10): first defined here
>> /usr/bin/ld: link errors found, deleting executable `orte-info'
>> collect2: error: ld returned 1 exit status
>> make[2]: *** [orte-info] Error 1
>> 
>> I am not sure how to fix this, but I would guess this is probably a simple 
>> fix for somebody who knows OMPI's build infrastructure better than I.
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies 

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-22 Thread Kawashima, Takahiro
George,

I reported this bug three months ago.
Your commit r27880 resolves one of the bugs I reported,
though with a different approach.

  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php

But other bugs are still open.

"(1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE."
from my previous mail is not fixed yet. It can be fixed by my patch
(the ompi/mpi/c/wait.c and ompi/request/request.c parts only) attached
to my other mail.

  http://www.open-mpi.org/community/lists/devel/2012/10/11561.php

"(2) MPI_Status for an inactive request must be an empty status."
from my previous mail is partially fixed. MPI_Wait is fixed by your
r27880, but MPI_Waitall and MPI_Testall still need to be fixed.
Code similar to your r27880 should be inserted into
ompi_request_default_wait_all and ompi_request_default_test_all.

You can confirm the fixes with the test program status.c attached to
my previous mail. Run it with -n 2.

  http://www.open-mpi.org/community/lists/devel/2012/10/11555.php

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu
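
As an illustration of point (2), the following is a small self-contained model of a wait-all loop that returns the empty status for inactive requests. It is only a sketch under stated assumptions: every demo_* type, constant, and function is a hypothetical stand-in (the numeric values of DEMO_ANY_SOURCE and DEMO_ANY_TAG are arbitrary), it is not the code of ompi_request_default_wait_all, and an active request is simply treated as already complete rather than waited on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the MPI constants; values are arbitrary. */
#define DEMO_ANY_SOURCE (-1)
#define DEMO_ANY_TAG    (-1)
#define DEMO_SUCCESS    0

typedef enum { DEMO_REQ_INACTIVE, DEMO_REQ_ACTIVE } demo_req_state_t;

typedef struct {
    int    source;
    int    tag;
    int    error;
    size_t count;
    int    cancelled;
} demo_status_t;

typedef struct {
    int              persistent;
    demo_req_state_t state;
    demo_status_t    status;   /* status of the last completed operation */
} demo_request_t;

/* Empty status: any source, any tag, success, zero count, not cancelled. */
static const demo_status_t demo_status_empty = {
    DEMO_ANY_SOURCE, DEMO_ANY_TAG, DEMO_SUCCESS, 0, 0
};

/* statuses may be NULL, modelling MPI_STATUSES_IGNORE. */
int demo_wait_all(size_t n, demo_request_t *reqs[], demo_status_t *statuses)
{
    assert(reqs != NULL);
    for (size_t i = 0; i < n; i++) {
        demo_request_t *req = reqs[i];
        if (req->state == DEMO_REQ_INACTIVE) {
            /* Same idea as the r27880 change, applied per array element. */
            if (statuses != NULL) {
                statuses[i] = demo_status_empty;
            }
            continue;
        }
        /* Model only: a real implementation would block here until done. */
        if (statuses != NULL) {
            statuses[i] = req->status;
        }
        req->state = DEMO_REQ_INACTIVE;
    }
    return DEMO_SUCCESS;
}
```

A test-all variant would presumably need the same per-element check before reporting completion.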

> To be honest it was hanging in one of my repos for some time. If I'm not 
> mistaken it is somehow related to one active ticket (but I couldn't find the 
> info). It might be good to push it upstream.
> 
>   George.
> 
> On Jan 22, 2013, at 16:27 , "Jeff Squyres (jsquyres)"  
> wrote:
> 
> > George --
> > 
> > Is there any reason not to CMR this to v1.6 and v1.7?
> > 
> > 
> > On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:
> > 
> >> Author: bosilca (George Bosilca)
> >> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
> >> New Revision: 27880
> >> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
> >> 
> >> Log:
> >> My understanding is that an MPI_WAIT() on an inactive request should
> >> return the empty status (MPI 3.0 page 52 line 46).
> >> 
> >> Text files modified: 
> >>  trunk/ompi/request/req_wait.c | 3 +++ 
> >> 
> >>  1 files changed, 3 insertions(+), 0 deletions(-)
> >> 
> >> Modified: trunk/ompi/request/req_wait.c
> >> ==
> >> --- trunk/ompi/request/req_wait.c  Sat Jan 19 19:33:42 2013(r27879)
> >> +++ trunk/ompi/request/req_wait.c  2013-01-21 06:35:42 EST (Mon, 21 Jan 
> >> 2013)  (r27880)
> >> @@ -61,6 +61,9 @@
> >>}
> >>if( req->req_persistent ) {
> >>if( req->req_state == OMPI_REQUEST_INACTIVE ) {
> >> +if (MPI_STATUS_IGNORE != status) {
> >> +*status = ompi_status_empty;
> >> +}
> >>return OMPI_SUCCESS;
> >>}
> >>req->req_state = OMPI_REQUEST_INACTIVE;


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-22 Thread George Bosilca
To be honest it was hanging in one of my repos for some time. If I'm not 
mistaken it is somehow related to one active ticket (but I couldn't find the 
info). It might be good to push it upstream.

  George.

On Jan 22, 2013, at 16:27 , "Jeff Squyres (jsquyres)"  
wrote:

> George --
> 
> Is there any reason not to CMR this to v1.6 and v1.7?
> 
> 
> On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:
> 
>> Author: bosilca (George Bosilca)
>> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
>> New Revision: 27880
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
>> 
>> Log:
>> My understanding is that an MPI_WAIT() on an inactive request should
>> return the empty status (MPI 3.0 page 52 line 46).
>> 
>> Text files modified: 
>>  trunk/ompi/request/req_wait.c | 3 +++   
>>   
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>> 
>> Modified: trunk/ompi/request/req_wait.c
>> ==
>> --- trunk/ompi/request/req_wait.cSat Jan 19 19:33:42 2013(r27879)
>> +++ trunk/ompi/request/req_wait.c2013-01-21 06:35:42 EST (Mon, 21 Jan 
>> 2013)  (r27880)
>> @@ -61,6 +61,9 @@
>>}
>>if( req->req_persistent ) {
>>if( req->req_state == OMPI_REQUEST_INACTIVE ) {
>> +if (MPI_STATUS_IGNORE != status) {
>> +*status = ompi_status_empty;
>> +}
>>return OMPI_SUCCESS;
>>}
>>req->req_state = OMPI_REQUEST_INACTIVE;
>> ___
>> svn-full mailing list
>> svn-f...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27881 - trunk/ompi/mca/btl/tcp

2013-01-22 Thread George Bosilca
Nobody has cared about error cases so far, so I personally don't see any 
incentive to push this patch into 1.7 right now. But I won't object, as it 
does no harm either.

  George.
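
For readers skimming the quoted diff, the core idea of r27881 can be modelled in a few lines: when an endpoint is closed in the failed state, every pending fragment gets its completion callback invoked with an "unreachable" error. This is a simplified sketch, not the TCP BTL code; all demo_* names and the DEMO_ERR_UNREACH value are hypothetical, and a plain singly linked list stands in for the opal list.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERR_UNREACH (-12)   /* hypothetical error code */

typedef enum { DEMO_EP_CLOSED, DEMO_EP_CONNECTED, DEMO_EP_FAILED } demo_ep_state_t;

typedef struct demo_frag {
    struct demo_frag *next;
    void (*cbfunc)(struct demo_frag *frag, int status);
    int last_status;             /* written by the callback below */
} demo_frag_t;

typedef struct {
    demo_ep_state_t state;
    demo_frag_t *send_frag;      /* fragment currently being sent, if any */
    demo_frag_t *pending;        /* head of the queue of waiting fragments */
} demo_endpoint_t;

/* Remove and return the first queued fragment, or NULL if none is left. */
static demo_frag_t *demo_pop(demo_endpoint_t *ep)
{
    demo_frag_t *frag = ep->pending;
    if (frag != NULL) {
        ep->pending = frag->next;
        frag->next = NULL;
    }
    return frag;
}

/* Example callback: just remember the status that was reported. */
void demo_record_status(demo_frag_t *frag, int status)
{
    frag->last_status = status;
}

void demo_endpoint_close(demo_endpoint_t *ep)
{
    assert(ep != NULL);
    if (DEMO_EP_FAILED == ep->state) {
        /* Start with the in-flight fragment, then drain the queue.
         * Clearing send_frag is a choice of this model only. */
        demo_frag_t *frag = ep->send_frag;
        ep->send_frag = NULL;
        if (NULL == frag) {
            frag = demo_pop(ep);
        }
        while (NULL != frag) {
            frag->cbfunc(frag, DEMO_ERR_UNREACH);
            frag = demo_pop(ep);
        }
    } else {
        ep->state = DEMO_EP_CLOSED;
    }
}
```

As in the diff, the endpoint stays in the failed state after the drain, so an upper layer can tell a failure from an ordinary close.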


On Jan 22, 2013, at 16:28 , "Jeff Squyres (jsquyres)"  
wrote:

> George --
> 
> Similar question on this one: should it be CMR'ed to v1.7?  (I kinda doubt 
> it's appropriate for v1.6)
> 
> 
> On Jan 21, 2013, at 6:41 AM, svn-commit-mai...@open-mpi.org wrote:
> 
>> Author: bosilca (George Bosilca)
>> Date: 2013-01-21 06:41:08 EST (Mon, 21 Jan 2013)
>> New Revision: 27881
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27881
>> 
>> Log:
>> Make the TCP BTL really fail-safe. It now triggers the error callback on
>> all pending fragments when the destination goes down. This allows the PML
>> to recalibrate its behavior, either find an alternate route or just give up.
>> 
>> Text files modified: 
>>  trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c |29 
>> +++--   
>>  trunk/ompi/mca/btl/tcp/btl_tcp_frag.c | 7 ++-   
>>   
>>  trunk/ompi/mca/btl/tcp/btl_tcp_proc.c | 2 +-
>>   
>>  3 files changed, 34 insertions(+), 4 deletions(-)
>> 
>> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c
>> ==
>> --- trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.cMon Jan 21 06:35:42 
>> 2013(r27880)
>> +++ trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c2013-01-21 06:41:08 EST 
>> (Mon, 21 Jan 2013)  (r27881)
>> @@ -2,7 +2,7 @@
>> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> * University Research and Technology
>> * Corporation.  All rights reserved.
>> - * Copyright (c) 2004-2008 The University of Tennessee and The University
>> + * Copyright (c) 2004-2013 The University of Tennessee and The University
>> * of Tennessee Research Foundation.  All rights
>> * reserved.
>> * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
>> @@ -295,6 +295,7 @@
>>if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
>> opal_socket_errno != EWOULDBLOCK) {
>>BTL_ERROR(("send() failed: %s (%d)",
>>   strerror(opal_socket_errno), opal_socket_errno));
>> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
>>mca_btl_tcp_endpoint_close(btl_endpoint);
>>return -1;
>>}
>> @@ -359,6 +360,7 @@
>>mca_btl_tcp_endpoint_close(btl_endpoint);
>>btl_endpoint->endpoint_sd = sd;
>>if(mca_btl_tcp_endpoint_send_connect_ack(btl_endpoint) != 
>> OMPI_SUCCESS) {
>> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
>>mca_btl_tcp_endpoint_close(btl_endpoint);
>>OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_send_lock);
>>OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_recv_lock);
>> @@ -389,7 +391,6 @@
>> {
>>if(btl_endpoint->endpoint_sd < 0)
>>return;
>> -btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
>>btl_endpoint->endpoint_retries++;
>>opal_event_del(&btl_endpoint->endpoint_recv_event);
>>opal_event_del(&btl_endpoint->endpoint_send_event);
>> @@ -401,6 +402,24 @@
>>btl_endpoint->endpoint_cache_pos= NULL;
>>btl_endpoint->endpoint_cache_length = 0;
>> #endif  /* MCA_BTL_TCP_ENDPOINT_CACHE */
>> +/**
>> + * If we keep failing to connect to the peer let the caller know about
>> + * this situation by triggering all the pending fragments callback and
>> + * reporting the error.
>> + */
>> +if( MCA_BTL_TCP_FAILED == btl_endpoint->endpoint_state ) {
>> +mca_btl_tcp_frag_t* frag = btl_endpoint->endpoint_send_frag;
>> +if( NULL == frag ) 
>> +frag = 
>> (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
>> +while(NULL != frag) {
>> +frag->base.des_cbfunc(&frag->btl->super, frag->endpoint, 
>> &frag->base, OMPI_ERR_UNREACH);
>> +
>> +frag = 
>> (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
>> +}
>> +} else {
>> +btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
>> +}
>> +
>> }
>> 
>> /*
>> @@ -444,6 +463,7 @@
>> 
>>/* remote closed connection */
>>if(retval == 0) {
>> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
>>mca_btl_tcp_endpoint_close(btl_endpoint);
>>return -1;
>>}
>> @@ -453,6 +473,7 @@
>>if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
>> opal_socket_errno != EWOULDBLOCK) {
>>BTL_ERROR(("recv(%d) failed: %s (%d)",
>>   btl_endpoint->endpoint_sd, 
>> strerror(opal_socket_errno), opal_socket_errno));
>> +

Re: [OMPI devel] Open MPI (not quite) on Cray XC30

2013-01-22 Thread Paul Hargrove
I am using the openmpi-1.9a1r27886 tarball and I still see an error for one
of the two duplicate symbols:

  CCLD orte-info
../../../orte/.libs/libopen-rte.a(orte_info_support.o): In function
`orte_info_show_orte_version':
../../orte/runtime/orte_info_support.c:(.text+0xe10): multiple definition
of `orte_info_show_orte_version'
version.o:../../../../orte/tools/orte-info/version.c:(.text+0x2370): first
defined here

-Paul


On Fri, Jan 18, 2013 at 3:52 AM, George Bosilca  wrote:

> Luckily for us all the definitions contain the same constant (orte).
> r27864 should fix this.
>
>   George.
>
>
> On Jan 18, 2013, at 06:21 , Paul Hargrove  wrote:
>
> My employer has a nice new Cray XC30 (aka Cascade), and I thought I'd give
> Open MPI a quick test.
>
> Given that it is INTENDED to be API-compatible with the XE series, I began
> configuring with
> CC=cc CXX=CC FC=ftn --with-platform=lanl/cray_xe6/optimized-nopanasas
> However, since this is Intel h/w, I commented-out the following 2 lines in
> the platform file:
> with_wrapper_cflags="-march=amdfam10"
> CFLAGS=-march=amdfam10
>
> I am using PrgEnv-gnu/5.0.15, though PrgEnv-intel is the default on our
> system
>
> As far as I know, use of 1.6.x is out - no ugni at all, right?
> So, I didn't even try.
>
> I gave openmpi-1.7rc6 a try, but the ALPS headers and libs have moved (as
> mentioned in ompi-trunk/config/orte_check_alps.m4).
> Perhaps one should CMR the updated-for-CLE-5 configure logic to the 1.7
> branch?
>
> Next, I tried a trunk nightly tarball: openmpi-1.9a1r27862.tar.bz2
> As I mentioned above, the trunk has the right logic for locating ALPS.
> However, it looks like there is some untested code, protected by "#if
> WANT_CRAY_PMI2_EXT", that needs work:
>
> make[2]: Entering directory
> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte/mca/db/pmi'
>   CC   db_pmi_component.lo
>   CC   db_pmi.lo
> ../../../../../orte/mca/db/pmi/db_pmi.c: In function 'store':
> ../../../../../orte/mca/db/pmi/db_pmi.c:202: error: 'ptr' undeclared
> (first use in this function)
> ../../../../../orte/mca/db/pmi/db_pmi.c:202: error: (Each undeclared
> identifier is reported only once
> ../../../../../orte/mca/db/pmi/db_pmi.c:202: error: for each function it
> appears in.)
> make[2]: *** [db_pmi.lo] Error 1
> make[2]: Leaving directory
> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte/mca/db/pmi'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte'
> make: *** [all-recursive] Error 1
>
> I added the missing "char *ptr" declaration a few lines before its first
> use, and resumed the build.
> This time the build terminated at
>
> make[2]: Entering directory
> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/opal/tools/wrappers'
>   CC   opal_wrapper.o
>   CCLD opal_wrapper
> /usr/bin/ld: attempted static link of dynamic object
> `../../../opal/.libs/libopen-pal.so'
> collect2: error: ld returned 1 exit status
>
> So I went back to the platform file and changed
>     enable_shared=yes
> to
>     enable_shared=no
> No big deal there - I had to make the same change for our XE6.
>
> And so I started back at configure (after a "make distclean", to be safe),
> and here is the next error:
>
> Making all in tools/orte-info
> make[2]: Entering directory
> `/global/scratch/sd/hargrove/OMPI/openmpi-1.9a1r27862/BUILD/orte/tools/orte-info'
>   CCLD orte-info
> ../../../orte/.libs/libopen-rte.a(orte_info_support.o): In function
> `orte_info_show_orte_version':
> orte_info_support.c:(.text+0xd70): multiple definition of
> `orte_info_show_orte_version'
> version.o:version.c:(.text+0x4b0): first defined here
> ../../../orte/.libs/libopen-rte.a(orte_info_support.o):(.data+0x0):
> multiple definition of `orte_info_type_orte'
> orte-info.o:(.data+0x10): first defined here
> /usr/bin/ld: link errors found, deleting executable `orte-info'
> collect2: error: ld returned 1 exit status
> make[2]: *** [orte-info] Error 1
>
> I am not sure how to fix this, but I would guess this is probably a simple
> fix for somebody who knows OMPI's build infrastructure better than I.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: 

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27881 - trunk/ompi/mca/btl/tcp

2013-01-22 Thread Jeff Squyres (jsquyres)
George --

Similar question on this one: should it be CMR'ed to v1.7?  (I kinda doubt it's 
appropriate for v1.6)


On Jan 21, 2013, at 6:41 AM, svn-commit-mai...@open-mpi.org wrote:

> Author: bosilca (George Bosilca)
> Date: 2013-01-21 06:41:08 EST (Mon, 21 Jan 2013)
> New Revision: 27881
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27881
> 
> Log:
> Make the TCP BTL really fail-safe. It now triggers the error callback on
> all pending fragments when the destination goes down. This allows the PML
> to recalibrate its behavior, either find an alternate route or just give up.
> 
> Text files modified: 
>   trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c |29 
> +++--   
>   trunk/ompi/mca/btl/tcp/btl_tcp_frag.c | 7 ++-   
>   
>   trunk/ompi/mca/btl/tcp/btl_tcp_proc.c | 2 +-
>   
>   3 files changed, 34 insertions(+), 4 deletions(-)
> 
> Modified: trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c
> ==
> --- trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c Mon Jan 21 06:35:42 2013
> (r27880)
> +++ trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c 2013-01-21 06:41:08 EST (Mon, 
> 21 Jan 2013)  (r27881)
> @@ -2,7 +2,7 @@
>  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>  * University Research and Technology
>  * Corporation.  All rights reserved.
> - * Copyright (c) 2004-2008 The University of Tennessee and The University
> + * Copyright (c) 2004-2013 The University of Tennessee and The University
>  * of Tennessee Research Foundation.  All rights
>  * reserved.
>  * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
> @@ -295,6 +295,7 @@
> if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
> opal_socket_errno != EWOULDBLOCK) {
> BTL_ERROR(("send() failed: %s (%d)",
>strerror(opal_socket_errno), opal_socket_errno));
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> return -1;
> }
> @@ -359,6 +360,7 @@
> mca_btl_tcp_endpoint_close(btl_endpoint);
> btl_endpoint->endpoint_sd = sd;
> if(mca_btl_tcp_endpoint_send_connect_ack(btl_endpoint) != 
> OMPI_SUCCESS) {
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_send_lock);
> OPAL_THREAD_UNLOCK(&btl_endpoint->endpoint_recv_lock);
> @@ -389,7 +391,6 @@
> {
> if(btl_endpoint->endpoint_sd < 0)
> return;
> -btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
> btl_endpoint->endpoint_retries++;
> opal_event_del(&btl_endpoint->endpoint_recv_event);
> opal_event_del(&btl_endpoint->endpoint_send_event);
> @@ -401,6 +402,24 @@
> btl_endpoint->endpoint_cache_pos= NULL;
> btl_endpoint->endpoint_cache_length = 0;
> #endif  /* MCA_BTL_TCP_ENDPOINT_CACHE */
> +/**
> + * If we keep failing to connect to the peer let the caller know about
> + * this situation by triggering all the pending fragments callback and
> + * reporting the error.
> + */
> +if( MCA_BTL_TCP_FAILED == btl_endpoint->endpoint_state ) {
> +mca_btl_tcp_frag_t* frag = btl_endpoint->endpoint_send_frag;
> +if( NULL == frag ) 
> +frag = 
> (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
> +while(NULL != frag) {
> +frag->base.des_cbfunc(&frag->btl->super, frag->endpoint, 
> &frag->base, OMPI_ERR_UNREACH);
> +
> +frag = 
> (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
> +}
> +} else {
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_CLOSED;
> +}
> +
> }
> 
> /*
> @@ -444,6 +463,7 @@
> 
> /* remote closed connection */
> if(retval == 0) {
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> return -1;
> }
> @@ -453,6 +473,7 @@
> if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && 
> opal_socket_errno != EWOULDBLOCK) {
> BTL_ERROR(("recv(%d) failed: %s (%d)",
>btl_endpoint->endpoint_sd, 
> strerror(opal_socket_errno), opal_socket_errno));
> +btl_endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
> mca_btl_tcp_endpoint_close(btl_endpoint);
> return -1;
> }
> @@ -589,6 +610,7 @@
> address,
>btl_endpoint->endpoint_addr->addr_port, 
> strerror(opal_socket_errno) ) );
> }
> +btl_endpoint->endpoint_state = 

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27880 - trunk/ompi/request

2013-01-22 Thread Jeff Squyres (jsquyres)
George --

Is there any reason not to CMR this to v1.6 and v1.7?


On Jan 21, 2013, at 6:35 AM, svn-commit-mai...@open-mpi.org wrote:

> Author: bosilca (George Bosilca)
> Date: 2013-01-21 06:35:42 EST (Mon, 21 Jan 2013)
> New Revision: 27880
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27880
> 
> Log:
> My understanding is that an MPI_WAIT() on an inactive request should
> return the empty status (MPI 3.0 page 52 line 46).
> 
> Text files modified: 
>   trunk/ompi/request/req_wait.c | 3 +++   
>   
>   1 files changed, 3 insertions(+), 0 deletions(-)
> 
> Modified: trunk/ompi/request/req_wait.c
> ==
> --- trunk/ompi/request/req_wait.c Sat Jan 19 19:33:42 2013(r27879)
> +++ trunk/ompi/request/req_wait.c 2013-01-21 06:35:42 EST (Mon, 21 Jan 
> 2013)  (r27880)
> @@ -61,6 +61,9 @@
> }
> if( req->req_persistent ) {
> if( req->req_state == OMPI_REQUEST_INACTIVE ) {
> +if (MPI_STATUS_IGNORE != status) {
> +*status = ompi_status_empty;
> +}
> return OMPI_SUCCESS;
> }
> req->req_state = OMPI_REQUEST_INACTIVE;


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] New ARM patch

2013-01-22 Thread Jeff Squyres (jsquyres)
Leif --

We talked about this a bit on our weekly call today. 

Just to be sure: are you saying that George's patches are *functionally 
correct* for ARMv5/6/7 (and broken for ARMv4), but that the code could be 
better organized?

If that is correct, was ARMv4 working before?

If ARMv4 was working before, how important is it?  I.e., would it be ok to 
accept George's stuff for 1.7.0, and then accept any 
improvements/reshuffle/etc. from you for 1.7.1?
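
For context on the macro-based separation suggested in the quoted message, here is a rough sketch of selecting a memory barrier per architecture version inside a single source file. It rests on several assumptions: the compiler defines the ACLE macro __ARM_ARCH (recent GCC and Clang do), the ARMv6 target is an A-profile core with the CP15 barrier operation, and all DEMO_* and demo_* names are hypothetical. It is not the code from r27882 or from either Fedora-list proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Choose a full memory barrier with macros rather than with one source
 * file per architecture version. */
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
   /* ARMv7 and 32-bit ARMv8: dedicated barrier instruction. */
#  define DEMO_MB()     __asm__ __volatile__ ("dmb" : : : "memory")
#  define DEMO_MB_KIND  "dmb"
#elif defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH == 6)
   /* ARMv6: data memory barrier through the CP15 coprocessor. */
#  define DEMO_MB()     __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
                                              : : "r" (0) : "memory")
#  define DEMO_MB_KIND  "cp15"
#else
   /* ARMv5 and earlier, or a non-ARM host: fall back to the compiler builtin. */
#  define DEMO_MB()     __sync_synchronize()
#  define DEMO_MB_KIND  "builtin"
#endif

/* Report which variant was compiled in. */
const char *demo_barrier_kind(void)
{
    return DEMO_MB_KIND;
}

/* Write the payload, then publish the ready flag after a barrier. */
void demo_publish(int *payload, volatile int *ready, int value)
{
    assert(payload != NULL && ready != NULL);
    *payload = value;
    DEMO_MB();
    *ready = 1;
}
```

The same structure would extend to the atomic operations themselves, with the pre-LL/SC path kept apart from the ARMv6-and-later path.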



On Jan 21, 2013, at 12:15 PM, Leif Lindholm  wrote:

> Hi George,
> 
> Any chance of r27882 being reverted?
> 
> As I told the Fedora guys when that patch originally surfaced[1],
> I'm not overly fond of
> - copying source files around as part of the configure step
> - having separate source files for ARMv6 and ARMv7, when those differences
>  should be easily separated through macros (and would be reusable for 32-bit
>  ARMv8).
> 
> Also, I might have mentioned that bit only on a separate thread on the Fedora 
> list, but the ARMv4 support isn't actually correct (the ASM uses ARMv5-only 
> operations).
> 
> My alternate solution, the basic idea of which I posted over there [2], was to 
> separate ARMv5 and earlier from the rest of ARM, effectively splitting the 
> atomics implementation at the boundary where the ARM architecture got 
> load-linked/store-conditional, rather than having a separate source file for 
> every architecture version.
> 
> [1] https://lists.fedoraproject.org/pipermail/arm/2012-November/004434.html
> [2] https://lists.fedoraproject.org/pipermail/arm/2012-November/004460.html
> 
> Best Regards,
> 
> Leif
> 


-- 
Jeff Squyres
jsquy...@cisco.com