Thanks Paul,

Are you invoking mpirun on pcp-j-20 ?
If yes, what does
getent hosts pcp-j-20
say ?
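
For reference, `getent hosts` prints what the system resolver returns for a name. A rough Python equivalent is sketched below. It also reports which of the subnets from the ifconfig output (172.16.0.0/16 on bge0, 172.18.0.0/16 on pFFFF.ibp0) each resolved address falls in. The function names are made up for illustration, and the interface table is copied by hand from the quoted output rather than read from the system, so treat it as an assumption.

```python
import ipaddress
import socket

# Subnets copied by hand from the ifconfig output quoted in this thread
# (assumption: they still match the configuration on pcp-j-20).
INTERFACES = {
    "bge0": ipaddress.ip_network("172.16.0.0/16"),
    "pFFFF.ibp0": ipaddress.ip_network("172.18.0.0/16"),
    "lo0": ipaddress.ip_network("127.0.0.0/8"),
}


def interface_for(address):
    """Return the name of the interface whose subnet contains address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in INTERFACES.items():
        if ip in network:
            return name
    return None


def resolve(hostname):
    """Return the sorted IPv4 addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    host = socket.gethostname()
    try:
        addresses = resolve(host)
    except OSError as exc:
        # socket.gaierror is a subclass of OSError.
        print(f"{host}: resolution failed: {exc}")
    else:
        for address in addresses:
            print(f"{address}\t{host}\t{interface_for(address)}")
```

If the name resolves to the 172.18.x.x address while mpirun is expected to use bge0, that mismatch would be worth mentioning in the reply.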

BTW, did you try without -m64 ?

Do the following work ?
ping 172.18.0.120
ssh 172.18.0.120
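
If ping is filtered, a plain TCP connection attempt answers the same question. The sketch below is only an illustration (the function name is hypothetical) and it checks nothing specific to Open MPI: it just reports whether a TCP connection to a given host and port completes within a timeout.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, `can_connect("172.18.0.120", 22)` would probe the ssh port, assuming sshd listens on its default port there.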

Honestly, this output makes very little sense to me, so i am asking for way 
too much info, hoping i can reproduce this issue or get a hint about what 
could possibly be going wrong.
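
One of the quoted reports shows accept() failing with errno 11, identified there as EAGAIN. On a non-blocking listening socket, that value normally just means no connection is pending yet, so the caller is expected to retry. The sketch below illustrates that behaviour in isolation. It is not taken from the Open MPI source, and all function names are hypothetical.

```python
import select
import socket


def make_listener():
    """Create a non-blocking TCP listener on an ephemeral loopback port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    listener.setblocking(False)
    return listener


def try_accept(listener):
    """Accept one pending connection, or return None if nothing is pending.

    Python raises BlockingIOError when accept() on a non-blocking socket
    reports EAGAIN/EWOULDBLOCK. That is a retry condition, not a failure.
    """
    try:
        conn, _peer = listener.accept()
    except BlockingIOError:
        return None
    return conn


def accept_with_retry(listener, timeout=2.0):
    """Wait until the listener is readable, then accept; None on timeout."""
    readable, _, _ = select.select([listener], [], [], timeout)
    if not readable:
        return None
    return try_accept(listener)
```

Calling `try_accept` on a fresh listener returns None, while `accept_with_retry` returns a connected socket once a client has connected.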

Cheers,

Gilles

Paul Hargrove <phhargr...@lbl.gov> wrote:
>Gilles,
>
>
>I am running mpirun on a host that ALSO will run one of the application 
>processes.
>
>Requested ifconfig and netstat outputs appear below.
>
>
>-Paul
>
>
>[phargrov@pcp-j-20 ~]$ ifconfig -a
>lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>        inet 127.0.0.1 netmask ff000000
>bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>        inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
>pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
>        inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
>lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
>        inet6 ::1/128
>bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>        inet6 fe80::250:45ff:fe5c:2b0/10
>
>[phargrov@pcp-j-20 ~]$ netstat -nr
>
>Routing Table: IPv4
>  Destination           Gateway           Flags  Ref     Use     Interface
>-------------------- -------------------- ----- ----- ---------- ---------
>default              172.16.254.1         UG        2     158463 bge0
>127.0.0.1            127.0.0.1            UH        5     398913 lo0
>172.16.0.0           172.16.0.120         U         4  135241319 bge0
>172.18.0.0           172.18.0.120         U         3         26 pFFFF.ibp0
>
>Routing Table: IPv6
>  Destination/Mask            Gateway                   Flags Ref   Use    If
>--------------------------- --------------------------- ----- --- ------- -----
>::1                         ::1                         UH      2       0 lo0
>fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0 bge0
>
>
>On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>Paul,
>
>could you please send the output of
>ifconfig -a
>netstat -nr
>
>on the three hosts you are using
>(i assume you are still invoking mpirun from one node, and tasks are running 
>on two other nodes)
>
>Cheers,
>
>Gilles
>
>
>
>On 2014/12/16 16:00, Paul Hargrove wrote:
>
>Gilles,
>
>I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most 
>compilations. It appears to be used for building libevent and vt, but 
>nothing else. The output from configure contains
>
>    checking if more special flags are required for pthreads... -D_REENTRANT
>
>only in the libevent and vt sub-configure portions.
>
>When configured for gcc on Solaris-11 I see the following in configure
>
>    checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
>
>but with CC=cc the equivalent line is
>
>    checking for C optimization flags... -m64 -g
>
>In both cases the "-m64" is from the CFLAGS I have passed to configure.
>
>However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away. 
>I see
>
>    [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
>    ------------------------------------------------------------
>    A process or daemon was unable to complete a TCP connection
>    to another process:
>      Local host:    pcp-j-20
>      Remote host:   172.18.0.120
>    This is usually caused by a firewall on the remote host. Please
>    check that any firewall (e.g., iptables) has been disabled and
>    try again.
>    ------------------------------------------------------------
>
>which at least appears to have a non-zero errno. A quick grep through 
>/usr/include/sys/errno shows 11 is EAGAIN.
>
>With the oob.patch you provided the failed accept goes away, BUT the 
>connection still fails:
>
>    ------------------------------------------------------------
>    A process or daemon was unable to complete a TCP connection
>    to another process:
>      Local host:    pcp-j-20
>      Remote host:   172.18.0.120
>    This is usually caused by a firewall on the remote host. Please
>    check that any firewall (e.g., iptables) has been disabled and
>    try again.
>    ------------------------------------------------------------
>
>Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix 
>this.
>
>-Paul
>
>On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Gilles,
>
>I am NOT seeing the problem with gcc. It is only occurring with the Studio 
>compilers. As I've already reported, I have tried adding either "-mt" or 
>"-mt=yes" to both LDFLAGS and --with-wrapper-ldflags.
>
>The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>
>    -mt  Compile and link for multithreaded code.
>
>         This option passes -D_REENTRANT to the preprocessor and passes
>         -lthread in the correct order to ld.
>
>         The -mt option is required if the application or libraries are
>         multithreaded.
>
>         To ensure proper library linking order, you must use this option,
>         rather than -lthread, to link with libthread.
>
>         If you are using POSIX threads, you must link with the options
>         -mt -lpthread. The -mt option is necessary because libC and
>         libCrun need libthread for a multithreaded application.
>
>         If you compile and link in separate steps and you compile with
>         -mt, you might get unexpected results. If you compile one
>         translation unit with -mt, compile all units of the program
>         with -mt.
>
>I cannot connect to my Solaris-11 system right now, but I recall the text to 
>be quite similar.
>
>-Paul
>
>On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>Paul,
>
>did you manually set -mt ?
>if i remember correctly, solaris 11 (at least with gcc compilers) does not 
>need any flags (except the -D_REENTRANT that is added automatically)
>
>Cheers,
>
>Gilles
>
>On 2014/12/16 12:10, Paul Hargrove wrote:
>
>Gilles,
>
>I will try the patch when I can. However, our network is undergoing network 
>maintenance right now, leaving me unable to reach the necessary hosts.
>
>As for -D_REENTRANT, I had already reported having verified in the "make" 
>output that it had been added automatically. Additionally, the docs say that 
>"-mt" *also* passes -D_REENTRANT to the preprocessor.
>
>-Paul
>
>On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>Paul,
>
>could you please make sure configure added "-D_REENTRANT" to the CFLAGS ?
>/* otherwise, errno is a global variable instead of a per thread variable, 
>which can explain some weird behaviour. note this should have been already 
>fixed */
>
>assuming -D_REENTRANT is set, could you please give the attached patch a try ?
>i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing 
>error message e.g. failed: Error 0 (0)
>
>FWIW, master is also affected.
>
>Cheers,
>
>Gilles
>
>On 2014/12/16 10:47, Paul Hargrove wrote:
>
>I have tried with a oob_tcp_if_include setting so that there is now only 1 
>interface. Even with just one interface and -mt=yes in both LDFLAGS and 
>wrapper-ldflags I am *still* getting messages like
>
>    [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
>    ------------------------------------------------------------
>    A process or daemon was unable to complete a TCP connection
>    to another process:
>      Local host:    pcp-j-20
>      Remote host:   172.16.0.120
>    This is usually caused by a firewall on the remote host. Please
>    check that any firewall (e.g., iptables) has been disabled and
>    try again.
>    ------------------------------------------------------------
>
>I am getting less certain that my speculation about thread-safe libs is 
>correct.
>
>-Paul
>
>On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>A little more reading finds that...
>
>Docs say that one needs "-mt" without the "=yes". That will work for both 
>old and new compilers, where "-mt=yes" chokes older ones.
>
>Also, man pages say "-mt" must come before "-lpthread" in the link command.
>
>-Paul
>
>On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the 
>multi-threaded C libraries, apparently need "-mt=yes" in both compile and 
>link. Need someone to investigate.
>
>The lack of multi-thread libraries is my SPECULATION.
>
>The fact that configuring with LDFLAGS=-mt=yes did not help may or may not 
>prove anything. I didn't see them in "mpicc -show" and so maybe they needed 
>to be in wrapper-ldflags instead.
>
>My time this week is quite limited, but I can "fire and forget" tests of any 
>tarballs you provide.
>
>-Paul
>
>-- 
>Paul H. Hargrove                          phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department               Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>
>
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/12/16615.php
>
>
>
>-- 
>
>Paul H. Hargrove                          phhargr...@lbl.gov
>
>Computer Languages & Systems Software (CLaSS) Group
>
>Computer Science Department               Tel: +1-510-495-2352
>
>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
