Thanks Paul,

Are you invoking mpirun on pcp-j-20? If yes, what does "getent hosts pcp-j-20" say?
BTW, did you try without -m64?

Does the following work?
ping/ssh 172.18.0.120

Honestly, this output makes very little sense to me, so I am asking for way too much info, hoping I can reproduce this issue or get a hint about what could possibly be going wrong.

Cheers,

Gilles

Paul Hargrove <phhargr...@lbl.gov> wrote:
> Gilles,
>
> I am running mpirun on a host that ALSO will run one of the application processes.
>
> Requested ifconfig and netstat outputs appear below.
>
> -Paul
>
> [phargrov@pcp-j-20 ~]$ ifconfig -a
> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>         inet 127.0.0.1 netmask ff000000
> bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>         inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
> pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
>         inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
> lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
>         inet6 ::1/128
> bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>         inet6 fe80::250:45ff:fe5c:2b0/10
>
> [phargrov@pcp-j-20 ~]$ netstat -nr
>
> Routing Table: IPv4
>   Destination           Gateway           Flags  Ref     Use     Interface
> -------------------- -------------------- ----- ----- ---------- ---------
> default              172.16.254.1         UG        2     158463 bge0
> 127.0.0.1            127.0.0.1            UH        5     398913 lo0
> 172.16.0.0           172.16.0.120         U         4  135241319 bge0
> 172.18.0.0           172.18.0.120         U         3         26 pFFFF.ibp0
>
> Routing Table: IPv6
>   Destination/Mask            Gateway                   Flags Ref   Use    If
> --------------------------- --------------------------- ----- --- ------- -----
> ::1                         ::1                         UH      2       0 lo0
> fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0 bge0
>
> On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
> Paul,
>
> could you please send the output of
> ifconfig -a
> netstat -nr
> on the three hosts you are using
> (i assume you are still invoking mpirun from
> one node, and tasks are running on two other nodes)
>
> Cheers,
>
> Gilles
>
> On 2014/12/16 16:00, Paul Hargrove wrote:
>
> Gilles,
>
> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most compilations. It appears to be used for building libevent and vt, but nothing else. The output from configure contains
>     checking if more special flags are required for pthreads... -D_REENTRANT
> only in the libevent and vt sub-configure portions.
>
> When configured for gcc on Solaris-11 I see the following in configure
>     checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
> but with CC=cc the equivalent line is
>     checking for C optimization flags... -m64 -g
>
> In both cases the "-m64" is from the CFLAGS I have passed to configure.
>
> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away. I see
>
> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> which at least appears to have a non-zero errno. A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
>
> With the oob.patch you provided the failed accept goes away, BUT the connection still fails:
>
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix this.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Gilles,
>
> I am NOT seeing the problem with gcc. It is only occurring with the Studio compilers. As I've already reported, I have tried adding either "-mt" or "-mt=yes" to both LDFLAGS and --with-wrapper-ldflags.
>
> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>
>     -mt  Compile and link for multithreaded code.
>
>          This option passes -D_REENTRANT to the preprocessor and passes
>          -lthread in the correct order to ld.
>
>          The -mt option is required if the application or libraries are
>          multithreaded.
>
>          To ensure proper library linking order, you must use this option,
>          rather than -lthread, to link with libthread.
>
>          If you are using POSIX threads, you must link with the options
>          -mt -lpthread. The -mt option is necessary because libC and
>          libCrun need libthread for a multithreaded application.
>
>          If you compile and link in separate steps and you compile with
>          -mt, you might get unexpected results. If you compile one
>          translation unit with -mt, compile all units of the program
>          with -mt.
>
> I cannot connect to my Solaris-11 system right now, but I recall the text to be quite similar.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
> Paul,
>
> did you manually set -mt ?
>
> if i remember correctly, solaris 11 (at least with gcc compilers) do not need any flags (except the -D_REENTRANT that is added automatically)
>
> Cheers,
>
> Gilles
>
> On 2014/12/16 12:10, Paul Hargrove wrote:
>
> Gilles,
>
> I will try the patch when I can. However, our network is undergoing network maintenance right now, leaving me unable to reach the necessary hosts.
>
> As for -D_REENTRANT, I had already reported having verified in the "make" output that it had been added automatically.
> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the preprocessor.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
> Paul,
>
> could you please make sure configure added "-D_REENTRANT" to the CFLAGS ?
> /* otherwise, errno is a global variable instead of a per thread variable, which can explain some weird behaviour. note this should have been already fixed */
>
> assuming -D_REENTRANT is set, could you please give the attached patch a try ?
>
> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing error message
> e.g. failed: Error 0 (0)
>
> FWIW, master is also affected.
>
> Cheers,
>
> Gilles
>
> On 2014/12/16 10:47, Paul Hargrove wrote:
>
> I have tried with a oob_tcp_if_include setting so that there is now only 1 interface. Even with just one interface and -mt=yes in both LDFLAGS and wrapper-ldflags I am *still* getting messages like
>
> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-20
>   Remote host:   172.16.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> I am getting less certain that my speculation about thread-safe libs is correct.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> A little more reading finds that...
>
> Docs says that one needs "-mt" without the "=yes". That will work for both old and new compilers, where "-mt=yes" chokes older ones.
>
> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the multi-threaded C libraries, apparently need "-mt=yes" in both compile and link. Need someone to investigate.
>
> The lack of multi-thread libraries is my SPECULATION.
>
> The fact that configuring with LDFLAGS=-mt=yes did not help may or may not prove anything. I didn't see them in "mpicc -show" and so maybe they needed to be in wrapper-ldflags instead.
>
> My time this week is quite limited, but I can "fire and forget" tests of any tarballs you provide.
>
> -Paul
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
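
The suspicion quoted above, that a cleanup step such as the CLOSE_THE_SOCKET macro resets errno before the error is reported, would explain a message like "failed: Error 0 (0)". What follows is only a minimal C sketch of that effect, not the actual Open MPI code. The names cleanup_socket(), accept_error_read_late() and accept_error_saved() are hypothetical, the explicit "errno = 0" merely simulates the suspected reset, and accept() is called on an invalid descriptor (-1) simply to force a failure (so the code seen is EBADF, not the EAGAIN from the thread).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for a cleanup macro like CLOSE_THE_SOCKET.
 * The explicit "errno = 0" simulates the suspected side effect; it is
 * an assumption for illustration, not a copy of the real macro. */
static void cleanup_socket(int fd)
{
    if (fd >= 0) {
        close(fd);
    }
    errno = 0;
}

/* Fragile pattern: errno is read only after cleanup has run.
 * Returns the error code that would end up in the log message. */
static int accept_error_read_late(int listen_fd)
{
    int sd = accept(listen_fd, NULL, NULL);
    assert(sd < 0);           /* this sketch only covers the failure path */
    cleanup_socket(sd);
    return errno;             /* already overwritten by the cleanup step */
}

/* Safer pattern: copy errno immediately after the failing call,
 * before any other function has a chance to modify it. */
static int accept_error_saved(int listen_fd)
{
    int sd = accept(listen_fd, NULL, NULL);
    int saved_errno = errno;  /* capture before anything else runs */
    assert(sd < 0);
    cleanup_socket(sd);
    return saved_errno;
}
```

With an invalid descriptor, accept_error_read_late(-1) returns 0 while accept_error_saved(-1) returns EBADF, which is why copying errno right after the failing call is the usual defensive pattern when cleanup code runs before the error is logged.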