Paul,

could you please send the output of ifconfig -a and netstat -nr
on the three hosts you are using
(i assume you are still invoking mpirun from one node, and tasks are running on two other nodes)

Cheers,

Gilles

On 2014/12/16 16:00, Paul Hargrove wrote:
> Gilles,
>
> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most
> compilations.
> It appears to be used for building libevent and vt, but nothing else.
> The output from configure contains
>
> checking if more special flags are required for pthreads... -D_REENTRANT
>
> only in the libevent and vt sub-configure portions.
>
> When configured for gcc on Solaris-11 I see the following in configure
>
> checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
>
> but with CC=cc the equivalent line is
>
> checking for C optimization flags... -m64 -g
>
> In both cases the "-m64" is from the CFLAGS I have passed to configure.
>
> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away.
> I see
>
> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
> Local host: pcp-j-20
> Remote host: 172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> which at least appears to have a non-zero errno.
> A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
>
> With the oob.patch you provided the failed accept goes away, BUT the
> connection still fails:
>
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
> Local host: pcp-j-20
> Remote host: 172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
>
> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix
> this.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Gilles,
>>
>> I am NOT seeing the problem with gcc.
>> It is only occurring with the Studio compilers.
>>
>> As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
>> both LDFLAGS and --with-wrapper-ldflags.
>>
>> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>>
>> -mt  Compile and link for multithreaded code.
>>
>>      This option passes -D_REENTRANT to the preprocessor and
>>      passes -lthread in the correct order to ld.
>>
>>      The -mt option is required if the application or
>>      libraries are multithreaded.
>>
>>      To ensure proper library linking order, you must use
>>      this option, rather than -lthread, to link with
>>      libthread.
>>
>>      If you are using POSIX threads, you must link with the
>>      options -mt -lpthread. The -mt option is necessary
>>      because libC and libCrun need libthread for a
>>      multithreaded application.
>>
>>      If you compile and link in separate steps and you
>>      compile with -mt, you might get unexpected results. If you
>>      compile one translation unit with -mt, compile all
>>      units of the program with -mt.
>>
>> I cannot connect to my Solaris-11 system right now, but I recall the text
>> to be quite similar.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Paul,
>>>
>>> did you manually set -mt ?
>>>
>>> if i remember correctly, solaris 11 (at least with gcc compilers) does not
>>> need any flags
>>> (except the -D_REENTRANT that is added automatically)
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/12/16 12:10, Paul Hargrove wrote:
>>>
>>> Gilles,
>>>
>>> I will try the patch when I can.
>>> However, our network is undergoing maintenance right now, leaving
>>> me unable to reach the necessary hosts.
>>>
>>> As for -D_REENTRANT, I had already reported having verified in the "make"
>>> output that it had been added automatically.
>>>
>>> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
>>> preprocessor.
>>>
>>> -Paul
>>>
>>> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>> Paul,
>>>
>>> could you please make sure configure added "-D_REENTRANT" to the CFLAGS ?
>>> /* otherwise, errno is a global variable instead of a per thread variable,
>>> which can explain some weird behaviour. note this should have been already fixed */
>>>
>>> assuming -D_REENTRANT is set, could you please give the attached patch a
>>> try ?
>>>
>>> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
>>> error message
>>> e.g. failed: Error 0 (0)
>>>
>>> FWIW, master is also affected.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/12/16 10:47, Paul Hargrove wrote:
>>>
>>> I have tried with an oob_tcp_if_include setting so that there is now only 1
>>> interface.
>>> Even with just one interface and -mt=yes in both LDFLAGS and
>>> wrapper-ldflags I am *still* getting messages like
>>>
>>> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
>>> ------------------------------------------------------------
>>> A process or daemon was unable to complete a TCP connection
>>> to another process:
>>> Local host: pcp-j-20
>>> Remote host: 172.16.0.120
>>> This is usually caused by a firewall on the remote host. Please
>>> check that any firewall (e.g., iptables) has been disabled and
>>> try again.
>>> ------------------------------------------------------------
>>>
>>> I am getting less certain that my speculation about thread-safe libs is
>>> correct.
>>>
>>> -Paul
>>>
>>> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> A little more reading finds that...
>>>
>>> Docs say that one needs "-mt" without the "=yes".
>>> That will work for both old and new compilers, where "-mt=yes" chokes
>>> older ones.
>>>
>>> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>>>
>>> -Paul
>>>
>>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>>> link. Need someone to investigate.
>>>
>>> The lack of multi-thread libraries is my SPECULATION.
>>>
>>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>>> not prove anything.
>>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>>> wrapper-ldflags instead.
>>> My time this week is quite limited, but I can "fire and forget" tests of
>>> any tarballs you provide.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16608.php
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16610.php
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16611.php
>>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16613.php
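
For reference, a minimal C sketch of the errno concern raised in the quoted 6:07 PM message: if a cleanup step runs between a failed accept() and the error report, the report can show "Error 0 (0)" instead of the real cause. This is not Open MPI code. cleanup_socket is a hypothetical stand-in for a macro such as CLOSE_THE_SOCKET and clobbers errno on purpose, and both accept_errno_* function names are made up for illustration. accept() is called on descriptor -1 so that it fails with EBADF.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for a cleanup macro such as CLOSE_THE_SOCKET.
 * It sets errno to 0 on purpose, to mimic a cleanup path that clobbers it. */
void cleanup_socket(int fd)
{
    if (fd >= 0) {
        close(fd);
    }
    errno = 0;
}

/* Problematic order: errno is read after cleanup, so the cause is lost. */
int accept_errno_after_cleanup(void)
{
    int sd = accept(-1, NULL, NULL);   /* invalid descriptor: fails with EBADF */
    if (sd < 0) {
        cleanup_socket(sd);
        return errno;                  /* 0 here, the real code is gone */
    }
    close(sd);
    return 0;
}

/* Safer order: copy errno right after the failing call, then clean up. */
int accept_errno_saved_first(void)
{
    int sd = accept(-1, NULL, NULL);   /* invalid descriptor: fails with EBADF */
    if (sd < 0) {
        int saved = errno;             /* capture before anything else runs */
        cleanup_socket(sd);
        return saved;
    }
    close(sd);
    return 0;
}
```

Reporting the value from accept_errno_saved_first() would show the real failure code, whereas the first variant reports 0, which matches the shape of the confusing message but says nothing about whether this is the actual cause here.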