Thanks Paul! Sorry I was out all day - stuck in meetings, I fear.
On Wed, Dec 17, 2014 at 7:17 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Short version:
>
> v1.8 nightly (v1.8.3-313-g54c80c2) PASSED my testing.
>
> In full:
>
> I gave openmpi-v1.8.3-313-g54c80c2 a try.
> In this test I did not add -D_REENTRANT or -mt to any flags at configure
> time.
> In addition to --prefix, I passed the following:
>
>   --enable-debug --with-verbs \
>   CC=cc CXX=CC FC=f90 \
>   CFLAGS=-m64 --with-wrapper-cflags=-m64 \
>   FCFLAGS=-m64 --with-wrapper-fcflags=-m64 \
>   CXXFLAGS='-m64 -library=stlport4' --with-wrapper-cxxflags='-m64 -library=stlport4'
>
>
> So, this was essentially an "out of the box" build with the configure
> options needed for the compilers and ABI I desire.
> They are the same options I have used successfully with 1.8.3.
> So, I believe the regression I had observed relative to 1.8.3 has been
> resolved.
>
> I am going to run the nightly on other configs on both my
> Solaris-11/x86-64 and Solaris-10/SPARC systems.
> I just want to be sure some other compile/abi/arch combination didn't get
> broken by accident.
> I will post my results to the list (probably Thu lunch time in California).
>
> -Paul
>
> On Wed, Dec 17, 2014 at 2:54 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>> Paul --
>>
>> The __sun macro check is now in the OMPI 1.8 tree, and is in the latest
>> nightly tarball.
>>
>> If I'm following this thread right -- and I might not be! -- I think
>> Gilles is saying that now that the __sun check is in, it should fix this
>> -mt/-D_REENTRANT/whatever problem.
>>
>> Can you confirm?
>>
>>
>> On Dec 16, 2014, at 1:55 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> > Gilles,
>> >
>> > I am running mpirun on a host that ALSO will run one of the application processes.
>> > Requested ifconfig and netstat outputs appear below.
>> >
>> > -Paul
>> >
>> > [phargrov@pcp-j-20 ~]$ ifconfig -a
>> > lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>> >         inet 127.0.0.1 netmask ff000000
>> > bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
>> >         inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
>> > pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
>> >         inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
>> > lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
>> >         inet6 ::1/128
>> > bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
>> >         inet6 fe80::250:45ff:fe5c:2b0/10
>> > [phargrov@pcp-j-20 ~]$ netstat -nr
>> >
>> > Routing Table: IPv4
>> > Destination          Gateway              Flags   Ref        Use Interface
>> > -------------------- -------------------- ----- ----- ---------- ---------
>> > default              172.16.254.1         UG        2     158463 bge0
>> > 127.0.0.1            127.0.0.1            UH        5     398913 lo0
>> > 172.16.0.0           172.16.0.120         U         4  135241319 bge0
>> > 172.18.0.0           172.18.0.120         U         3         26 pFFFF.ibp0
>> >
>> > Routing Table: IPv6
>> > Destination/Mask            Gateway                     Flags Ref     Use If
>> > --------------------------- --------------------------- ----- --- ------- -----
>> > ::1                         ::1                         UH      2       0 lo0
>> > fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0 bge0
>> >
>> > On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> > Paul,
>> >
>> > could you please send the output of
>> > ifconfig -a
>> > netstat -nr
>> >
>> > on the three hosts you are using
>> > (i assume you are still invoking mpirun from one node, and tasks are running on two other nodes)
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> >
>> > On 2014/12/16 16:00, Paul Hargrove wrote:
>> >> Gilles,
>> >>
>> >> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most
>> >> compilations.
>> >> It appears to be used for building libevent and vt, but nothing else.
>> >> The output from configure contains
>> >>
>> >> checking if more special flags are required for pthreads... -D_REENTRANT
>> >>
>> >> only in the libevent and vt sub-configure portions.
>> >>
>> >> When configured for gcc on Solaris-11 I see the following in configure
>> >>
>> >> checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
>> >>
>> >> but with CC=cc the equivalent line is
>> >>
>> >> checking for C optimization flags... -m64 -g
>> >>
>> >> In both cases the "-m64" is from the CFLAGS I have passed to configure.
>> >>
>> >> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away.
>> >> I see
>> >>
>> >> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
>> >> ------------------------------------------------------------
>> >> A process or daemon was unable to complete a TCP connection
>> >> to another process:
>> >>   Local host:    pcp-j-20
>> >>   Remote host:   172.18.0.120
>> >> This is usually caused by a firewall on the remote host. Please
>> >> check that any firewall (e.g., iptables) has been disabled and
>> >> try again.
>> >> ------------------------------------------------------------
>> >>
>> >> which at least appears to have a non-zero errno.
>> >> A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
>> >>
>> >> With the oob.patch you provided the failed accept goes away, BUT the
>> >> connection still fails:
>> >>
>> >> ------------------------------------------------------------
>> >> A process or daemon was unable to complete a TCP connection
>> >> to another process:
>> >>   Local host:    pcp-j-20
>> >>   Remote host:   172.18.0.120
>> >> This is usually caused by a firewall on the remote host. Please
>> >> check that any firewall (e.g., iptables) has been disabled and
>> >> try again.
>> >> ------------------------------------------------------------
>> >>
>> >>
>> >> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix
>> >> this.
>> >>
>> >>
>> >> -Paul
>> >>
>> >> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> >>
>> >>> Gilles,
>> >>>
>> >>> I am NOT seeing the problem with gcc.
>> >>> It is only occurring with the Studio compilers.
>> >>>
>> >>> As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
>> >>> both LDFLAGS and --with-wrapper-ldflags.
>> >>>
>> >>> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
>> >>>
>> >>>      -mt  Compile and link for multithreaded code.
>> >>>
>> >>>           This option passes -D_REENTRANT to the preprocessor and
>> >>>           passes -lthread in the correct order to ld.
>> >>>
>> >>>           The -mt option is required if the application or
>> >>>           libraries are multithreaded.
>> >>>
>> >>>           To ensure proper library linking order, you must use
>> >>>           this option, rather than -lthread, to link with
>> >>>           libthread.
>> >>>
>> >>>           If you are using POSIX threads, you must link with the
>> >>>           options -mt -lpthread. The -mt option is necessary
>> >>>           because libC and libCrun need libthread for a
>> >>>           multithreaded application.
>> >>>
>> >>>           If you compile and link in separate steps and you
>> >>>           compile with -mt, you might get unexpected results. If you
>> >>>           compile one translation unit with -mt, compile all
>> >>>           units of the program with -mt.
>> >>>
>> >>> I cannot connect to my Solaris-11 system right now, but I recall the text
>> >>> to be quite similar.
>> >>>
>> >>> -Paul
>> >>>
>> >>> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> >>>
>> >>>
>> >>>> Paul,
>> >>>>
>> >>>> did you manually set -mt ?
>> >>>>
>> >>>> if i remember correctly, solaris 11 (at least with gcc compilers) does not
>> >>>> need any flags
>> >>>> (except the -D_REENTRANT that is added automatically)
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>>
>> >>>> On 2014/12/16 12:10, Paul Hargrove wrote:
>> >>>>
>> >>>> Gilles,
>> >>>>
>> >>>> I will try the patch when I can.
>> >>>> However, our network is undergoing network maintenance right now, leaving
>> >>>> me unable to reach the necessary hosts.
>> >>>>
>> >>>> As for -D_REENTRANT, I had already reported having verified in the "make"
>> >>>> output that it had been added automatically.
>> >>>>
>> >>>> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
>> >>>> preprocessor.
>> >>>>
>> >>>> -Paul
>> >>>>
>> >>>> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> >>>>
>> >>>>
>> >>>> Paul,
>> >>>>
>> >>>> could you please make sure configure added "-D_REENTRANT" to the CFLAGS ?
>> >>>> /* otherwise, errno is a global variable instead of a per thread variable,
>> >>>> which can explain some weird behaviour. note this should have been already fixed */
>> >>>>
>> >>>> assuming -D_REENTRANT is set, could you please give the attached patch a
>> >>>> try ?
>> >>>>
>> >>>> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
>> >>>> error message
>> >>>> e.g. failed: Error 0 (0)
>> >>>>
>> >>>> FWIW, master is also affected.
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>>
>> >>>> On 2014/12/16 10:47, Paul Hargrove wrote:
>> >>>>
>> >>>> I have tried with an oob_tcp_if_include setting so that there is now only 1
>> >>>> interface.
>> >>>> Even with just one interface and -mt=yes in both LDFLAGS and
>> >>>> wrapper-ldflags I am *still* getting messages like
>> >>>>
>> >>>> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
>> >>>> ------------------------------------------------------------
>> >>>> A process or daemon was unable to complete a TCP connection
>> >>>> to another process:
>> >>>>   Local host:    pcp-j-20
>> >>>>   Remote host:   172.16.0.120
>> >>>> This is usually caused by a firewall on the remote host. Please
>> >>>> check that any firewall (e.g., iptables) has been disabled and
>> >>>> try again.
>> >>>> ------------------------------------------------------------
>> >>>>
>> >>>>
>> >>>> I am getting less certain that my speculation about thread-safe libs is
>> >>>> correct.
>> >>>>
>> >>>> -Paul
>> >>>>
>> >>>> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> >>>>
>> >>>> A little more reading finds that...
>> >>>>
>> >>>> Docs say that one needs "-mt" without the "=yes".
>> >>>> That will work for both old and new compilers, where "-mt=yes" chokes
>> >>>> older ones.
>> >>>>
>> >>>> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>> >>>>
>> >>>> -Paul
>> >>>>
>> >>>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> >>>>
>> >>>>
>> >>>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >>>>
>> >>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>> >>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>> >>>> link. Need someone to investigate.
>> >>>>
>> >>>>
>> >>>> The lack of multi-thread libraries is my SPECULATION.
>> >>>>
>> >>>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>> >>>> not prove anything.
>> >>>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>> >>>> wrapper-ldflags instead.
>> >>>> My time this week is quite limited, but I can "fire and forget" tests of
>> >>>> any tarballs you provide.
>> >>>>
>> >>>> -Paul
>> >>>>
>> >>>> --
>> >>>> Paul H. Hargrove                          phhargr...@lbl.gov
>> >>>> Computer Languages & Systems Software (CLaSS) Group
>> >>>> Computer Science Department               Tel: +1-510-495-2352
>> >>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> >>>>
>> >>>> _______________________________________________
>> >>>> devel mailing list
>> >>>> de...@open-mpi.org
>> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
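
For anyone reading the archive later: the "Error 0 (0)" confusion discussed in the quoted thread comes down to errno being overwritten by cleanup code before the accept() failure is reported. The usual C idiom is to save errno before the cleanup and restore it afterwards. The following is only a minimal sketch of that idiom. It is NOT the Open MPI code and NOT the oob.patch Gilles attached; all function names are hypothetical, and an invalid descriptor (-1) stands in for the socket so the effect is reproducible.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not from Open MPI): close fd without
 * disturbing the errno left behind by an earlier failed call. */
void close_preserving_errno(int fd)
{
    int saved = errno;   /* remember the original error */
    (void) close(fd);    /* may overwrite errno, e.g. EBADF for a bad fd */
    errno = saved;       /* restore it so later reporting sees the real cause */
}

/* Simulates "a call failed with original_error, then we cleaned up naively".
 * Returns the errno that an error message printed afterwards would see. */
int errno_after_naive_close(int fd, int original_error)
{
    errno = original_error;
    (void) close(fd);
    return errno;
}

/* Same scenario, but the cleanup goes through the errno-preserving helper. */
int errno_after_careful_close(int fd, int original_error)
{
    errno = original_error;
    close_preserving_errno(fd);
    return errno;
}
```

Calling errno_after_naive_close(-1, EAGAIN) reports EBADF, because the close() of the invalid descriptor replaces the original error, whereas errno_after_careful_close(-1, EAGAIN) still reports EAGAIN. Whether CLOSE_THE_SOCKET clobbers errno in exactly this way is Gilles's suspicion in the thread, not something this sketch establishes.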