Short version: v1.8 nightly (v1.8.3-313-g54c80c2) PASSED my testing.
In full: I gave openmpi-v1.8.3-313-g54c80c2 a try.
In this test I did not add -D_REENTRANT or -mt to any flags at configure time.
In addition to --prefix, I passed the following:

    --enable-debug --with-verbs \
    CC=cc CXX=CC FC=f90 \
    CFLAGS=-m64 --with-wrapper-cflags=-m64 \
    FCFLAGS=-m64 --with-wrapper-fcflags=-m64 \
    CXXFLAGS='-m64 -library=stlport4' --with-wrapper-cxxflags='-m64 -library=stlport4'

So, this was essentially an "out of the box" build with the configure options needed for the compilers and ABI I want.
They are the same options I have used successfully with 1.8.3.
So, I believe the regression I had observed relative to 1.8.3 has been resolved.

I am going to run the nightly on other configurations on both my Solaris-11/x86-64 and Solaris-10/SPARC systems.
I just want to be sure no other compiler/ABI/arch combination was broken by accident.
I will post my results to the list (probably around lunchtime Thursday, California time).

-Paul

On Wed, Dec 17, 2014 at 2:54 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> Paul --
>
> The __sun macro check is now in the OMPI 1.8 tree, and is in the latest
> nightly tarball.
>
> If I'm following this thread right -- and I might not be! -- I think
> Gilles is saying that now that the __sun check is in, it should fix this
> -mt/-D_REENTRANT/whatever problem.
>
> Can you confirm?
>
>
> On Dec 16, 2014, at 1:55 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> > Gilles,
> >
> > I am running mpirun on a host that ALSO will run one of the application processes.
> > Requested ifconfig and netstat outputs appear below.
> >
> > -Paul
> >
> > [phargrov@pcp-j-20 ~]$ ifconfig -a
> > lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
> >         inet 127.0.0.1 netmask ff000000
> > bge0: flags=1004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4> mtu 1500 index 2
> >         inet 172.16.0.120 netmask ffff0000 broadcast 172.16.255.255
> > pFFFF.ibp0: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 2044 index 3
> >         inet 172.18.0.120 netmask ffff0000 broadcast 172.18.255.255
> > lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
> >         inet6 ::1/128
> > bge0: flags=20002004841<UP,RUNNING,MULTICAST,DHCP,IPv6> mtu 1500 index 2
> >         inet6 fe80::250:45ff:fe5c:2b0/10
> > [phargrov@pcp-j-20 ~]$ netstat -nr
> >
> > Routing Table: IPv4
> >   Destination           Gateway           Flags  Ref     Use     Interface
> > -------------------- -------------------- ----- ----- ---------- ---------
> > default              172.16.254.1         UG        2     158463 bge0
> > 127.0.0.1            127.0.0.1            UH        5     398913 lo0
> > 172.16.0.0           172.16.0.120         U         4  135241319 bge0
> > 172.18.0.0           172.18.0.120         U         3         26 pFFFF.ibp0
> >
> > Routing Table: IPv6
> >   Destination/Mask            Gateway                   Flags Ref   Use    If
> > --------------------------- --------------------------- ----- --- ------- -----
> > ::1                         ::1                         UH      2       0 lo0
> > fe80::/10                   fe80::250:45ff:fe5c:2b0     U       2       0 bge0
> >
> > On Tue, Dec 16, 2014 at 2:55 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> > Paul,
> >
> > Could you please send the output of
> > ifconfig -a
> > netstat -nr
> >
> > on the three hosts you are using
> > (I assume you are still invoking mpirun from one node, and tasks are running on two other nodes)
> >
> > Cheers,
> >
> > Gilles
> >
> >
> > On 2014/12/16 16:00, Paul Hargrove wrote:
> >> Gilles,
> >>
> >> I looked again carefully and I am *NOT* finding -D_REENTRANT passed to most
> >> compilations.
> >> It appears to be used for building libevent and vt, but nothing else.
> >> The output from configure contains
> >>
> >> checking if more special flags are required for pthreads... -D_REENTRANT
> >>
> >> only in the libevent and vt sub-configure portions.
> >>
> >> When configured for gcc on Solaris-11 I see the following in configure
> >>
> >> checking for C optimization flags... -m64 -D_REENTRANT -g -finline-functions -fno-strict-aliasing
> >>
> >> but with CC=cc the equivalent line is
> >>
> >> checking for C optimization flags... -m64 -g
> >>
> >> In both cases the "-m64" is from the CFLAGS I passed to configure.
> >>
> >> However, when I use CFLAGS="-m64 -D_REENTRANT" the problem DOES NOT go away.
> >> I see
> >>
> >> [pcp-j-20:24740] mca_oob_tcp_accept: accept() failed: Error 0 (11).
> >> ------------------------------------------------------------
> >> A process or daemon was unable to complete a TCP connection
> >> to another process:
> >> Local host: pcp-j-20
> >> Remote host: 172.18.0.120
> >> This is usually caused by a firewall on the remote host. Please
> >> check that any firewall (e.g., iptables) has been disabled and
> >> try again.
> >> ------------------------------------------------------------
> >>
> >> which at least appears to have a non-zero errno.
> >> A quick grep through /usr/include/sys/errno shows 11 is EAGAIN.
> >>
> >> With the oob.patch you provided the failed accept goes away, BUT the
> >> connection still fails:
> >>
> >> ------------------------------------------------------------
> >> A process or daemon was unable to complete a TCP connection
> >> to another process:
> >> Local host: pcp-j-20
> >> Remote host: 172.18.0.120
> >> This is usually caused by a firewall on the remote host. Please
> >> check that any firewall (e.g., iptables) has been disabled and
> >> try again.
> >> ------------------------------------------------------------
> >>
> >>
> >> Use of "-mca oob_tcp_if_include bge0" to use a single interface did not fix
> >> this.
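Paul's errno 11 (EAGAIN) and Gilles's suspicion that the CLOSE_THE_SOCKET macro overwrites errno both concern how a failing accept() gets reported. The C sketch below, assuming a POSIX system and a non-blocking listening socket, illustrates two general rules. First, save errno immediately after the failing call, before any cleanup such as close() can overwrite it. Second, treat EAGAIN/EWOULDBLOCK (and EINTR) as "nothing pending yet, retry" rather than as a hard error. The helpers try_accept(), close_preserving_errno() and make_nonblocking_listener() are hypothetical names used only for illustration. This is not Open MPI's mca_oob_tcp_accept() code, nor the oob.patch discussed in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: close a descriptor without letting close()
 * overwrite the errno value the caller still needs to report. */
static void close_preserving_errno(int fd)
{
    int saved = errno;
    (void)close(fd);
    errno = saved;
}

/* Hypothetical helper, not Open MPI's mca_oob_tcp_accept().
 * Returns a connected descriptor (>= 0); -1 if no connection is pending
 * yet (EAGAIN/EWOULDBLOCK) or the call was interrupted (EINTR), so the
 * caller should simply retry later; or -2 on a real failure, in which
 * case *err_out receives the errno saved right after accept(). */
static int try_accept(int listen_fd, int *err_out)
{
    assert(err_out != NULL);
    int fd = accept(listen_fd, NULL, NULL);
    if (fd >= 0) {
        return fd;
    }
    int saved = errno;  /* capture before any other call can change it */
    if (saved == EAGAIN || saved == EWOULDBLOCK || saved == EINTR) {
        return -1;
    }
    *err_out = saved;
    return -2;
}

/* Hypothetical fixture: a non-blocking listener on 127.0.0.1 with a
 * kernel-chosen port. Returns the descriptor, or -1 on failure. */
static int make_nonblocking_listener(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 8) < 0) {
        close_preserving_errno(fd);
        return -1;
    }
    return fd;
}
```

For example, on a freshly created non-blocking listener with no client, try_accept() returns -1 (retry later) instead of a reportable error. Calling it on an invalid descriptor returns -2 with the saved errno set to EBADF. close_preserving_errno(-1) leaves a previously set errno untouched, even though close(-1) itself fails.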
> >>
> >>
> >> -Paul
> >>
> >> On Mon, Dec 15, 2014 at 7:18 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> >>
> >>> Gilles,
> >>>
> >>> I am NOT seeing the problem with gcc.
> >>> It is only occurring with the Studio compilers.
> >>>
> >>> As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
> >>> both LDFLAGS and --with-wrapper-ldflags.
> >>>
> >>> The "cc" manpage (on the Solaris-10 system I can get to right now) says:
> >>>
> >>>      -mt  Compile and link for multithreaded code.
> >>>
> >>>           This option passes -D_REENTRANT to the preprocessor and
> >>>           passes -lthread in the correct order to ld.
> >>>
> >>>           The -mt option is required if the application or
> >>>           libraries are multithreaded.
> >>>
> >>>           To ensure proper library linking order, you must use
> >>>           this option, rather than -lthread, to link with libthread.
> >>>
> >>>           If you are using POSIX threads, you must link with the
> >>>           options -mt -lpthread. The -mt option is necessary
> >>>           because libC and libCrun need libthread for a
> >>>           multithreaded application.
> >>>
> >>>           If you compile and link in separate steps and you
> >>>           compile with -mt, you might get unexpected results. If you
> >>>           compile one translation unit with -mt, compile all
> >>>           units of the program with -mt.
> >>>
> >>> I cannot connect to my Solaris-11 system right now, but I recall the text
> >>> to be quite similar.
> >>>
> >>> -Paul
> >>>
> >>> On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> >>>
> >>>> Paul,
> >>>>
> >>>> Did you manually set -mt?
> >>>>
> >>>> If I remember correctly, Solaris 11 (at least with the gcc compilers) does not
> >>>> need any flags
> >>>> (except the -D_REENTRANT that is added automatically).
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>>
> >>>> On 2014/12/16 12:10, Paul Hargrove wrote:
> >>>>
> >>>> Gilles,
> >>>>
> >>>> I will try the patch when I can.
> >>>> However, our network is undergoing maintenance right now, leaving
> >>>> me unable to reach the necessary hosts.
> >>>>
> >>>> As for -D_REENTRANT, I had already reported having verified in the "make"
> >>>> output that it had been added automatically.
> >>>>
> >>>> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
> >>>> preprocessor.
> >>>>
> >>>> -Paul
> >>>>
> >>>> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> >>>>
> >>>>
> >>>> Paul,
> >>>>
> >>>> Could you please make sure configure added "-D_REENTRANT" to the CFLAGS?
> >>>> /* Otherwise, errno is a global variable instead of a per-thread variable,
> >>>> which can explain some weird behaviour. Note this should already have been fixed. */
> >>>>
> >>>> Assuming -D_REENTRANT is set, could you please give the attached patch a
> >>>> try?
> >>>>
> >>>> I suspect the CLOSE_THE_SOCKET macro resets errno, hence the confusing
> >>>> error message, e.g. "failed: Error 0 (0)".
> >>>>
> >>>> FWIW, master is also affected.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>>
> >>>> On 2014/12/16 10:47, Paul Hargrove wrote:
> >>>>
> >>>> I have tried with an oob_tcp_if_include setting so that there is now only 1
> >>>> interface.
> >>>> Even with just one interface and -mt=yes in both LDFLAGS and
> >>>> wrapper-ldflags I am *still* getting messages like
> >>>>
> >>>> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> >>>> ------------------------------------------------------------
> >>>> A process or daemon was unable to complete a TCP connection
> >>>> to another process:
> >>>> Local host: pcp-j-20
> >>>> Remote host: 172.16.0.120
> >>>> This is usually caused by a firewall on the remote host. Please
> >>>> check that any firewall (e.g., iptables) has been disabled and
> >>>> try again.
> >>>> ------------------------------------------------------------
> >>>>
> >>>>
> >>>> I am getting less certain that my speculation about thread-safe libs is
> >>>> correct.
> >>>>
> >>>> -Paul
> >>>>
> >>>> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> >>>>
> >>>> A little more reading finds that...
> >>>>
> >>>> The docs say that one needs "-mt" without the "=yes".
> >>>> That works for both old and new compilers, whereas "-mt=yes" chokes
> >>>> older ones.
> >>>>
> >>>> Also, the man pages say "-mt" must come before "-lpthread" in the link command.
> >>>>
> >>>> -Paul
> >>>>
> >>>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> >>>>
> >>>>
> >>>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
> >>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
> >>>> link. Need someone to investigate.
> >>>>
> >>>>
> >>>> The lack of multi-thread libraries is my SPECULATION.
> >>>>
> >>>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
> >>>> not prove anything.
> >>>> I didn't see them in "mpicc -show", so maybe they needed to be in
> >>>> wrapper-ldflags instead.
> >>>> My time this week is quite limited, but I can "fire and forget" tests of
> >>>> any tarballs you provide.
> >>>>
> >>>> -Paul
> >>>>
> >>>> --
> >>>> Paul H. Hargrove                          phhargr...@lbl.gov
> >>>> Computer Languages & Systems Software (CLaSS) Group
> >>>> Computer Science Department               Tel: +1-510-495-2352
> >>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
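Gilles's explanation in this thread is that, without -D_REENTRANT, errno can end up as one process-wide variable rather than a per-thread one. In that case one thread's system call could overwrite the errno another thread is about to report. The C sketch below is a hypothetical runtime check of that property, assuming POSIX threads. It compares the address of errno in the calling thread with its address in a second thread while both threads are alive. It is not part of Open MPI; struct errno_probe and errno_is_per_thread() are names invented for illustration. Per the Studio cc manpage quoted in the thread, a test like this would be built with -mt (plus -lpthread for POSIX threads); with gcc, use -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical probe shared between the calling thread and a worker. */
struct errno_probe {
    int *caller_errno_addr;  /* &errno as seen by the calling thread */
    int distinct;            /* set by the worker: 1 if its &errno differs */
};

static void *compare_errno_addr(void *arg)
{
    struct errno_probe *probe = arg;
    /* Both threads are alive here, so comparing the addresses is valid. */
    probe->distinct = (&errno != probe->caller_errno_addr);
    return NULL;
}

/* Returns 1 if errno is per-thread storage, 0 if the two threads share a
 * single errno, or -1 if the worker thread could not be created/joined. */
static int errno_is_per_thread(void)
{
    struct errno_probe probe = { &errno, -1 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, compare_errno_addr, &probe) != 0) {
        return -1;
    }
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    /* pthread_join() orders the worker's write before this read. */
    assert(probe.distinct == 0 || probe.distinct == 1);
    return probe.distinct;
}
```

A result of 1 means each thread has its own errno. A result of 0 would indicate the shared-global situation Gilles describes.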