All right - I’ll surrender and remove the timeout. Will release rc4 later tonight.
Sorry for putting you thru this, Paul - for some reason, these problems aren’t showing up elsewhere.

> On Dec 12, 2014, at 3:37 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Aha! You are the first to fall thru the timeout. How interesting.
>
> When it comes to the release candidates, I seem to own a lot of "firsts".
> It is not as fun as one might imagine :-).
>
> Can you please try adding “-mca oob_tcp_connect_timeout 5:0”?
>
> That appeared to produce a timeout of about 5 SECONDS ("time mpirun" reports
> 5.8s elapsed). Was that really the intent? No difference if I change "5:0"
> to "5:00". So, you might have an "extra" bug lurking there.
>
> New stderr attached for
> $ mpirun -mca oob_tcp_if_include bge0 -mca oob_tcp_connect_timeout 5:0 -mca
> oob_base_verbose 20 -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20
> examples/ring_c
>
> Assuming "5:0" was intended to get a 5 MINUTE timeout, I also tried "-mca
> oob_tcp_connect_timeout 300", and have also attached the resulting stderr.
>
> No joy for either timeout value.
>
> -Paul
>
> On Dec 12, 2014, at 8:53 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> First, I want to ask what became of the issue discussed in this thread?
>> http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
>> I thought we had concluded that one just needed -D_REENTRANT.
>> I mention that only for completeness, because I think my current problem is
>> different.
>>
>> The following works fine with 1.8.3, making the current behavior a
>> regression.
>>
>> I am still on the same system as that previous report, and still/again see a
>> message like the following:
>>
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>>   Local host:    pcp-j-19
>>   Remote host:   172.18.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>> [...etc...]
>>
>> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the
>> address 172.18.0.120 are on different subnets.
>>
>> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at
>> configure time (I didn't bother to check if it is there by default now or not).
>>
>> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only
>> the 172.16.0.120 subnet.
>> IN FACT, the message is the same with that option, other than "172.18"
>> changing to "172.16".
>>
>> I've attached the output generated by "-mca oob_base_verbose 20" both with
>> and without the oob_tcp_if_include.
>>
>> I should also note that the following is my full mpirun command, which
>> excludes the tcp BTL.
>> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca
>> btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> <stdout-inc.txt><stderr-2if.txt>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16551.php
>
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16561.php
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> <stderr-inc-5_0.txt><stderr-inc-300.txt>
>
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16565.php
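
Editorial aside on the "different subnets" remark in the quoted report: the thread never states the netmask on these hosts, so the sketch below simply assumes a /16 prefix to show how one might check whether two of the reported addresses share a network. The helper name `same_subnet` is hypothetical and is not part of Open MPI; only Python's standard `ipaddress` module is used.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """Return True if addr_b lies in the network that contains addr_a
    when addr_a is masked to prefix_len bits."""
    # strict=False lets us pass a host address and have it masked down
    # to its network address instead of raising ValueError.
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


if __name__ == "__main__":
    # Addresses are taken from the report; the /16 prefix is an ASSUMPTION.
    print(same_subnet("172.16.0.119", "172.18.0.120", 16))
    print(same_subnet("172.16.0.119", "172.16.0.120", 16))
```

Under a wider mask the answer changes (with a /12 prefix both pairs would share a network), so the real interface netmasks on pcp-j-19 and pcp-j-20 would need to be checked before drawing conclusions.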