Ralph,

Does some part of the "timer that is firing to indicate a failed connection attempt" theory explain the case of singletons hanging? I'm just bringing this up in case you might be looking in the wrong direction.
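To make sure I'm reading the theory right, here is a toy sketch (purely hypothetical, not OMPI source; all names are invented) of the pattern the backtraces below suggest: the app thread spin-waits in nanosleep() inside orte_routed_base_register_sync() for a sync ack that a progress-thread callback should deliver, and if a "connection failed" timer wins the race, the flag is never set and the wait never ends:

/* Toy model of the suspected race (hypothetical; not OMPI code).
 * Thread 2 in the traces spin-waits in nanosleep(); Thread 1 is the
 * progress thread in the libevent loop.  If a spurious "failed
 * connection" timer fires first and tears the connection down, the
 * ack is dropped and the spin-wait below never finishes.
 * Compile with: cc -o race race.c -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Plain volatile flags for illustration only; real code would need
 * proper synchronization (atomics or a mutex/condvar). */
static volatile bool sync_received = false;  /* set when the ack arrives */
static volatile bool conn_torn_down = false; /* set by the timeout timer */

static void timeout_callback(void) {
    /* Timer claims the connection attempt failed. */
    conn_torn_down = true;
}

static void ack_callback(void) {
    /* The ack is discarded if the connection was already torn down. */
    if (!conn_torn_down)
        sync_received = true;
}

/* Stand-in for the progress thread: here the timer wins the race. */
static void *progress_thread(void *arg) {
    (void)arg;
    timeout_callback();  /* spurious "connection failed" */
    ack_callback();      /* too late: the ack is dropped  */
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);

    /* Spin-wait mirroring the nanosleep() frames in the traces;
     * bounded here only so the demo terminates on its own. */
    struct timespec ts = { 0, 1000000 }; /* 1 ms polling interval */
    for (int i = 0; i < 3000 && !sync_received; i++)
        nanosleep(&ts, NULL);

    pthread_join(tid, NULL);
    puts(sync_received ? "sync completed" : "HUNG: ack never arrived");
    return 0;
}

If that is roughly the mechanism, I'd expect the singleton and -np 2 cases to hang identically, since both go through the same register_sync path (frame #2 in both traces below).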
-Paul

On Fri, Dec 20, 2013 at 4:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

> This is the same problem Jeff and I are looking at on Solaris - it
> requires a slow machine to make it appear. I'm investigating and think I
> know where the issue might lie (a timer that is firing to indicate a
> failed connection attempt and causing a race condition).
>
>
> On Dec 20, 2013, at 4:02 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> FWIW:
> I've confirmed that this is a REGRESSION relative to 1.7.2, which works
> fine on OpenBSD-5.
>
> I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in
> the past 24 hours.
> I am going to try back-porting the fix(es) to see if 1.7.3 works or not.
>
> -Paul
>
>
> On Fri, Dec 20, 2013 at 3:16 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Below is the backtrace again, this time configured w/ --enable-debug and
>> for all threads.
>> -Paul
>>
>> Thread 2 (thread 1021110):
>> #0  0x00001bc0ef6c5e3a in nanosleep () at <stdin>:2
>> #1  0x00001bc0f317c2d4 in nanosleep (rqtp=0x7f7ffffbc900, rmtp=0x0)
>>     at /usr/src/lib/librthread/rthread_cancel.c:274
>> #2  0x00001bc0f2cd4621 in orte_routed_base_register_sync (setup=true)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/routed/base/routed_base_fns.c:344
>> #3  0x00001bc0efc5d602 in init_routes (job=3563782145, ndat=0x0)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/routed/binomial/routed_binomial.c:705
>> #4  0x00001bc0f2c9c832 in orte_ess_base_app_setup (db_restrict_local=true)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/ess/base/ess_base_std_app.c:233
>> #5  0x00001bc0f39ea9ec in rte_init ()
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/mca/ess/env/ess_env_module.c:146
>> #6  0x00001bc0f2c68764 in orte_init (pargc=0x0, pargv=0x0, flags=32)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:158
>> #7  0x00001bc0f75061c5 in ompi_mpi_init (argc=1, argv=0x7f7ffffbced0,
>>     requested=0, provided=0x7f7ffffbce38)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:451
>> #8  0x00001bc0f7544b96 in PMPI_Init (argc=0x7f7ffffbce6c,
>>     argv=0x7f7ffffbce60) at pinit.c:84
>> #9  0x00001bbeec701093 in main (argc=1, argv=0x7f7ffffbced0)
>>     at ring_c.c:19
>> Current language: auto; currently asm
>>
>> Thread 1 (thread 1023703):
>> #0  0x00001bc0ef6d68fa in poll () at <stdin>:2
>> #1  0x00001bc0f317c0fd in poll (fds=0x1bc0f9482d00, nfds=2, timeout=-1)
>>     at /usr/src/lib/librthread/rthread_cancel.c:331
>> #2  0x00001bc0eebf47a8 in poll_dispatch (base=0x1bc0f5987400, tv=0x0)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>> #3  0x00001bc0eebe8314 in opal_libevent2021_event_base_loop
>>     (base=0x1bc0f5987400, flags=1)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>> #4  0x00001bc0f2c68855 in orte_progress_thread_engine (obj=0x1bc0f310e160)
>>     at /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>> #5  0x00001bc0f317911e in _rthread_start (v=Variable "v" is not available.)
>>     at /usr/src/lib/librthread/rthread.c:122
>> #6  0x00001bc0ef6c003b in __tfork_thread ()
>>     at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:75
>> Cannot access memory at address 0x1bc0f857c000
>>
>>
>>
>> On Fri, Dec 20, 2013 at 2:48 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Brian,
>>>
>>> Of course, I should have thought of that myself.
>>> See below for a backtrace from a singleton run.
>>>
>>> I'm starting an --enable-debug build to maybe get some line number info
>>> too.
>>>
>>> -Paul
>>>
>>> (gdb) where
>>> #0  0x00000406457a9e3a in nanosleep () at <stdin>:2
>>> #1  0x000004063947e2d4 in nanosleep (rqtp=0x7f7ffffeca30, rmtp=0x0)
>>>     at /usr/src/lib/librthread/rthread_cancel.c:274
>>> #2  0x0000040644a5a89b in orte_routed_base_register_sync ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
>>> #3  0x00000406490d943c in init_routes ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/openmpi/mca_routed_binomial.so
>>> #4  0x0000040644a3c37f in orte_ess_base_app_setup ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
>>> #5  0x000004063eb1797d in rte_init ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/openmpi/mca_ess_env.so
>>> #6  0x0000040644a1a3fe in orte_init ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libopen-rte.so.7.0
>>> #7  0x00000406482c7976 in ompi_mpi_init ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libmpi.so.4.0
>>> #8  0x00000406482eac92 in PMPI_Init ()
>>>     from /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/lib/libmpi.so.4.0
>>> #9  0x0000040438c01093 in main (argc=1, argv=0x7f7ffffece60)
>>>     at ring_c.c:19
>>> Current language: auto; currently asm
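>>>
>>> (For anyone following along: the trace above came from attaching gdb
>>> to the hung rank. A typical recipe, assuming a stock gdb and that the
>>> hung process is named ring_c, is:
>>>
>>>     $ gdb examples/ring_c $(pgrep ring_c)   # attach to the hung process
>>>     (gdb) thread apply all bt               # backtrace every thread
>>>     (gdb) detach
>>>     (gdb) quit
>>>
>>> Any equivalent attach-and-backtrace sequence will do.)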
>>>
>>>
>>> On Fri, Dec 20, 2013 at 2:38 PM, Barrett, Brian W <bwba...@sandia.gov> wrote:
>>>
>>>> Paul -
>>>>
>>>> Any chance you could grab a stack trace from the mpi app? That's
>>>> probably the fastest next step.
>>>>
>>>> Brian
>>>>
>>>> -----Original Message-----
>>>> From: Paul Hargrove [phhargr...@lbl.gov]
>>>> Sent: Friday, December 20, 2013 03:33 PM Mountain Standard Time
>>>> To: Open MPI Developers
>>>> Subject: [EXTERNAL] [OMPI devel] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs
>>>>
>>>> With plenty of help from Jeff's and Ralph's bug fixes in the past 24
>>>> hours, I can now build OMPI for OpenBSD-5. However, running even a
>>>> simple example fails.
>>>>
>>>> Having set PATH and LD_LIBRARY_PATH:
>>>>     $ mpirun -np 1 examples/ring_c
>>>> just hangs.
>>>>
>>>> Output from "top" shows idle procs:
>>>>   PID USERNAME PRI NICE  SIZE   RES STATE   WAIT    TIME  CPU COMMAND
>>>> 31841 phargrov  10    0 2140K 3960K sleep/1 nanosle 0:00 0.00% ring_c
>>>> 13490 phargrov   2    0 2540K 4892K sleep/1 poll    0:00 0.00% orterun
>>>>
>>>> Distrusting the env vars and relying instead on the auto-prefix
>>>> behavior:
>>>>     $ /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/INST/bin/mpirun -np 1 examples/ring_c
>>>> also hangs.
>>>>
>>>> Not sure exactly what to infer from this, but a "bogus" btl doesn't
>>>> produce any complaint, which may indicate how far startup got:
>>>>     $ mpirun -mca btl bogus -np 1 examples/ring_c
>>>> Still hangs, with no complaint about the btl selection.
>>>>
>>>> All three cases above are singleton (-np 1) runs, but the behavior
>>>> with "-np 2" is the same.
>>>>
>>>> This does NOT appear to be an ORTE problem:
>>>>     -bash-4.2$ orterun -np 1 date
>>>>     Fri Dec 20 14:11:42 PST 2013
>>>>     -bash-4.2$ orterun -np 2 date
>>>>     Fri Dec 20 14:11:45 PST 2013
>>>>     Fri Dec 20 14:11:45 PST 2013
>>>>
>>>> Let me know what sort of verbose mca parameters to set and I'll
>>>> collect the info.
>>>> Compressed output of "ompi_info --all" is attached.
>>>>
>>>> -Paul

> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
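P.S. On the verbose-parameter question above: unless someone suggests
otherwise, a plausible starting point (assuming the usual
<framework>_base_verbose convention applies to the frameworks implicated
in the traces, i.e. routed and oob) would be:

    $ mpirun -mca routed_base_verbose 100 -mca oob_base_verbose 100 \
          -np 1 examples/ring_c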