I'm guessing that this is related to the threading changes that came with some ORTE changes between 1.7.3 and 1.7.4. I'm building a FreeBSD VM to see if I can make some progress on that, but I live in the land of slow bandwidth, so it might not be for a couple days.
Brian

On 12/20/13 5:00 PM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:

>FWIW: I've confirmed that this is a REGRESSION relative to 1.7.3, which
>works fine on FreeBSD-9.
>
>-Paul
>
>On Fri, Dec 20, 2013 at 3:30 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>And the FreeBSD backtraces again, this time configured with
>--enable-debug and for all threads.
>
>The 100%-CPU ring_c process:
>
>(gdb) thread apply all where
>
>Thread 2 (Thread 802007400 (LWP 182916/ring_c)):
>#0  0x0000000800de7aac in sched_yield () from /lib/libc.so.7
>#1  0x00000008013c7a5a in opal_progress ()
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/runtime/opal_progress.c:199
>#2  0x00000008008670ec in ompi_mpi_init (argc=1, argv=0x7fffffffd3e0, requested=0, provided=0x7fffffffd328)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:618
>#3  0x000000080089aefe in PMPI_Init (argc=0x7fffffffd36c, argv=0x7fffffffd360) at pinit.c:84
>#4  0x0000000000400963 in main (argc=1, argv=0x7fffffffd3e0) at ring_c.c:19
>
>Thread 1 (Thread 802007800 (LWP 186415/ring_c)):
>#0  0x0000000800e2711c in poll () from /lib/libc.so.7
>#1  0x0000000800b727fe in poll () from /lib/libthr.so.3
>#2  0x000000080142edc1 in poll_dispatch (base=0x8020cd900, tv=0x0)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>#3  0x0000000801422ca1 in opal_libevent2021_event_base_loop (base=0x8020cd900, flags=1)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>#4  0x00000008010f2c22 in orte_progress_thread_engine (obj=0x80139b160)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>#5  0x0000000800b700a4 in pthread_getprio () from /lib/libthr.so.3
>#6  0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>The idle ring_c process:
>
>(gdb) thread apply all where
>
>Thread 2 (Thread 802007400 (LWP 183983/ring_c)):
>#0  0x0000000800e6c44c in nanosleep () from /lib/libc.so.7
>#1  0x0000000800b729d5 in nanosleep () from /lib/libthr.so.3
>#2  0x0000000801161618 in orte_routed_base_register_sync (setup=true)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/routed/base/routed_base_fns.c:344
>#3  0x0000000802a0a0a2 in init_routes (job=2628321281, ndat=0x0)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/routed/binomial/routed_binomial.c:705
>#4  0x00000008011272ce in orte_ess_base_app_setup (db_restrict_local=true)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/ess/base/ess_base_std_app.c:233
>#5  0x0000000802401408 in rte_init ()
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/mca/ess/env/ess_env_module.c:146
>#6  0x00000008010f2b28 in orte_init (pargc=0x0, pargv=0x0, flags=32)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:158
>#7  0x0000000800866bde in ompi_mpi_init (argc=1, argv=0x7fffffffd3e0, requested=0, provided=0x7fffffffd328)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/ompi/runtime/ompi_mpi_init.c:451
>#8  0x000000080089aefe in PMPI_Init (argc=0x7fffffffd36c, argv=0x7fffffffd360) at pinit.c:84
>#9  0x0000000000400963 in main (argc=1, argv=0x7fffffffd3e0) at ring_c.c:19
>
>Thread 1 (Thread 802007800 (LWP 186412/ring_c)):
>#0  0x0000000800e2711c in poll () from /lib/libc.so.7
>#1  0x0000000800b727fe in poll () from /lib/libthr.so.3
>#2  0x000000080142edc1 in poll_dispatch (base=0x8020cd900, tv=0x0)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/poll.c:165
>#3  0x0000000801422ca1 in opal_libevent2021_event_base_loop (base=0x8020cd900, flags=1)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/opal/mca/event/libevent2021/libevent/event.c:1631
>#4  0x00000008010f2c22 in orte_progress_thread_engine (obj=0x80139b160)
>    at /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7-latest/orte/runtime/orte_init.c:180
>#5  0x0000000800b700a4 in pthread_getprio () from /lib/libthr.so.3
>#6  0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>-Paul
>
>On Fri, Dec 20, 2013 at 2:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>This case is not quite like my OpenBSD-5 report.
>On FreeBSD-9 I *can* run singletons, but "-np 2" hangs.
>
>The following hangs:
>$ mpirun -np 2 examples/ring_c
>
>The following complains about the "bogus" btl selection,
>so this is not the same as my problem with OpenBSD-5:
>$ mpirun -mca btl bogus -np 2 examples/ring_c
>[freebsd9-amd64.qemu:05926] mca: base: components_open: component pml / bfo open function failed
>[freebsd9-amd64.qemu:05926] mca: base: components_open: component pml / ob1 open function failed
>[freebsd9-amd64.qemu:05926] PML ob1 cannot be selected
>--------------------------------------------------------------------------
>A requested component was not found, or was unable to be opened.  This
>means that this component is either not installed or is unable to be
>used on your system (e.g., sometimes this means that shared libraries
>that the component requires are unable to be found/loaded).  Note that
>Open MPI stopped checking at the first component that it did not find.
>
>Host:      freebsd9-amd64.qemu
>Framework: btl
>Component: bogus
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>No available pml components were found!
>
>This means that there are no components of this type installed on your
>system, or all the components reported that they could not be used.
>
>This is a fatal error; your MPI process is likely to abort.  Check the
>output of the "ompi_info" command and ensure that components of this
>type are available on your system.  You may also wish to check the
>value of the "component_path" MCA parameter and ensure that it has at
>least one directory that contains valid MCA components.
>--------------------------------------------------------------------------
>
>For the non-bogus case, "top" shows one idle and one active ring_c process:
>
>  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
> 5933 phargrov    2  29    0    98M  6384K select  1   0:32 100.00% ring_c
> 5931 phargrov    2  20    0 77844K  4856K select  0   0:00   0.00% orterun
> 5932 phargrov    2  24    0 51652K  4960K select  0   0:00   0.00% ring_c
>
>A backtrace for the 100%-CPU ring_c process:
>
>(gdb) where
>#0  0x0000000800d9811c in poll () from /lib/libc.so.7
>#1  0x0000000800ae37fe in poll () from /lib/libthr.so.3
>#2  0x00000008013259aa in poll_dispatch ()
>    from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#3  0x000000080131eb50 in opal_libevent2021_event_base_loop ()
>    from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#4  0x000000080106395d in orte_progress_thread_engine ()
>    from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-rte.so.7
>#5  0x0000000800ae10a4 in pthread_getprio () from /lib/libthr.so.3
>#6  0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>And for the idle ring_c process:
>
>(gdb) where
>#0  0x0000000800d9811c in poll () from /lib/libc.so.7
>#1  0x0000000800ae37fe in poll () from /lib/libthr.so.3
>#2  0x00000008013259aa in poll_dispatch ()
>    from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#3  0x000000080131eb50 in opal_libevent2021_event_base_loop ()
>    from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-pal.so.7
>#4  0x000000080106395d in orte_progress_thread_engine ()
>    from /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/INST/lib/libopen-rte.so.7
>#5  0x0000000800ae10a4 in pthread_getprio () from /lib/libthr.so.3
>#6  0x0000000000000000 in ?? ()
>Error accessing memory address 0x7fffffbfe000: Bad address.
>
>They look to be the same, but I double-checked that these are correct.
>
>-Paul
>
>--
>Paul H. Hargrove                          phhargr...@lbl.gov
>Future Technologies Group
>Computer and Data Sciences Department     Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories