Larry, Thanks for the suggestions.
The system logs show only:

Aug 30 14:16:06 freebsd-amd64 kernel: pid 95624 (orterun), uid 19214: exited on signal 11 (core dumped)

However, while "nactivequeues" is a 4-byte integer it *does* appear to be
misaligned (suggesting to me that "base" is bogus).

(gdb) print sizeof(base->nactivequeues)
$1 = 4
(gdb) print &base->nactivequeues
$2 = (int *) 0x1000000f7

Digging deeper it looks like the "base" being used by gdb is bogus.
Going up one stack frame to the caller, we see a totally different base
being passed:

(gdb) up
#1  0x00000008062e1fd2 in pmix_start_progress_thread ()
    at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/util/progress_threads.c:83
83          event_assign(&block_ev, ev_base, block_pipe[0],
(gdb) print ev_base
$6 = (pmix_event_base_t *) 0x2ec9500

I distrust gdb on this system.
Here is what lldb says:

(lldb) bt
* thread #1, name = 'orterun', stop reason = signal SIGSEGV
  * frame #0: libopen-pal.so.19`opal_libevent2022_event_assign(ev=0x00000008065482c0, base=<unavailable>, fd=<unavailable>, events=2, callback=<unavailable>, arg=0x0000000000000000) at event.c:1779
    frame #1: mca_pmix_pmix112.so`pmix_start_progress_thread at progress_threads.c:83
    frame #2: mca_pmix_pmix112.so`PMIx_server_init(module=0x0000000806545be8, info=0x0000000802e16a00, ninfo=2) at pmix_server.c:310
    frame #3: mca_pmix_pmix112.so`pmix1_server_init(module=0x0000000800b106a0, info=0x00007fffffffe290) at pmix1_server_south.c:140
    frame #4: libopen-rte.so.19`pmix_server_init at pmix_server.c:261
    frame #5: mca_ess_hnp.so`rte_init at ess_hnp_module.c:666
    frame #6: libopen-rte.so.19`orte_init(pargc=0x00007fffffffe988, pargv=0x00007fffffffe980, flags=4) at orte_init.c:226
    frame #7: orterun`orterun(argc=7, argv=0x00007fffffffea18) at orterun.c:831
    frame #8: orterun`main(argc=7, argv=0x00007fffffffea18) at main.c:13
    frame #9: 0x0000000000403a9f orterun`_start + 383
(lldb) up
frame #1: mca_pmix_pmix112.so`pmix_start_progress_thread at progress_threads.c:83
   80           event_base_free(ev_base);
   81           return NULL;
   82       }
-> 83       event_assign(&block_ev, ev_base, block_pipe[0],
   84                    EV_READ, wakeup, NULL);
   85       event_add(&block_ev, 0);
   86       evlib_active = true;
(lldb) print ev_base
(pmix_event_base_t *) $2 = 0x0000000002ec9500
(lldb) print *ev_base
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory

So, it looks like the SEGV is due to a bad 2nd argument to event_assign().

-Paul

On Wed, Aug 30, 2017 at 4:17 PM, Larry Baker <ba...@usgs.gov> wrote:

> Paul,
>
> (gdb) print base->nactivequeues
>
>
> seems like an extraordinarily large number to me.  I don't know what the
> implications of the --enable-debug clang option are.  Any chance the
> SEGFAULT is a debugging trap when an uninitialized value is encountered?
>
> The other thought I had is an alignment trap if, for example,
> nactivequeues is a 64-bit int but is not 64-bit aligned.  As far as I can
> tell, nactivequeues is a plain int.  But, what that is on FreeBSD/amd64, I
> do not know.
>
> Should there be more information in dmesg or a system log file with the
> trap code so you can identify whether it is an instruction fetch (VERY
> unlikely), an operand fetch, or a store that caused the trap?
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
>
>
>
> On 30 Aug 2017, at 3:17:05 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> I am testing the 2.1.2rc3 tarball on FreeBSD-11.1, configured with
>    --prefix=[...] --enable-debug CC=clang CXX=clang++
> --disable-mpi-fortran --with-hwloc=/usr/local
>
> The CC/CXX settings are to use the system default compilers (rather than
> gcc/g++ in /usr/local/bin).
> The --with-hwloc is to avoid issue #3992
> <https://github.com/open-mpi/ompi/issues/3992> (though I have not
> determined if that impacts this RC).
>
> When running ring_c I get a SEGV from orterun, for which a gdb backtrace
> is given below.
> The one surprising thing (highlighted) in the backtrace is that both the
> RHS and LHS of the assignment appear to be valid memory locations.
> So, if the backtrace is accurate then I am at a loss as to why a SEGV
> occurs.
>
> -Paul
>
>
> Program terminated with signal 11, Segmentation fault.
> [...]
> #0  opal_libevent2022_event_assign (ev=0x8065482c0, base=<value optimized out>, fd=<value optimized out>,
>     events=2, callback=<value optimized out>, arg=0x0)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/event/libevent2022/libevent/event.c:1779
> 1779        ev->ev_pri = base->nactivequeues / 2;
> (gdb) print base->nactivequeues
> $3 = 106201992
> (gdb) print ev->ev_pri
> $4 = 0 '\0'
> (gdb) where
> #0  opal_libevent2022_event_assign (ev=0x8065482c0, base=<value optimized out>, fd=<value optimized out>,
>     events=2, callback=<value optimized out>, arg=0x0)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/event/libevent2022/libevent/event.c:1779
> #1  0x00000008062e1fd2 in pmix_start_progress_thread ()
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/util/progress_threads.c:83
> #2  0x00000008063047e4 in PMIx_server_init (module=0x806545be8, info=0x802e16a00, ninfo=2)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c:310
> #3  0x00000008062c12f6 in pmix1_server_init (module=0x800b106a0, info=0x7fffffffe290)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix1_server_south.c:140
> #4  0x0000000800889f43 in pmix_server_init ()
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/orted/pmix/pmix_server.c:261
> #5  0x0000000803e22d87 in rte_init ()
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:666
> #6  0x000000080084a45e in orte_init (pargc=0x7fffffffe988, pargv=0x7fffffffe980, flags=4)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/runtime/orte_init.c:226
> #7  0x00000000004046a4 in orterun (argc=7, argv=0x7fffffffea18)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/tools/orterun/orterun.c:831
> #8  0x0000000000403bc2 in main (argc=7, argv=0x7fffffffea18)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/tools/orterun/main.c:13
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
>
>

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
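
For reference, the alignment observation above (a 4-byte int reported by gdb
at 0x1000000f7) can be checked mechanically. The following is a minimal
sketch in C, not code from Open MPI or libevent; the helper name is_aligned
is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Open MPI or libevent): returns true
 * when addr is a multiple of alignment. The alignment must be a nonzero
 * power of two, which holds for the natural alignment of C scalar types. */
static bool is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power-of-two alignment, the low bits must all be zero. */
    return (addr & (uintptr_t)(alignment - 1)) == 0;
}
```

With this helper, is_aligned((uintptr_t)0x1000000f7, 4) is false because the
low two bits of the address are set, whereas the ev_base value seen in the
caller, 0x2ec9500, passes the same check. Passing the check says nothing
about whether the memory at that address is readable.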