Larry,

Thanks for the suggestions.

The system logs show only:
   Aug 30 14:16:06 freebsd-amd64 kernel: pid 95624 (orterun), uid 19214: exited on signal 11 (core dumped)

However, while "nactivequeues" is a 4-byte integer it *does* appear to be
misaligned (suggesting to me that "base" is bogus).

(gdb) print sizeof(base->nactivequeues)
$1 = 4
(gdb) print &base->nactivequeues
$2 = (int *) 0x1000000f7
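
For what it's worth, here is a throwaway check (plain C, not Open MPI code;
the literal is just the address gdb printed) confirming the misalignment:

    /* Throwaway sketch, not Open MPI code: the literal below is just the
     * address gdb printed for &base->nactivequeues.  0x...f7 % 4 == 3, so
     * the pointer is not 4-byte aligned.  (On amd64 a misaligned int load
     * would not normally fault by itself, which also points toward "base"
     * being garbage rather than merely misaligned.) */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uintptr_t addr = 0x1000000f7ULL;
        printf("aligned for a %zu-byte int: %s\n", sizeof(int),
               addr % sizeof(int) == 0 ? "yes" : "no");
        return 0;
    }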

Digging deeper, it looks like the "base" being used by gdb in frame #0 is
bogus.  Going up one stack frame to the caller, we see a totally different
base being passed:

(gdb) up
#1  0x00000008062e1fd2 in pmix_start_progress_thread ()
    at
/home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/util/progress_threads.c:83
83          event_assign(&block_ev, ev_base, block_pipe[0],
(gdb) print ev_base
$6 = (pmix_event_base_t *) 0x2ec9500

I am distrusting gdb on this system.

Here is what lldb says:

(lldb) bt
* thread #1, name = 'orterun', stop reason = signal SIGSEGV
  * frame #0:
libopen-pal.so.19`opal_libevent2022_event_assign(ev=0x00000008065482c0,
base=<unavailable>, fd=<unavailable>, events=2, callback=<unavailable>,
arg=0x0000000000000000) at event.c:1779
    frame #1: mca_pmix_pmix112.so`pmix_start_progress_thread at
progress_threads.c:83
    frame #2:
mca_pmix_pmix112.so`PMIx_server_init(module=0x0000000806545be8,
info=0x0000000802e16a00, ninfo=2) at pmix_server.c:310
    frame #3:
mca_pmix_pmix112.so`pmix1_server_init(module=0x0000000800b106a0,
info=0x00007fffffffe290) at pmix1_server_south.c:140
    frame #4: libopen-rte.so.19`pmix_server_init at pmix_server.c:261
    frame #5: mca_ess_hnp.so`rte_init at ess_hnp_module.c:666
    frame #6: libopen-rte.so.19`orte_init(pargc=0x00007fffffffe988,
pargv=0x00007fffffffe980, flags=4) at orte_init.c:226
    frame #7: orterun`orterun(argc=7, argv=0x00007fffffffea18) at
orterun.c:831
    frame #8: orterun`main(argc=7, argv=0x00007fffffffea18) at main.c:13
    frame #9: 0x0000000000403a9f orterun`_start + 383
(lldb) up
frame #1: mca_pmix_pmix112.so`pmix_start_progress_thread at
progress_threads.c:83
   80           event_base_free(ev_base);
   81           return NULL;
   82       }
-> 83       event_assign(&block_ev, ev_base, block_pipe[0],
   84                    EV_READ, wakeup, NULL);
   85       event_add(&block_ev, 0);
   86       evlib_active = true;
(lldb) print ev_base
(pmix_event_base_t *) $2 = 0x0000000002ec9500
(lldb) print *ev_base
error: Couldn't apply expression side effects : Couldn't dematerialize a
result variable: couldn't read its memory

So, it looks like the SEGV is due to a bad second argument (the event_base
pointer) passed to event_assign().
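
To make the failure mode concrete, here is a stripped-down sketch
(hypothetical struct/function names, not the libevent or Open MPI sources)
of the dereference at event.c:1779 that faults when that argument is garbage:

    /* Hypothetical sketch, not the libevent source: models the line
     * "ev->ev_pri = base->nactivequeues / 2;" quoted from event.c:1779
     * further down in this thread.  If "base" points at unmapped or
     * garbage memory, the load of base->nactivequeues is presumably what
     * raises SIGSEGV; the store into ev->ev_pri (which gdb could read
     * just fine) never executes. */
    struct fake_event_base { int nactivequeues; };
    struct fake_event      { unsigned char ev_pri; };

    static void fake_event_assign(struct fake_event *ev,
                                  struct fake_event_base *base)
    {
        ev->ev_pri = (unsigned char)(base->nactivequeues / 2);
    }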

-Paul

On Wed, Aug 30, 2017 at 4:17 PM, Larry Baker <ba...@usgs.gov> wrote:

> Paul,
>
> (gdb) print base->nactivequeues
> $3 = 106201992
>
> seems like an extraordinarily large number to me.  I don't know what the
> implications of the --enable-debug clang option are.  Any chance the
> SEGFAULT is a debugging trap when an uninitialized value is encountered?
>
> The other thought I had is an alignment trap if, for example,
> nactivequeues is a 64-bit int but is not 64-bit aligned.  As far as I can
> tell, nactivequeues is a plain int.  But, what that is on FreeBSD/amd64, I
> do not know.
>
> Should there be more information in dmesg or a system log file with the
> trap code so you can identify whether it is an instruction fetch (VERY
> unlikely), an operand fetch, or a store that caused the trap?
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
>
>
>
> On 30 Aug 2017, at 3:17:05 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> I am testing the 2.1.2rc3 tarball on FreeBSD-11.1, configured with
>    --prefix=[...] --enable-debug CC=clang CXX=clang++
> --disable-mpi-fortran --with-hwloc=/usr/local
>
> The CC/CXX settings are to use the system default compilers (rather than
> gcc/g++ in /usr/local/bin).
> The --with-hwloc option is to avoid issue #3992
> <https://github.com/open-mpi/ompi/issues/3992> (though I have not
> determined whether that issue impacts this RC).
>
> When running ring_c I get a SEGV from orterun, for which a gdb backtrace
> is given below.
> The one surprising thing (highlighted) in the backtrace is that both the
> RHS and LHS of the assignment appear to be valid memory locations.
> So, if the backtrace is accurate, then I am at a loss as to why a SEGV
> occurs.
>
> -Paul
>
>
> Program terminated with signal 11, Segmentation fault.
> [...]
> #0  opal_libevent2022_event_assign (ev=0x8065482c0, base=<value optimized
> out>, fd=<value optimized out>,
>     events=2, callback=<value optimized out>, arg=0x0)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/opal/mca/event/libevent2022/libevent/event.c:1779
> 1779                    ev->ev_pri = base->nactivequeues / 2;
> (gdb) print base->nactivequeues
> $3 = 106201992
> (gdb) print ev->ev_pri
> $4 = 0 '\0'
> (gdb) where
> #0  opal_libevent2022_event_assign (ev=0x8065482c0, base=<value optimized
> out>, fd=<value optimized out>,
>     events=2, callback=<value optimized out>, arg=0x0)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/opal/mca/event/libevent2022/libevent/event.c:1779
> #1  0x00000008062e1fd2 in pmix_start_progress_thread ()
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/util/progress_threads.c:83
> #2  0x00000008063047e4 in PMIx_server_init (module=0x806545be8,
> info=0x802e16a00, ninfo=2)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c:310
> #3  0x00000008062c12f6 in pmix1_server_init (module=0x800b106a0,
> info=0x7fffffffe290)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix1_server_south.c:140
> #4  0x0000000800889f43 in pmix_server_init ()
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/orte/orted/pmix/pmix_server.c:261
> #5  0x0000000803e22d87 in rte_init ()
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:666
> #6  0x000000080084a45e in orte_init (pargc=0x7fffffffe988,
> pargv=0x7fffffffe980, flags=4)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/orte/runtime/orte_init.c:226
> #7  0x00000000004046a4 in orterun (argc=7, argv=0x7fffffffea18)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/orte/tools/orterun/orterun.c:831
> #8  0x0000000000403bc2 in main (argc=7, argv=0x7fffffffea18)
>     at /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/
> openmpi-2.1.2rc3/orte/tools/orterun/main.c:13
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900


-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
