I was initially testing Jeff's tarball for PR 410 on Mac OS X 10.8, where
cc is clang. I configured with
    --prefix=[...] --enable-debug --enable-osx-builtin-atomics CC=cc CXX=c++

"make check" passed, but when I tried to run ring_c I got the first failure
shown (far) below.
HOWEVER, I then tried 50 times to reproduce the failure and could not do so.
Since Jeff's tarball is not "official", I turned my attention to the current
master tarball instead.

I next tried FIVE HUNDRED times with the current master tarball, and was
able to reproduce the failure ONCE.
The failed assertion and backtrace are different from what I saw before, so
they also appear below.

Next, I tried the master tarball without the --enable-osx-builtin-atomics
configure option.
In that case my 95th trial failed, and I did not continue trying.
The failure output was (to me) indistinguishable from the one with
builtin-atomics, but it is also included below for completeness.

Finally, I tried without clang, leaving only "--prefix=[...] --enable-debug"
on the configure command line.
However, note that "gcc" is really "i686-apple-darwin11-llvm-gcc-4.2" and
thus shares MUCH in common with clang on the same system.
This configuration failed too, and the failure output is also provided
below.
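
For anyone wanting to repeat this kind of many-trial test, a loop like the
following is one way to do it. This is only a sketch, not the script used
for the runs described above: the function name repeat_until_fail is made up
for illustration, and it assumes the command under test (here mpirun) exits
with a non-zero status when a rank aborts.

```shell
# Hypothetical helper (not from the original report): run a command up to
# N times and stop at the first trial that exits with non-zero status.
# Usage: repeat_until_fail N command [args...]
repeat_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        # Capture stdout and stderr so the failing trial's output is kept.
        if ! out=$("$@" 2>&1); then
            echo "trial $i failed"
            printf '%s\n' "$out"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max trials"
    return 0
}
```

It could then be invoked as, for example:
    repeat_until_fail 500 mpirun -mca btl sm,self -np 2 examples/ring_c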

I hope somebody knows how to proceed from here.
I don't really have any reason to believe this is specific to Mac OS X, but
I don't have the spare cycles to dedicate to additional testing.

-Paul

Seen w/ Jeff's tarball:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7fc092a0cb50 is not on the list 0x7fc0928006a0
Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-pr410-v4-macos10.8-x86-clang-atomics/openmpi-gitclone/opal/mca/dstore/hash/dstore_hash.c, line 143.
[tesuji:26399] *** Process received signal ***
[tesuji:26399] Signal: Abort trap: 6 (6)
[tesuji:26399] Signal code:  (0)
[tesuji:26399] [ 0] 0   libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:26399] [ 1] 0   ??? 0x00000000ffffffff 0x0 + 4294967295
[tesuji:26399] [ 2] 0   libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:26399] [ 3] 0   libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:26399] [ 4] 0   mca_dstore_hash.so 0x000000010180803c store + 972
[tesuji:26399] [ 5] 0   libopen-pal.0.dylib 0x00000001016860c6 opal_dstore_base_store + 278
[tesuji:26399] [ 6] 0   mca_pmix_native.so 0x0000000101825795 native_get + 4709
[tesuji:26399] [ 7] 0   libmpi.0.dylib 0x000000010111f6a4 ompi_proc_complete_init + 980
[tesuji:26399] [ 8] 0   libmpi.0.dylib 0x0000000101126f24 ompi_mpi_init + 2372
[tesuji:26399] [ 9] 0   libmpi.0.dylib 0x00000001011744c0 MPI_Init + 480
[tesuji:26399] [10] 0   ring_c 0x00000001010e9c25 main + 53
[tesuji:26399] [11] 0   libdyld.dylib 0x00007fff8e03a7e1 start + 0
[tesuji:26399] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on
signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

Seen with master tarball and builtin-atomics:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7fc6d1900130 is not on the list 0x7fc6d0c30df0
Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang-atomics/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
[tesuji:62565] *** Process received signal ***
[tesuji:62565] Signal: Abort trap: 6 (6)
[tesuji:62565] Signal code:  (0)
[tesuji:62565] [ 0] 0   libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:62565] [ 1] 0   ??? 0x0000000000000000 0x0 + 0
[tesuji:62565] [ 2] 0   libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:62565] [ 3] 0   libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:62565] [ 4] 0   libopen-pal.0.dylib 0x0000000107d54dd5 opal_list_item_destruct + 85
[tesuji:62565] [ 5] 0   mca_dstore_hash.so 0x0000000107f67e21 opal_obj_run_destructors + 145
[tesuji:62565] [ 6] 0   mca_dstore_hash.so 0x0000000107f6707e store + 1054
[tesuji:62565] [ 7] 0   libopen-pal.0.dylib 0x0000000107de0336 opal_dstore_base_store + 278
[tesuji:62565] [ 8] 0   mca_pmix_native.so 0x0000000107f8aaa3 fencenb_cbfunc + 851
[tesuji:62565] [ 9] 0   mca_pmix_native.so 0x0000000107f8bf97 pmix_usock_process_msg + 695
[tesuji:62565] [10] 0   libopen-pal.0.dylib 0x0000000107dea38d event_process_active_single_queue + 493
[tesuji:62565] [11] 0   libopen-pal.0.dylib 0x0000000107de5f7c event_process_active + 140
[tesuji:62565] [12] 0   libopen-pal.0.dylib 0x0000000107de502e opal_libevent2022_event_base_loop + 830
[tesuji:62565] [13] 0   libopen-pal.0.dylib 0x0000000107d66532 progress_engine + 66
[tesuji:62565] [14] 0   libsystem_c.dylib 0x00007fff91e3d772 _pthread_start + 327
[tesuji:62565] [15] 0   libsystem_c.dylib 0x00007fff91e2a1a1 thread_start + 13
[tesuji:62565] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on
signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

Seen with master tarball and without builtin-atomics:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7f8ae2464f00 is not on the list 0x7f8ae2600690
Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
[tesuji:86550] *** Process received signal ***
[tesuji:86550] Signal: Abort trap: 6 (6)
[tesuji:86550] Signal code:  (0)
[tesuji:86550] [ 0] 0   libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:86550] [ 1] 0   ??? 0x0000000000000000 0x0 + 0
[tesuji:86550] [ 2] 0   libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:86550] [ 3] 0   libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:86550] [ 4] 0   libopen-pal.0.dylib 0x0000000104e41365 opal_list_item_destruct + 85
[tesuji:86550] [ 5] 0   mca_dstore_hash.so 0x0000000105039fc1 opal_obj_run_destructors + 145
[tesuji:86550] [ 6] 0   mca_dstore_hash.so 0x000000010503921e store + 1054
[tesuji:86550] [ 7] 0   libopen-pal.0.dylib 0x0000000104ec8306 opal_dstore_base_store + 278
[tesuji:86550] [ 8] 0   mca_pmix_native.so 0x000000010505bef3 fencenb_cbfunc + 851
[tesuji:86550] [ 9] 0   mca_pmix_native.so 0x000000010505d337 pmix_usock_process_msg + 695
[tesuji:86550] [10] 0   libopen-pal.0.dylib 0x0000000104ed214d event_process_active_single_queue + 493
[tesuji:86550] [11] 0   libopen-pal.0.dylib 0x0000000104ecdd3c event_process_active + 140
[tesuji:86550] [12] 0   libopen-pal.0.dylib 0x0000000104eccdee opal_libevent2022_event_base_loop + 830
[tesuji:86550] [13] 0   libopen-pal.0.dylib 0x0000000104e521d2 progress_engine + 66
[tesuji:86550] [14] 0   libsystem_c.dylib 0x00007fff91e3d772 _pthread_start + 327
[tesuji:86550] [15] 0   libsystem_c.dylib 0x00007fff91e2a1a1 thread_start + 13
[tesuji:86550] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on
signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

Seen on master configured with only --prefix= and --enable-debug:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7fd104200130 is not on the list 0x7fd10342b1e0
Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-gcc/openmpi-dev-1118-gdc80863/opal/mca/dstore/hash/dstore_hash.c, line 143.
[tesuji:12056] *** Process received signal ***
[tesuji:12056] Signal: Abort trap: 6 (6)
[tesuji:12056] Signal code:  (0)
[tesuji:12056] [ 0] 0   libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:12056] [ 1] 0   ??? 0x20656874202d206d 0x0 + 2334386829826793581
[tesuji:12056] [ 2] 0   libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:12056] [ 3] 0   libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:12056] [ 4] 0   mca_dstore_hash.so 0x000000010b22cf99 store + 873
[tesuji:12056] [ 5] 0   libopen-pal.0.dylib 0x000000010b0c1160 opal_dstore_base_store + 368
[tesuji:12056] [ 6] 0   mca_pmix_native.so 0x000000010b250b6f native_get + 6303
[tesuji:12056] [ 7] 0   libmpi.0.dylib 0x000000010ac32a9b ompi_proc_complete_init + 1659
[tesuji:12056] [ 8] 0   libmpi.0.dylib 0x000000010ac3be8d ompi_mpi_init + 3117
[tesuji:12056] [ 9] 0   libmpi.0.dylib 0x000000010ac881c1 MPI_Init + 609
[tesuji:12056] [10] 0   ring_c 0x000000010abe6bee main + 46
[tesuji:12056] [11] 0   libdyld.dylib 0x00007fff8e03a7e1 start + 0
[tesuji:12056] [12] 0   ??? 0x0000000000000001 0x0 + 1
[tesuji:12056] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on
signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
