Initially I was testing Jeff's tarball for PR 410 on Mac OS X 10.8, where cc is clang. I configured with
    --prefix=[...] --enable-debug --enable-osx-builtin-atomics CC=cc CXX=c++

"make check" passed, but when I ran ring_c I got the first failure shown (far) below.
HOWEVER, I tried 50 more times to reproduce the failure and could not do so.

Since Jeff's tarball is not "official", I turned my attention to the current master tarball instead.
I tried FIVE HUNDRED times with the current master tarball and was able to reproduce the failure ONCE.
The failed assertion and backtrace are different from what I saw before, so they also appear below.

Next, I tried the master tarball without the builtin-atomics configure option.
In that case my 95th trial failed and I didn't continue trying.
The failure output was (to me) indistinguishable from the one with builtin-atomics, but it is also included below for completeness.

Finally, I tried without clang, leaving only "--prefix=[...] --enable-debug" on the configure command line.
However, note that "gcc" is really "i686-apple-darwin11-llvm-gcc-4.2" and thus shares MUCH in common with clang on the same system.
This configuration failed too, and the failure output is also provided below.

I hope somebody knows how to proceed from here.
I don't really have any reason to believe this is specific to Mac OS X, but I don't have the spare cycles to dedicate to additional testing.

-Paul

Seen w/ Jeff's tarball:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7fc092a0cb50 is not on the list 0x7fc0928006a0
Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-pr410-v4-macos10.8-x86-clang-atomics/openmpi-gitclone/opal/mca/dstore/hash/dstore_hash.c, line 143.
[tesuji:26399] *** Process received signal ***
[tesuji:26399] Signal: Abort trap: 6 (6)
[tesuji:26399] Signal code: (0)
[tesuji:26399] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:26399] [ 1] 0 ??? 0x00000000ffffffff 0x0 + 4294967295
[tesuji:26399] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:26399] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:26399] [ 4] 0 mca_dstore_hash.so 0x000000010180803c store + 972
[tesuji:26399] [ 5] 0 libopen-pal.0.dylib 0x00000001016860c6 opal_dstore_base_store + 278
[tesuji:26399] [ 6] 0 mca_pmix_native.so 0x0000000101825795 native_get + 4709
[tesuji:26399] [ 7] 0 libmpi.0.dylib 0x000000010111f6a4 ompi_proc_complete_init + 980
[tesuji:26399] [ 8] 0 libmpi.0.dylib 0x0000000101126f24 ompi_mpi_init + 2372
[tesuji:26399] [ 9] 0 libmpi.0.dylib 0x00000001011744c0 MPI_Init + 480
[tesuji:26399] [10] 0 ring_c 0x00000001010e9c25 main + 53
[tesuji:26399] [11] 0 libdyld.dylib 0x00007fff8e03a7e1 start + 0
[tesuji:26399] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

Seen with master tarball and builtin-atomics:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7fc6d1900130 is not on the list 0x7fc6d0c30df0
Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang-atomics/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
[tesuji:62565] *** Process received signal ***
[tesuji:62565] Signal: Abort trap: 6 (6)
[tesuji:62565] Signal code: (0)
[tesuji:62565] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:62565] [ 1] 0 ??? 0x0000000000000000 0x0 + 0
[tesuji:62565] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:62565] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:62565] [ 4] 0 libopen-pal.0.dylib 0x0000000107d54dd5 opal_list_item_destruct + 85
[tesuji:62565] [ 5] 0 mca_dstore_hash.so 0x0000000107f67e21 opal_obj_run_destructors + 145
[tesuji:62565] [ 6] 0 mca_dstore_hash.so 0x0000000107f6707e store + 1054
[tesuji:62565] [ 7] 0 libopen-pal.0.dylib 0x0000000107de0336 opal_dstore_base_store + 278
[tesuji:62565] [ 8] 0 mca_pmix_native.so 0x0000000107f8aaa3 fencenb_cbfunc + 851
[tesuji:62565] [ 9] 0 mca_pmix_native.so 0x0000000107f8bf97 pmix_usock_process_msg + 695
[tesuji:62565] [10] 0 libopen-pal.0.dylib 0x0000000107dea38d event_process_active_single_queue + 493
[tesuji:62565] [11] 0 libopen-pal.0.dylib 0x0000000107de5f7c event_process_active + 140
[tesuji:62565] [12] 0 libopen-pal.0.dylib 0x0000000107de502e opal_libevent2022_event_base_loop + 830
[tesuji:62565] [13] 0 libopen-pal.0.dylib 0x0000000107d66532 progress_engine + 66
[tesuji:62565] [14] 0 libsystem_c.dylib 0x00007fff91e3d772 _pthread_start + 327
[tesuji:62565] [15] 0 libsystem_c.dylib 0x00007fff91e2a1a1 thread_start + 13
[tesuji:62565] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

Seen with master tarball and without builtin-atomics:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7f8ae2464f00 is not on the list 0x7f8ae2600690
Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
[tesuji:86550] *** Process received signal ***
[tesuji:86550] Signal: Abort trap: 6 (6)
[tesuji:86550] Signal code: (0)
[tesuji:86550] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:86550] [ 1] 0 ??? 0x0000000000000000 0x0 + 0
[tesuji:86550] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:86550] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:86550] [ 4] 0 libopen-pal.0.dylib 0x0000000104e41365 opal_list_item_destruct + 85
[tesuji:86550] [ 5] 0 mca_dstore_hash.so 0x0000000105039fc1 opal_obj_run_destructors + 145
[tesuji:86550] [ 6] 0 mca_dstore_hash.so 0x000000010503921e store + 1054
[tesuji:86550] [ 7] 0 libopen-pal.0.dylib 0x0000000104ec8306 opal_dstore_base_store + 278
[tesuji:86550] [ 8] 0 mca_pmix_native.so 0x000000010505bef3 fencenb_cbfunc + 851
[tesuji:86550] [ 9] 0 mca_pmix_native.so 0x000000010505d337 pmix_usock_process_msg + 695
[tesuji:86550] [10] 0 libopen-pal.0.dylib 0x0000000104ed214d event_process_active_single_queue + 493
[tesuji:86550] [11] 0 libopen-pal.0.dylib 0x0000000104ecdd3c event_process_active + 140
[tesuji:86550] [12] 0 libopen-pal.0.dylib 0x0000000104eccdee opal_libevent2022_event_base_loop + 830
[tesuji:86550] [13] 0 libopen-pal.0.dylib 0x0000000104e521d2 progress_engine + 66
[tesuji:86550] [14] 0 libsystem_c.dylib 0x00007fff91e3d772 _pthread_start + 327
[tesuji:86550] [15] 0 libsystem_c.dylib 0x00007fff91e2a1a1 thread_start + 13
[tesuji:86550] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

Seen on master configured with only --prefix= and --enable-debug:

$ mpirun -mca btl sm,self -np 2 examples/ring_c
 Warning :: opal_list_remove_item - the item 0x7fd104200130 is not on the list 0x7fd10342b1e0
Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-gcc/openmpi-dev-1118-gdc80863/opal/mca/dstore/hash/dstore_hash.c, line 143.
[tesuji:12056] *** Process received signal ***
[tesuji:12056] Signal: Abort trap: 6 (6)
[tesuji:12056] Signal code: (0)
[tesuji:12056] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
[tesuji:12056] [ 1] 0 ??? 0x20656874202d206d 0x0 + 2334386829826793581
[tesuji:12056] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
[tesuji:12056] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
[tesuji:12056] [ 4] 0 mca_dstore_hash.so 0x000000010b22cf99 store + 873
[tesuji:12056] [ 5] 0 libopen-pal.0.dylib 0x000000010b0c1160 opal_dstore_base_store + 368
[tesuji:12056] [ 6] 0 mca_pmix_native.so 0x000000010b250b6f native_get + 6303
[tesuji:12056] [ 7] 0 libmpi.0.dylib 0x000000010ac32a9b ompi_proc_complete_init + 1659
[tesuji:12056] [ 8] 0 libmpi.0.dylib 0x000000010ac3be8d ompi_mpi_init + 3117
[tesuji:12056] [ 9] 0 libmpi.0.dylib 0x000000010ac881c1 MPI_Init + 609
[tesuji:12056] [10] 0 ring_c 0x000000010abe6bee main + 46
[tesuji:12056] [11] 0 libdyld.dylib 0x00007fff8e03a7e1 start + 0
[tesuji:12056] [12] 0 ??? 0x0000000000000001 0x0 + 1
[tesuji:12056] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900