Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
For my CR work this can probably be ignored. I think I was looking in the
wrong place.

On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote:
> Continuing with the CR code I now get a crash which can be easily reproduced
> using orte/test/system/orte_barrier.c
> 
> I get:
> 
> orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: 
> Assertion `0 == item->opal_list_item_refcount' failed.
> [dcbz:05085] *** Process received signal ***
> [dcbz:05085] Signal: Aborted (6)
> [dcbz:05085] Signal code:  (-6)
> [dcbz:05085] [ 0] /lib64/libpthread.so.0(+0xf750)[0x7f95bca0b750]
> [dcbz:05085] [ 1] /lib64/libc.so.6(gsignal+0x39)[0x7f95bc672c59]
> [dcbz:05085] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95bc674368]
> [dcbz:05085] [ 3] /lib64/libc.so.6(+0x2ebb6)[0x7f95bc66bbb6]
> [dcbz:05085] [ 4] /lib64/libc.so.6(+0x2ec62)[0x7f95bc66bc62]
> [dcbz:05085] [ 5] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86975)[0x7f95bcfbd975]
> [dcbz:05085] [ 6] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86d9a)[0x7f95bcfbdd9a]
> [dcbz:05085] [ 7] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8c831)[0x7f95bcca5831]
> [dcbz:05085] [ 8] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8caa3)[0x7f95bcca5aa3]
> [dcbz:05085] [ 9] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x2c1)[0x7f95bcca611f]
> [dcbz:05085] [10] 
> /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x2233b)[0x7f95bcf5933b]
> [dcbz:05085] [11] /lib64/libpthread.so.0(+0x7f33)[0x7f95bca03f33]
> [dcbz:05085] [12] /lib64/libc.so.6(clone+0x6d)[0x7f95bc731ead]
> [dcbz:05085] *** End of error message ***
> --
> orterun noticed that process rank 0 with PID 5085 on node dcbz exited on 
> signal 6 (Aborted).
> --
> 
> and in gdb
> 
> [New LWP 5086]
> [New LWP 5085]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `system/orte_barrier'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 56  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x7f95bc6744a8 in __GI_abort () at abort.c:118
> #2  0x7f95bc66bbb6 in __assert_fail_base (fmt=0x7f95bc7b8ea8 "%s%s%s:%u: 
> %s%sAssertion `%s' failed.\n%n", 
> assertion=assertion@entry=0x7f95bd06d6c0 "0 == 
> item->opal_list_item_refcount", 
> file=file@entry=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", 
> line=line@entry=547, 
> function=function@entry=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> 
> "_opal_list_append") at assert.c:92
> #3  0x7f95bc66bc62 in __GI___assert_fail (assertion=0x7f95bd06d6c0 "0 == 
> item->opal_list_item_refcount", 
> file=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=547, 
> function=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append") 
> at assert.c:101
> #4  0x7f95bcfbd975 in _opal_list_append (list=0x7f95bd2b9408 
> , item=0x1f35be0, 
> FILE_NAME=0x7f95bd06d718 
> "../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c", LINENO=163)
> at ../../../../../opal/class/opal_list.h:547
> #5  0x7f95bcfbdd9a in process_barrier (fd=-1, args=4, cbdata=0x1f35ed0) 
> at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
> #6  0x7f95bcca5831 in event_process_active_single_queue (base=0x1ef63a0, 
> activeq=0x1ef6360)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #7  0x7f95bcca5aa3 in event_process_active (base=0x1ef63a0) at 
> ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #8  0x7f95bcca611f in opal_libevent2021_event_base_loop (base=0x1ef63a0, 
> flags=1)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #9  0x7f95bcf5933b in orte_progress_thread_engine (obj=0x7f95bd2b9160 
> ) at ../../orte/runtime/orte_init.c:180
> #10 0x7f95bca03f33 in start_thread (arg=0x7f95bbb0d700) at 
> pthread_create.c:309
> #11 0x7f95bc731ead in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) 
> 
> As far as I understand, it seems to call opal_list_append() twice at
> orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
> 
> opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);
> 
> I have no idea how to fix this.
> 
>   Adrian


[OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
Continuing with the CR code I now get a crash which can be easily reproduced
using orte/test/system/orte_barrier.c

I get:

orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: 
Assertion `0 == item->opal_list_item_refcount' failed.
[dcbz:05085] *** Process received signal ***
[dcbz:05085] Signal: Aborted (6)
[dcbz:05085] Signal code:  (-6)
[dcbz:05085] [ 0] /lib64/libpthread.so.0(+0xf750)[0x7f95bca0b750]
[dcbz:05085] [ 1] /lib64/libc.so.6(gsignal+0x39)[0x7f95bc672c59]
[dcbz:05085] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95bc674368]
[dcbz:05085] [ 3] /lib64/libc.so.6(+0x2ebb6)[0x7f95bc66bbb6]
[dcbz:05085] [ 4] /lib64/libc.so.6(+0x2ec62)[0x7f95bc66bc62]
[dcbz:05085] [ 5] 
/home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86975)[0x7f95bcfbd975]
[dcbz:05085] [ 6] 
/home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86d9a)[0x7f95bcfbdd9a]
[dcbz:05085] [ 7] 
/home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8c831)[0x7f95bcca5831]
[dcbz:05085] [ 8] 
/home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8caa3)[0x7f95bcca5aa3]
[dcbz:05085] [ 9] 
/home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x2c1)[0x7f95bcca611f]
[dcbz:05085] [10] 
/home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x2233b)[0x7f95bcf5933b]
[dcbz:05085] [11] /lib64/libpthread.so.0(+0x7f33)[0x7f95bca03f33]
[dcbz:05085] [12] /lib64/libc.so.6(clone+0x6d)[0x7f95bc731ead]
[dcbz:05085] *** End of error message ***
--
orterun noticed that process rank 0 with PID 5085 on node dcbz exited on signal 
6 (Aborted).
--

and in gdb

[New LWP 5086]
[New LWP 5085]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `system/orte_barrier'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f95bc6744a8 in __GI_abort () at abort.c:118
#2  0x7f95bc66bbb6 in __assert_fail_base (fmt=0x7f95bc7b8ea8 "%s%s%s:%u: 
%s%sAssertion `%s' failed.\n%n", 
assertion=assertion@entry=0x7f95bd06d6c0 "0 == 
item->opal_list_item_refcount", 
file=file@entry=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", 
line=line@entry=547, 
function=function@entry=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> 
"_opal_list_append") at assert.c:92
#3  0x7f95bc66bc62 in __GI___assert_fail (assertion=0x7f95bd06d6c0 "0 == 
item->opal_list_item_refcount", 
file=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=547, 
function=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append") at 
assert.c:101
#4  0x7f95bcfbd975 in _opal_list_append (list=0x7f95bd2b9408 
, item=0x1f35be0, 
FILE_NAME=0x7f95bd06d718 
"../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c", LINENO=163)
at ../../../../../opal/class/opal_list.h:547
#5  0x7f95bcfbdd9a in process_barrier (fd=-1, args=4, cbdata=0x1f35ed0) at 
../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
#6  0x7f95bcca5831 in event_process_active_single_queue (base=0x1ef63a0, 
activeq=0x1ef6360)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#7  0x7f95bcca5aa3 in event_process_active (base=0x1ef63a0) at 
../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#8  0x7f95bcca611f in opal_libevent2021_event_base_loop (base=0x1ef63a0, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#9  0x7f95bcf5933b in orte_progress_thread_engine (obj=0x7f95bd2b9160 
) at ../../orte/runtime/orte_init.c:180
#10 0x7f95bca03f33 in start_thread (arg=0x7f95bbb0d700) at 
pthread_create.c:309
#11 0x7f95bc731ead in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

As far as I understand, it seems to call opal_list_append() twice at
orte/mca/grpcomm/bad/grpcomm_bad_module.c:163

opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);
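To illustrate what I think the assertion is guarding against, here is a minimal
sketch of a list whose items carry a reference count, so that appending an item
that is already on a list can be detected. This is not the OPAL code: the
sketch_* types and functions are made up for illustration, and the sketch
returns false where the debug build of _opal_list_append() would assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a list item; NOT the real opal_list_item_t. */
typedef struct sketch_item {
    struct sketch_item *next;
    struct sketch_item *prev;
    int refcount;              /* number of lists this item is currently on */
} sketch_item_t;

/* Hypothetical stand-in for a list; a circular list with a sentinel node. */
typedef struct {
    sketch_item_t sentinel;
    size_t length;
} sketch_list_t;

static void sketch_item_init(sketch_item_t *item)
{
    item->next = NULL;
    item->prev = NULL;
    item->refcount = 0;
}

static void sketch_list_init(sketch_list_t *list)
{
    sketch_item_init(&list->sentinel);
    list->sentinel.next = &list->sentinel;
    list->sentinel.prev = &list->sentinel;
    list->length = 0;
}

/* Append item at the tail. Returns false (and leaves the list untouched)
 * if the item is already on a list, which is the condition the failing
 * assertion `0 == item->opal_list_item_refcount` appears to check. */
static bool sketch_list_append(sketch_list_t *list, sketch_item_t *item)
{
    if (0 != item->refcount) {
        return false;
    }
    item->prev = list->sentinel.prev;
    item->next = &list->sentinel;
    list->sentinel.prev->next = item;
    list->sentinel.prev = item;
    item->refcount++;
    list->length++;
    return true;
}

/* Unlink item from list and drop its reference count back to zero. */
static void sketch_list_remove(sketch_list_t *list, sketch_item_t *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = NULL;
    item->prev = NULL;
    item->refcount--;
    list->length--;
}
```

In this sketch a second sketch_list_append() of the same item fails until the
item has been removed again. If the real list behaves the same way, the abort
would mean the same collective object reaches the append at line 163 while it
is still on a list.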

I have no idea how to fix this.

Adrian