Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.
Not sure I grok - are you saying you believe the assert is bogus? We haven't seen it elsewhere, but perhaps it only happens when configured and run with c/r? I'm happy to take a look if you can provide more specifics on how to make it happen.

On Jan 9, 2014, at 2:46 PM, Adrian Reber wrote:

> For my CR work this can probably be ignored. I think I was looking at the
> wrong place.
>
> On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote:
>> Continuing with the CR code I now get a crash which can be easily reproduced
>> using orte/test/system/orte_barrier.c
>>
>> I get:
>>
>> orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append:
>> Assertion `0 == item->opal_list_item_refcount' failed.
>> [...]
>> As far as I understand it seems to call opal_list_append() twice in
>> orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
>>
>> opal_list_append(&orte_grpcomm_base.activ
Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.
For my CR work this can probably be ignored. I think I was looking at the wrong place.

On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote:
> Continuing with the CR code I now get a crash which can be easily reproduced
> using orte/test/system/orte_barrier.c
>
> I get:
>
> orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append:
> Assertion `0 == item->opal_list_item_refcount' failed.
> [...]
> As far as I understand it seems to call opal_list_append() twice in
> orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
>
> opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);
>
> I have no idea how to fix this.
>
> Adrian
[OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.
Continuing with the CR code I now get a crash which can be easily reproduced
using orte/test/system/orte_barrier.c

I get:

orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append:
Assertion `0 == item->opal_list_item_refcount' failed.
[dcbz:05085] *** Process received signal ***
[dcbz:05085] Signal: Aborted (6)
[dcbz:05085] Signal code: (-6)
[dcbz:05085] [ 0] /lib64/libpthread.so.0(+0xf750)[0x7f95bca0b750]
[dcbz:05085] [ 1] /lib64/libc.so.6(gsignal+0x39)[0x7f95bc672c59]
[dcbz:05085] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95bc674368]
[dcbz:05085] [ 3] /lib64/libc.so.6(+0x2ebb6)[0x7f95bc66bbb6]
[dcbz:05085] [ 4] /lib64/libc.so.6(+0x2ec62)[0x7f95bc66bc62]
[dcbz:05085] [ 5] /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86975)[0x7f95bcfbd975]
[dcbz:05085] [ 6] /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86d9a)[0x7f95bcfbdd9a]
[dcbz:05085] [ 7] /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8c831)[0x7f95bcca5831]
[dcbz:05085] [ 8] /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8caa3)[0x7f95bcca5aa3]
[dcbz:05085] [ 9] /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x2c1)[0x7f95bcca611f]
[dcbz:05085] [10] /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x2233b)[0x7f95bcf5933b]
[dcbz:05085] [11] /lib64/libpthread.so.0(+0x7f33)[0x7f95bca03f33]
[dcbz:05085] [12] /lib64/libc.so.6(clone+0x6d)[0x7f95bc731ead]
[dcbz:05085] *** End of error message ***
--
orterun noticed that process rank 0 with PID 5085 on node dcbz exited on
signal 6 (Aborted).
--

and in gdb

[New LWP 5086]
[New LWP 5085]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `system/orte_barrier'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x7f95bc672c59 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f95bc6744a8 in __GI_abort () at abort.c:118
#2  0x7f95bc66bbb6 in __assert_fail_base (fmt=0x7f95bc7b8ea8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f95bd06d6c0 "0 == item->opal_list_item_refcount",
    file=file@entry=0x7f95bd06d600 "../../../../../opal/class/opal_list.h",
    line=line@entry=547,
    function=function@entry=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append")
    at assert.c:92
#3  0x7f95bc66bc62 in __GI___assert_fail (assertion=0x7f95bd06d6c0 "0 == item->opal_list_item_refcount",
    file=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=547,
    function=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append")
    at assert.c:101
#4  0x7f95bcfbd975 in _opal_list_append (list=0x7f95bd2b9408 , item=0x1f35be0,
    FILE_NAME=0x7f95bd06d718 "../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c", LINENO=163)
    at ../../../../../opal/class/opal_list.h:547
#5  0x7f95bcfbdd9a in process_barrier (fd=-1, args=4, cbdata=0x1f35ed0)
    at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
#6  0x7f95bcca5831 in event_process_active_single_queue (base=0x1ef63a0, activeq=0x1ef6360)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#7  0x7f95bcca5aa3 in event_process_active (base=0x1ef63a0)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#8  0x7f95bcca611f in opal_libevent2021_event_base_loop (base=0x1ef63a0, flags=1)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#9  0x7f95bcf5933b in orte_progress_thread_engine (obj=0x7f95bd2b9160 )
    at ../../orte/runtime/orte_init.c:180
#10 0x7f95bca03f33 in start_thread (arg=0x7f95bbb0d700) at pthread_create.c:309
#11 0x7f95bc731ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

As far as I understand it seems to call opal_list_append() twice in
orte/mca/grpcomm/bad/grpcomm_bad_module.c:163

    opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);

I have no idea how to fix this.

Adrian
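
For anyone reading along who has not met this assertion before: judging from the message text alone, the check in _opal_list_append appears to require that an item is not already on a list when it is appended. Below is a rough sketch of that idea using a toy intrusive list. It is not the OPAL code; every toy_* name is made up for illustration, and it assumes the counter simply tracks how many lists the item is currently linked into. Instead of aborting, the toy append returns -1 so the double append can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-ins for a list item and a list; all names are hypothetical. */
typedef struct toy_item {
    struct toy_item *next;
    struct toy_item *prev;
    int refcount;            /* assumed: number of lists this item is on */
} toy_item_t;

typedef struct toy_list {
    toy_item_t sentinel;     /* circular list, sentinel.next is the head */
    size_t length;
} toy_list_t;

static void toy_list_init(toy_list_t *list)
{
    list->sentinel.next = &list->sentinel;
    list->sentinel.prev = &list->sentinel;
    list->sentinel.refcount = 0;
    list->length = 0;
}

static void toy_item_init(toy_item_t *item)
{
    item->next = NULL;
    item->prev = NULL;
    item->refcount = 0;
}

/* Append only if the item is not linked anywhere yet.
 * Returns 0 on success, -1 if the item is still on a list. This is the
 * same condition as `0 == item->refcount`, reported instead of asserted. */
static int toy_list_append(toy_list_t *list, toy_item_t *item)
{
    if (0 != item->refcount) {
        return -1;
    }
    item->prev = list->sentinel.prev;
    item->next = &list->sentinel;
    list->sentinel.prev->next = item;
    list->sentinel.prev = item;
    item->refcount++;
    list->length++;
    return 0;
}

/* Unlink an item that is on exactly one list; returns -1 otherwise. */
static int toy_list_remove(toy_list_t *list, toy_item_t *item)
{
    if (1 != item->refcount) {
        return -1;
    }
    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = NULL;
    item->prev = NULL;
    item->refcount--;
    list->length--;
    return 0;
}

#ifdef TOY_LIST_DEMO_MAIN
int main(void)
{
    toy_list_t active;
    toy_item_t coll;

    toy_list_init(&active);
    toy_item_init(&coll);

    printf("first append:  %d\n", toy_list_append(&active, &coll));
    printf("second append: %d\n", toy_list_append(&active, &coll));
    return 0;
}
#endif
```

Compiling with -DTOY_LIST_DEMO_MAIN gives a small program that appends the same item twice. If the real failure is a second append of the same coll object, the usual options would be to remove it from the list before re-appending or to check whether it is already tracked, but which one is right depends on who owns the collective at that point.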