Hello,

Running CTS with HEAD hung the cluster after crmd dumped core (abort). It happened after 53 tests, with this curious message:
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
The cluster looks like this, unchanged for several hours:
============
Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.
============
Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
Resource Group: group_1
IPaddr_1 (heartbeat::ocf:IPaddr): Started sapcl03
LVM_2 (heartbeat::ocf:LVM): Stopped
Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
Resource Group: group_2
IPaddr_2 (heartbeat::ocf:IPaddr): Started sapcl02
LVM_3 (heartbeat::ocf:LVM): Started sapcl02
Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
Resource Group: group_3
IPaddr_3 (heartbeat::ocf:IPaddr): Started sapcl03
LVM_4 (heartbeat::ocf:LVM): Started sapcl03
Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03
And:
sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
All processes are still running on this node, but heartbeat seems
to be in some kind of limbo.
Cheers,
Dejan
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x40284581 in raise () from /lib/tls/libc.so.6
#2 0x40285e65 in abort () from /lib/tls/libc.so.6
#3 0x40059488 in crm_abort (file=0x806859d "lrm.c",
function=0x80687c6 "build_operation_update", line=349,
assert_condition=0x806881d "fsa_our_dc_version != NULL", do_fork=1)
at utils.c:1201
#4 0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98,
src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
#5 0x0805db31 in do_update_resource (op=0x8282b98) at lrm.c:1383
#6 0x0805e0f7 in do_lrm_event (action=576460752303423488,
cause=C_LRM_OP_CALLBACK, cur_state=S_INTEGRATION, cur_input=I_LRM_EVENT,
msg_data=0x8234d68) at lrm.c:1514
#7 0x0804b572 in do_fsa_action (fsa_data=0x8234d68,
an_action=576460752303423488, function=0x805dc31 <do_lrm_event>)
at fsa.c:178
#8 0x0804c805 in s_crmd_fsa_actions (fsa_data=0x8234d68) at fsa.c:512
#9 0x0804bb36 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:315
#10 0x08055264 in crm_fsa_trigger (user_data=0x0) at callbacks.c:647
#11 0x4002987c in G_TRIG_dispatch (source=0x8072de8, callback=0, user_data=0x0)
at GSource.c:1417
#12 0x400b29ca in g_main_context_dispatch ()
from /opt/gnome/lib/libglib-2.0.so.0
#13 0x400b4adb in g_main_context_iterate ()
from /opt/gnome/lib/libglib-2.0.so.0
#14 0x400b4d07 in g_main_loop_run () from /opt/gnome/lib/libglib-2.0.so.0
#15 0x0804af9b in init_start () at main.c:137
#16 0x0804aec6 in main (argc=1, argv=0xbffff9f4) at main.c:104
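Frame #3 shows the assert going through crm_abort() with do_fork=1, which would explain why the log calls it "non-fatal" and yet a core comes from a separate process (19654) rather than from crmd itself (17937). Below is a minimal C sketch of that fork-then-abort pattern, assuming that is what do_fork means; soft_assert_fork and SOFT_ASSERT are hypothetical names, not the actual code in utils.c:

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a "non-fatal assert": log the failed condition,
 * fork a child, and let only the child abort() so it can leave a core file.
 * The parent keeps running. This is an illustration under assumptions,
 * NOT heartbeat's crm_abort().
 *
 * Returns the child's wait status, or -1 if fork() or waitpid() failed.
 */
static int soft_assert_fork(const char *file, const char *function,
                            int line, const char *condition)
{
    fprintf(stderr, "ERROR: (%s:%s): Triggered non-fatal assert at %s:%d: %s\n",
            file, function, file, line, condition);
    fflush(stderr);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;              /* could not fork: just carry on */
    }
    if (pid == 0) {
        abort();                /* child: SIGABRT, core dump if ulimit allows */
    }

    /* Parent: reap the child synchronously to keep the sketch self-contained. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return status;
}

/*
 * Hypothetical call-site macro: evaluates to 0 if the condition holds,
 * otherwise to the aborted child's wait status (or -1 on failure).
 * #expr yields the condition text, like assert_condition in frame #3.
 */
#define SOFT_ASSERT(expr) \
    ((expr) ? 0 : soft_assert_fork(__FILE__, __func__, __LINE__, #expr))
```

The sketch waits for the child with waitpid(); the "Exiting untracked process" log line suggests crmd instead reaps such children asynchronously. Whether a core file is actually written depends on the core-size ulimit of the process.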
Attachments: cib.xml.gz, log.gz
_______________________________________________________ Linux-HA-Dev: [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
