Hello,

Running CTS with HEAD hung the cluster after crmd dumped core (abort). It happened after 53 tests, with these curious messages:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!

The cluster looks like this, unchanged for several hours:

============
Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.
============

Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
    IPaddr_1 (heartbeat::ocf:IPaddr): Started sapcl03
    LVM_2 (heartbeat::ocf:LVM): Stopped
    Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
Resource Group: group_2
    IPaddr_2 (heartbeat::ocf:IPaddr): Started sapcl02
    LVM_3 (heartbeat::ocf:LVM): Started sapcl02
    Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
Resource Group: group_3
    IPaddr_3 (heartbeat::ocf:IPaddr): Started sapcl03
    LVM_4 (heartbeat::ocf:LVM): Started sapcl03
    Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03

And:

sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems to be in some kind of limbo.

Cheers,

Dejan

Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
#0  0xffffe410 in __kernel_vsyscall ()
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x40284581 in raise () from /lib/tls/libc.so.6
#2  0x40285e65 in abort () from /lib/tls/libc.so.6
#3  0x40059488 in crm_abort (file=0x806859d "lrm.c", function=0x80687c6 "build_operation_update", line=349, assert_condition=0x806881d "fsa_our_dc_version != NULL", do_fork=1) at utils.c:1201
#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
#5  0x0805db31 in do_update_resource (op=0x8282b98) at lrm.c:1383
#6  0x0805e0f7 in do_lrm_event (action=576460752303423488, cause=C_LRM_OP_CALLBACK, cur_state=S_INTEGRATION, cur_input=I_LRM_EVENT, msg_data=0x8234d68) at lrm.c:1514
#7  0x0804b572 in do_fsa_action (fsa_data=0x8234d68, an_action=576460752303423488, function=0x805dc31 <do_lrm_event>) at fsa.c:178
#8  0x0804c805 in s_crmd_fsa_actions (fsa_data=0x8234d68) at fsa.c:512
#9  0x0804bb36 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:315
#10 0x08055264 in crm_fsa_trigger (user_data=0x0) at callbacks.c:647
#11 0x4002987c in G_TRIG_dispatch (source=0x8072de8, callback=0, user_data=0x0) at GSource.c:1417
#12 0x400b29ca in g_main_context_dispatch () from /opt/gnome/lib/libglib-2.0.so.0
#13 0x400b4adb in g_main_context_iterate () from /opt/gnome/lib/libglib-2.0.so.0
#14 0x400b4d07 in g_main_loop_run () from /opt/gnome/lib/libglib-2.0.so.0
#15 0x0804af9b in init_start () at main.c:137
#16 0x0804aec6 in main (argc=1, argv=0xbffff9f4) at main.c:104
cib.xml.gz
Description: Binary data
log.gz
Description: Binary data