Re: [Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)
On 4/20/06, Dejan Muhamedagic [EMAIL PROTECTED] wrote:
> Hello,
>
> Running CTS with HEAD hung the cluster after crmd dumped core (abort). It happened after 53 tests with this curious message:
>
> Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL

We have two kinds of asserts... neither is supposed to happen, and both create a core file so that we can diagnose how we got there. However, non-fatal ones call fork first (so the main process doesn't die) and then take some recovery action.

Sometimes the non-fatal varieties are used in new pieces of code to make sure they work as we expect, and that is what has happened here.

Do you still have the core file? I'd be interested to know the result of:

print *op

from frame #4.

In the meantime, I'll look at the logs and see what I can figure out.

> Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
> Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
>
> The cluster looks like this, unchanged for several hours:
>
> Last updated: Thu Apr 20 04:43:47 2006
> Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> 3 Nodes configured.
> 3 Resources configured.
>
> Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
>
> Resource Group: group_1
>     IPaddr_1 (heartbeat::ocf:IPaddr): Started sapcl03
>     LVM_2 (heartbeat::ocf:LVM): Stopped
>     Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
> Resource Group: group_2
>     IPaddr_2 (heartbeat::ocf:IPaddr): Started sapcl02
>     LVM_3 (heartbeat::ocf:LVM): Started sapcl02
>     Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
> Resource Group: group_3
>     IPaddr_3 (heartbeat::ocf:IPaddr): Started sapcl03
>     LVM_4 (heartbeat::ocf:LVM): Started sapcl03
>     Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03
>
> And:
>
> sapcl01# crmadmin -S sapcl01
> Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
>
> All processes are still running on this node, but heartbeat seems to be in some kind of limbo.
>
> Cheers,
>
> Dejan

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
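For readers unfamiliar with the fork-then-abort pattern described in this reply, here is a minimal C sketch of the idea. It is not the actual crm_abort() code: the names soft_abort() and SOFT_ASSERT are made up for illustration, and the sketch reaps the child itself with waitpid(), whereas crmd appears to leave that to its normal child-process handling (presumably the source of the "Exiting untracked process" message).

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real crm_abort()): log a failed check, then
 * abort() in a forked child so a core file can be written while the caller
 * keeps running. Returns 1 if the child died from SIGABRT, -1 otherwise.
 */
static int soft_abort(const char *file, const char *function, int line,
                      const char *condition)
{
    int status = 0;
    pid_t pid;

    fprintf(stderr, "%s: Triggered non-fatal assert at %s:%d: %s\n",
            function, file, line, condition);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;              /* could not fork: carry on without a core */
    }
    if (pid == 0) {
        abort();                /* child: raise SIGABRT, may dump core */
    }

    /* Parent: reap the child here so the sketch stays self-contained. */
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) ? 1 : -1;
}

/* Evaluates to 0 when expr holds; otherwise forks, aborts the child, and
 * yields soft_abort()'s result so the caller can take recovery action. */
#define SOFT_ASSERT(expr) \
    ((expr) ? 0 : soft_abort(__FILE__, __func__, __LINE__, #expr))
```

A caller would write something like `if (SOFT_ASSERT(version != NULL)) { /* recover */ }`. Whether a core file is actually written depends on the system's core-size limit, which the sketch does not touch.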
Re: [Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)
On 4/20/06, Andrew Beekhof [EMAIL PROTECTED] wrote:
> On 4/20/06, Dejan Muhamedagic [EMAIL PROTECTED] wrote:
> > Hello,
> >
> > Running CTS with HEAD hung the cluster after crmd dumped core (abort). It happened after 53 tests with this curious message:
> >
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL
>
> We have two kinds of asserts... neither is supposed to happen, and both create a core file so that we can diagnose how we got there. However, non-fatal ones call fork first (so the main process doesn't die) and then take some recovery action.
>
> Sometimes the non-fatal varieties are used in new pieces of code to make sure they work as we expect, and that is what has happened here.
>
> Do you still have the core file? I'd be interested to know the result of:
>
> print *op
>
> from frame #4.
>
> In the meantime, I'll look at the logs and see what I can figure out.

There is also a script Alan wrote to easily extract test data:

/usr/lib/heartbeat/cts/extracttests.py

Can you tell me what test was being performed at the time you hit the assert? Do you have logs from further back?

> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
> > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> >
> > The cluster looks like this, unchanged for several hours:
> >
> > Last updated: Thu Apr 20 04:43:47 2006
> > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > 3 Nodes configured.
> > 3 Resources configured.
> >
> > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> >
> > Resource Group: group_1
> >     IPaddr_1 (heartbeat::ocf:IPaddr): Started sapcl03
> >     LVM_2 (heartbeat::ocf:LVM): Stopped
> >     Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
> > Resource Group: group_2
> >     IPaddr_2 (heartbeat::ocf:IPaddr): Started sapcl02
> >     LVM_3 (heartbeat::ocf:LVM): Started sapcl02
> >     Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
> > Resource Group: group_3
> >     IPaddr_3 (heartbeat::ocf:IPaddr): Started sapcl03
> >     LVM_4 (heartbeat::ocf:LVM): Started sapcl03
> >     Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03
> >
> > And:
> >
> > sapcl01# crmadmin -S sapcl01
> > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> >
> > All processes are still running on this node, but heartbeat seems to be in some kind of limbo.
> >
> > Cheers,
> >
> > Dejan
Re: [Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)
Hi,

#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
(gdb) print *0x8282b98
$1 = 136448640

If you want, I can send you the core off list. I keep all the cores :)

Cheers,

Dejan

On Thu, Apr 20, 2006 at 09:15:28AM +0200, Andrew Beekhof wrote:
> On 4/20/06, Dejan Muhamedagic [EMAIL PROTECTED] wrote:
> > Hello,
> >
> > Running CTS with HEAD hung the cluster after crmd dumped core (abort). It happened after 53 tests with this curious message:
> >
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL
>
> We have two kinds of asserts... neither is supposed to happen, and both create a core file so that we can diagnose how we got there. However, non-fatal ones call fork first (so the main process doesn't die) and then take some recovery action.
>
> Sometimes the non-fatal varieties are used in new pieces of code to make sure they work as we expect, and that is what has happened here.
>
> Do you still have the core file? I'd be interested to know the result of:
>
> print *op
>
> from frame #4.
>
> In the meantime, I'll look at the logs and see what I can figure out.
>
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
> > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> >
> > The cluster looks like this, unchanged for several hours:
> >
> > Last updated: Thu Apr 20 04:43:47 2006
> > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > 3 Nodes configured.
> > 3 Resources configured.
> >
> > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> >
> > Resource Group: group_1
> >     IPaddr_1 (heartbeat::ocf:IPaddr): Started sapcl03
> >     LVM_2 (heartbeat::ocf:LVM): Stopped
> >     Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
> > Resource Group: group_2
> >     IPaddr_2 (heartbeat::ocf:IPaddr): Started sapcl02
> >     LVM_3 (heartbeat::ocf:LVM): Started sapcl02
> >     Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
> > Resource Group: group_3
> >     IPaddr_3 (heartbeat::ocf:IPaddr): Started sapcl03
> >     LVM_4 (heartbeat::ocf:LVM): Started sapcl03
> >     Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03
> >
> > And:
> >
> > sapcl01# crmadmin -S sapcl01
> > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> >
> > All processes are still running on this node, but heartbeat seems to be in some kind of limbo.
> >
> > Cheers,
> >
> > Dejan
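A side note on the gdb output in this message: as far as I know, `print *0x8282b98` dereferences a bare address, which gdb treats as pointing at an int, so the result is only the first machine word of the object rather than its fields. The value 136448640 is 0x8220a80 in hex, so it may well be a pointer stored in the first member. Selecting the frame first (`frame 4`) and then running `print *op` should show the typed structure instead. The C sketch below illustrates the same effect; `struct demo_op` and `first_word()` are made-up names for illustration and not the real type of op.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an operation record; not the real type of op. */
struct demo_op {
    const char *name;   /* first member: a pointer */
    int interval;
};

/* Return the first pointer-sized word stored at addr, ignoring its type,
 * much like dereferencing a bare address in a debugger. */
static uintptr_t first_word(const void *addr)
{
    uintptr_t word;

    memcpy(&word, addr, sizeof word);
    return word;
}
```

Given `struct demo_op op = { "monitor", 10 };`, the call `first_word(&op)` yields the numeric value of `op.name` on common platforms: a number that looks arbitrary until it is read as an address.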
[Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)
Hello,

Running CTS with HEAD hung the cluster after crmd dumped core (abort). It happened after 53 tests with this curious message:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!

The cluster looks like this, unchanged for several hours:

Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.

Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
    IPaddr_1 (heartbeat::ocf:IPaddr): Started sapcl03
    LVM_2 (heartbeat::ocf:LVM): Stopped
    Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
Resource Group: group_2
    IPaddr_2 (heartbeat::ocf:IPaddr): Started sapcl02
    LVM_3 (heartbeat::ocf:LVM): Started sapcl02
    Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
Resource Group: group_3
    IPaddr_3 (heartbeat::ocf:IPaddr): Started sapcl03
    LVM_4 (heartbeat::ocf:LVM): Started sapcl03
    Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03

And:

sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems to be in some kind of limbo.

Cheers,

Dejan

Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
#0  0xe410 in __kernel_vsyscall ()
#0  0xe410 in __kernel_vsyscall ()
#1  0x40284581 in raise () from /lib/tls/libc.so.6
#2  0x40285e65 in abort () from /lib/tls/libc.so.6
#3  0x40059488 in crm_abort (file=0x806859d "lrm.c", function=0x80687c6 "build_operation_update", line=349, assert_condition=0x806881d "fsa_our_dc_version != NULL", do_fork=1) at utils.c:1201
#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
#5  0x0805db31 in do_update_resource (op=0x8282b98) at lrm.c:1383
#6  0x0805e0f7 in do_lrm_event (action=576460752303423488, cause=C_LRM_OP_CALLBACK, cur_state=S_INTEGRATION, cur_input=I_LRM_EVENT, msg_data=0x8234d68) at lrm.c:1514
#7  0x0804b572 in do_fsa_action (fsa_data=0x8234d68, an_action=576460752303423488, function=0x805dc31 <do_lrm_event>) at fsa.c:178
#8  0x0804c805 in s_crmd_fsa_actions (fsa_data=0x8234d68) at fsa.c:512
#9  0x0804bb36 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:315
#10 0x08055264 in crm_fsa_trigger (user_data=0x0) at callbacks.c:647
#11 0x4002987c in G_TRIG_dispatch (source=0x8072de8, callback=0, user_data=0x0) at GSource.c:1417
#12 0x400b29ca in g_main_context_dispatch () from /opt/gnome/lib/libglib-2.0.so.0
#13 0x400b4adb in g_main_context_iterate () from /opt/gnome/lib/libglib-2.0.so.0
#14 0x400b4d07 in g_main_loop_run () from /opt/gnome/lib/libglib-2.0.so.0
#15 0x0804af9b in init_start () at main.c:137
#16 0x0804aec6 in main (argc=1, argv=0xb9f4) at main.c:104

cib.xml.gz
Description: Binary data

log.gz
Description: Binary data