Re: [Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)

2006-04-20 Thread Andrew Beekhof
On 4/20/06, Dejan Muhamedagic [EMAIL PROTECTED] wrote:
 Hello,

 Running CTS with HEAD hung the cluster after crmd dumped core
 (abort).  It happened after 53 tests with this curious message:

 Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
 mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: 
 fsa_our_dc_version != NULL

We have two kinds of asserts... neither is supposed to happen, and
both create a core file so that we can diagnose how we got there.
However, non-fatal ones call fork first (so the main process doesn't
die) and then take some recovery action.

Sometimes the non-fatal varieties are used in new pieces of code to
make sure they work as we expect, and that is what has happened here.
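
In rough terms, the non-fatal path looks like the sketch below (a
minimal illustration only, not the actual crm_abort() code from
utils.c; the function name and logging here are invented):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch of a fork-first, non-fatal assert.  The child aborts
     * and leaves a core file for diagnosis; the parent logs the
     * failure and carries on with recovery. */
    static void nonfatal_assert(int condition, const char *expr)
    {
        if (condition) {
            return;                     /* assertion holds */
        }
        if (fork() == 0) {
            abort();                    /* child dumps core */
        }
        fprintf(stderr, "non-fatal assert failed: %s\n", expr);
        /* parent continues; recovery action would go here */
    }

That would also explain the "Exiting untracked process" line in the
logs: the child exists only to abort(), so crmd sees an unregistered
child die with a core dump.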

Do you still have the core file?
I'd be interested to know the result of:
   print *op
from frame #4
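
For reference, pulling that out of the core would look roughly like
this (binary path taken from the backtrace later in the thread; the
frame number may differ from core to core):

    $ gdb /usr/lib/heartbeat/crmd core
    (gdb) frame 4
    (gdb) print *op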

In the meantime, I'll look at the logs and see what I can figure out.

 Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
 Exiting untracked process process 19654 dumped core
 Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
 mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!

 The cluster looks like this, unchanged for several hours:

 ============
 Last updated: Thu Apr 20 04:43:47 2006
 Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
 3 Nodes configured.
 3 Resources configured.
 ============

 Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
 Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
 Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

 Resource Group: group_1
     IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
     LVM_2           (heartbeat::ocf:LVM):           Stopped
     Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
 Resource Group: group_2
     IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
     LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
     Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
 Resource Group: group_3
     IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
     LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
     Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03

 And:

 sapcl01# crmadmin -S sapcl01
 Status of crmd@sapcl01: S_TERMINATE (ok)

 All processes are still running on this node, but heartbeat seems
 to be in some kind of limbo.

 Cheers,

 Dejan


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)

2006-04-20 Thread Andrew Beekhof
On 4/20/06, Andrew Beekhof [EMAIL PROTECTED] wrote:
 On 4/20/06, Dejan Muhamedagic [EMAIL PROTECTED] wrote:
  Hello,
 
  Running CTS with HEAD hung the cluster after crmd dumped core
  (abort).  It happened after 53 tests with this curious message:
 
  Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
  mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
  lrm.c:349: fsa_our_dc_version != NULL

 We have two kinds of asserts... neither is supposed to happen, and
 both create a core file so that we can diagnose how we got there.
 However, non-fatal ones call fork first (so the main process doesn't
 die) and then take some recovery action.

 Sometimes the non-fatal varieties are used in new pieces of code to
 make sure they work as we expect, and that is what has happened here.

 Do you still have the core file?
 I'd be interested to know the result of:
print *op
 from frame #4

 In the meantime, I'll look at the logs and see what I can figure out.

There is also a script Alan wrote to easily extract test data:
   /usr/lib/heartbeat/cts/extracttests.py

Can you tell me what test was being performed at the time you hit the assert?
Do you have logs from further back?


  Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
  Exiting untracked process process 19654 dumped core
  Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
  mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
 
  The cluster looks like this, unchanged for several hours:

  ============
  Last updated: Thu Apr 20 04:43:47 2006
  Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
  3 Nodes configured.
  3 Resources configured.
  ============
 
  Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
  Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
  Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
 
  Resource Group: group_1
      IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
      LVM_2           (heartbeat::ocf:LVM):           Stopped
      Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
  Resource Group: group_2
      IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
      LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
      Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
  Resource Group: group_3
      IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
      LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
      Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03
 
  And:
 
  sapcl01# crmadmin -S sapcl01
  Status of crmd@sapcl01: S_TERMINATE (ok)
 
  All processes are still running on this node, but heartbeat seems
  to be in some kind of limbo.
 
  Cheers,
 
  Dejan
 
 
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)

2006-04-20 Thread Dejan Muhamedagic
Hi,

#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, 
    src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347

(gdb) print *0x8282b98
$1 = 136448640
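
That dereferences the address as a plain int and prints only the first
word of the structure; to get the full dump Andrew asked for, one would
select the frame where op is in scope, or cast the address (assuming op
is an lrm_op_t *, the operation type in heartbeat's LRM API):

    (gdb) frame 4
    (gdb) print *op
    (gdb) print *(lrm_op_t *) 0x8282b98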

If you want, I can send you the core off-list. I keep all the cores :)

Cheers,

Dejan

On Thu, Apr 20, 2006 at 09:15:28AM +0200, Andrew Beekhof wrote:
 On 4/20/06, Dejan Muhamedagic [EMAIL PROTECTED] wrote:
  Hello,
 
  Running CTS with HEAD hung the cluster after crmd dumped core
  (abort).  It happened after 53 tests with this curious message:
 
  Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
  mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
  lrm.c:349: fsa_our_dc_version != NULL
 
 We have two kinds of asserts... neither is supposed to happen, and
 both create a core file so that we can diagnose how we got there.
 However, non-fatal ones call fork first (so the main process doesn't
 die) and then take some recovery action.

 Sometimes the non-fatal varieties are used in new pieces of code to
 make sure they work as we expect, and that is what has happened here.
 
 Do you still have the core file?
 I'd be interested to know the result of:
print *op
 from frame #4
 
 In the meantime, I'll look at the logs and see what I can figure out.
 
  Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
  Exiting untracked process process 19654 dumped core
  Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
  mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
 
  The cluster looks like this, unchanged for several hours:

  ============
  Last updated: Thu Apr 20 04:43:47 2006
  Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
  3 Nodes configured.
  3 Resources configured.
  ============
 
  Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
  Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
  Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
 
  Resource Group: group_1
      IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
      LVM_2           (heartbeat::ocf:LVM):           Stopped
      Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
  Resource Group: group_2
      IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
      LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
      Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
  Resource Group: group_3
      IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
      LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
      Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03
 
  And:
 
  sapcl01# crmadmin -S sapcl01
  Status of crmd@sapcl01: S_TERMINATE (ok)
 
  All processes are still running on this node, but heartbeat seems
  to be in some kind of limbo.
 
  Cheers,
 
  Dejan
 
 
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] core dump (abort) in crmd: untracked process (HEAD)

2006-04-19 Thread Dejan Muhamedagic
Hello,

Running CTS with HEAD hung the cluster after crmd dumped core
(abort).  It happened after 53 tests with this curious message:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: 
fsa_our_dc_version != NULL
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting 
untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!

The cluster looks like this, unchanged for several hours:

============
Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.
============

Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
    IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_2           (heartbeat::ocf:LVM):           Stopped
    Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
Resource Group: group_2
    IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
    LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
    Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
Resource Group: group_3
    IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
    Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03

And:

sapcl01# crmadmin -S sapcl01
Status of crmd@sapcl01: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems
to be in some kind of limbo.

Cheers,

Dejan
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
#0  0xe410 in __kernel_vsyscall ()
#0  0xe410 in __kernel_vsyscall ()
#1  0x40284581 in raise () from /lib/tls/libc.so.6
#2  0x40285e65 in abort () from /lib/tls/libc.so.6
#3  0x40059488 in crm_abort (file=0x806859d "lrm.c", 
    function=0x80687c6 "build_operation_update", line=349, 
    assert_condition=0x806881d "fsa_our_dc_version != NULL", do_fork=1)
    at utils.c:1201
#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, 
    src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
#5  0x0805db31 in do_update_resource (op=0x8282b98) at lrm.c:1383
#6  0x0805e0f7 in do_lrm_event (action=576460752303423488, 
cause=C_LRM_OP_CALLBACK, cur_state=S_INTEGRATION, cur_input=I_LRM_EVENT, 
msg_data=0x8234d68) at lrm.c:1514
#7  0x0804b572 in do_fsa_action (fsa_data=0x8234d68, 
    an_action=576460752303423488, function=0x805dc31 <do_lrm_event>)
    at fsa.c:178
#8  0x0804c805 in s_crmd_fsa_actions (fsa_data=0x8234d68) at fsa.c:512
#9  0x0804bb36 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:315
#10 0x08055264 in crm_fsa_trigger (user_data=0x0) at callbacks.c:647
#11 0x4002987c in G_TRIG_dispatch (source=0x8072de8, callback=0, user_data=0x0)
at GSource.c:1417
#12 0x400b29ca in g_main_context_dispatch ()
   from /opt/gnome/lib/libglib-2.0.so.0
#13 0x400b4adb in g_main_context_iterate ()
   from /opt/gnome/lib/libglib-2.0.so.0
#14 0x400b4d07 in g_main_loop_run () from /opt/gnome/lib/libglib-2.0.so.0
#15 0x0804af9b in init_start () at main.c:137
#16 0x0804aec6 in main (argc=1, argv=0xb9f4) at main.c:104


cib.xml.gz
Description: Binary data


log.gz
Description: Binary data
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/