Hello,

Running CTS with HEAD hung the cluster after crmd dumped core
(abort).  It happened after 53 tests with these curious messages:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: fsa_our_dc_version != NULL
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!

The cluster looks like this, unchanged for several hours:

============
Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.
============

Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
    IPaddr_1    (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_2       (heartbeat::ocf:LVM):   Stopped 
    Filesystem_3        (heartbeat::ocf:Filesystem):    Stopped 
Resource Group: group_2
    IPaddr_2    (heartbeat::ocf:IPaddr):        Started sapcl02
    LVM_3       (heartbeat::ocf:LVM):   Started sapcl02
    Filesystem_4        (heartbeat::ocf:Filesystem):    Started sapcl02
Resource Group: group_3
    IPaddr_3    (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_4       (heartbeat::ocf:LVM):   Started sapcl03
    Filesystem_5        (heartbeat::ocf:Filesystem):    Started sapcl03

And:

sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems
to be in some kind of limbo.

Cheers,

Dejan

Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
#0  0xffffe410 in __kernel_vsyscall ()
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x40284581 in raise () from /lib/tls/libc.so.6
#2  0x40285e65 in abort () from /lib/tls/libc.so.6
#3  0x40059488 in crm_abort (file=0x806859d "lrm.c", 
    function=0x80687c6 "build_operation_update", line=349, 
    assert_condition=0x806881d "fsa_our_dc_version != NULL", do_fork=1)
    at utils.c:1201
#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, 
    src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
#5  0x0805db31 in do_update_resource (op=0x8282b98) at lrm.c:1383
#6  0x0805e0f7 in do_lrm_event (action=576460752303423488, 
    cause=C_LRM_OP_CALLBACK, cur_state=S_INTEGRATION, cur_input=I_LRM_EVENT, 
    msg_data=0x8234d68) at lrm.c:1514
#7  0x0804b572 in do_fsa_action (fsa_data=0x8234d68, 
    an_action=576460752303423488, function=0x805dc31 <do_lrm_event>)
    at fsa.c:178
#8  0x0804c805 in s_crmd_fsa_actions (fsa_data=0x8234d68) at fsa.c:512
#9  0x0804bb36 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:315
#10 0x08055264 in crm_fsa_trigger (user_data=0x0) at callbacks.c:647
#11 0x4002987c in G_TRIG_dispatch (source=0x8072de8, callback=0, user_data=0x0)
    at GSource.c:1417
#12 0x400b29ca in g_main_context_dispatch ()
   from /opt/gnome/lib/libglib-2.0.so.0
#13 0x400b4adb in g_main_context_iterate ()
   from /opt/gnome/lib/libglib-2.0.so.0
#14 0x400b4d07 in g_main_loop_run () from /opt/gnome/lib/libglib-2.0.so.0
#15 0x0804af9b in init_start () at main.c:137
#16 0x0804aec6 in main (argc=1, argv=0xbffff9f4) at main.c:104
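
A note on reading the trace: frame #3 shows crm_abort() called with
do_fork=1, and the log reports that an untracked process (19654) dumped
core while crmd (17937) kept logging afterwards. That suggests the core
came from a forked child rather than from crmd itself. The sketch below
illustrates that general fork-then-abort idea for a non-fatal assert. It
is only an illustration under that assumption, not the heartbeat code:
soft_assert_failed() and SOFT_ASSERT() are hypothetical names.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (NOT heartbeat's crm_abort): report a failed
 * condition, fork a child that calls abort() so that a core image of
 * the current state can be written, then let the parent carry on.
 *
 * Returns the signal that terminated the child (SIGABRT is expected),
 * 0 if the child exited normally, or -1 on fork/waitpid failure.
 */
static int soft_assert_failed(const char *file, const char *function,
                              int line, const char *cond)
{
    pid_t pid;
    int status = 0;

    fprintf(stderr, "%s: Triggered non-fatal assert at %s:%d: %s\n",
            function, file, line, cond);
    fflush(NULL);               /* avoid duplicated buffered output */

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        abort();                /* child: raise SIGABRT, may dump core */
    }

    /* parent: reap the child so it does not linger as a zombie */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Evaluates to 0 when expr holds; otherwise reports and forks a child. */
#define SOFT_ASSERT(expr) \
    ((expr) ? 0 : soft_assert_failed(__FILE__, __func__, __LINE__, #expr))
```

Calling SOFT_ASSERT(ptr != NULL) with a NULL ptr returns SIGABRT in the
caller, which keeps running. Whether a core file is actually written
depends on the core size limit in effect for the process.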

Attachment: cib.xml.gz
Description: Binary data

Attachment: log.gz
Description: Binary data

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
