On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > Running CTS with HEAD hung the cluster after crmd dumped core
> > (abort). It happened after 53 tests with this curious message:
> >
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR:
> > mask(lrm.c:build_operation_update): Triggered non-fatal assert at
> > lrm.c:349: fsa_our_dc_version != NULL
>
> We have two kinds of asserts... neither is supposed to happen, and
> both create a core file so that we can diagnose how we got there.
> However, the non-fatal ones call fork first (so the main process
> doesn't die) and then take some recovery action.
>
> Sometimes the non-fatal varieties are used in new pieces of code to
> make sure they work as we expect, and that is what has happened here.
>
> Do you still have the core file?
> I'd be interested to know the result of:
>     print *op
> from frame #4
>
> In the meantime, I'll look at the logs and see what I can figure out.
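For anyone unfamiliar with the mechanism described above, a minimal
sketch of such a fork-based non-fatal assert could look like the
following. The macro name and the recover() hook are hypothetical,
not the actual crmd code; the condition quoted in the log at lrm.c:349
was fsa_our_dc_version != NULL.

    /* Sketch only: on failure, fork a child that calls abort() so a
     * core file is written with the full process state, while the
     * parent takes recovery action instead of dying. */
    #include <stdlib.h>
    #include <unistd.h>

    #define NON_FATAL_ASSERT(expr, recover) do {                \
            if (!(expr)) {                                      \
                if (fork() == 0) {                              \
                    abort();   /* child dumps core and exits */ \
                }                                               \
                recover();     /* parent carries on */          \
            }                                                   \
        } while (0)

The core file left behind by the child is what makes the "print *op"
request above possible.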
There is also a script Alan wrote to easily extract test data:
    /usr/lib/heartbeat/cts/extracttests.py

Can you tell me what test was being performed at the time you hit the
assert? Do you have logs from further back?

> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR:
> > Exiting untracked process process 19654 dumped core
> > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR:
> > mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> >
> > The cluster looks like this, unchanged for several hours:
> >
> > ============
> > Last updated: Thu Apr 20 04:43:47 2006
> > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > 3 Nodes configured.
> > 3 Resources configured.
> > ============
> >
> > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> >
> > Resource Group: group_1
> >     IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
> >     LVM_2           (heartbeat::ocf:LVM):           Stopped
> >     Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
> > Resource Group: group_2
> >     IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
> >     LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
> >     Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
> > Resource Group: group_3
> >     IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
> >     LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
> >     Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03
> >
> > And:
> >
> > sapcl01# crmadmin -S sapcl01
> > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> >
> > All processes are still running on this node, but heartbeat seems
> > to be in some kind of limbo.
> >
> > Cheers,
> >
> > Dejan

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/