On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Running CTS with HEAD hanged the cluster after crmd dumped core
> (abort).  It happened after 53 tests with this curious message:
>
> Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: 
> fsa_our_dc_version != NULL

We have two kinds of asserts; neither is supposed to happen, and
both create a core file so that we can diagnose how we got there.
The non-fatal ones, however, call fork() first (so the main process
doesn't die) and then take some recovery action.

Sometimes the non-fatal variety is used in new pieces of code to make
sure they work as we expect, and that is what has happened here.

Do you still have the core file?
I'd be interested to know the result of:
   print *op
from frame #4
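For the record, that would look something like the session below. The
paths are examples only (assuming the core landed in crmd's working
directory and the binary is at /usr/lib/heartbeat/crmd); adjust both
for your install.

```
# example paths; point gdb at your crmd binary and core file
gdb /usr/lib/heartbeat/crmd core
(gdb) frame 4
(gdb) print *op
```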

In the meantime, I'll look at the logs and see what I can figure out.

> Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> Exiting untracked process process 19654 dumped core
> Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
> mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
>
> The cluster looks like this, unchanged for several hours:
>
> ============
> Last updated: Thu Apr 20 04:43:47 2006
> Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> 3 Nodes configured.
> 3 Resources configured.
> ============
>
> Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
>
> Resource Group: group_1
>     IPaddr_1    (heartbeat::ocf:IPaddr):        Started sapcl03
>     LVM_2       (heartbeat::ocf:LVM):   Stopped
>     Filesystem_3        (heartbeat::ocf:Filesystem):    Stopped
> Resource Group: group_2
>     IPaddr_2    (heartbeat::ocf:IPaddr):        Started sapcl02
>     LVM_3       (heartbeat::ocf:LVM):   Started sapcl02
>     Filesystem_4        (heartbeat::ocf:Filesystem):    Started sapcl02
> Resource Group: group_3
>     IPaddr_3    (heartbeat::ocf:IPaddr):        Started sapcl03
>     LVM_4       (heartbeat::ocf:LVM):   Started sapcl03
>     Filesystem_5        (heartbeat::ocf:Filesystem):    Started sapcl03
>
> And:
>
> sapcl01# crmadmin -S sapcl01
> Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
>
> All processes are still running on this node, but heartbeat seems
> to be in some kind of limbo.
>
> Cheers,
>
> Dejan
>
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
>
>
>