On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > Running CTS with HEAD hung the cluster after crmd dumped core
> > (abort).  It happened after 53 tests, with this curious message:
> >
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > lrm.c:349: fsa_our_dc_version != NULL
>
> We have two kinds of asserts... neither is supposed to happen, and
> both create a core file so that we can diagnose how we got there.
> However, the non-fatal ones call fork() first (so the main process
> doesn't die) and then take some recovery action.
>
> Sometimes the non-fatal variety is used in new pieces of code to make
> sure the code behaves as we expect, and that is what has happened here.
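
To illustrate what "non-fatal" means here, a minimal sketch of the
fork-then-abort pattern (simplified for illustration only, not the
actual crm_abort()/CRM_DEV_ASSERT code in the tree; the trigger in
main() just mirrors the condition from Dejan's log):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Non-fatal assert: a forked child calls abort() to leave a core file
 * for later analysis, while the parent logs the failure and carries on
 * so it can attempt recovery. */
static void
nonfatal_assert(int condition, const char *expr, const char *file, int line)
{
    if (condition) {
        return;                          /* assertion holds */
    }

    fprintf(stderr, "Triggered non-fatal assert at %s:%d: %s\n",
            file, line, expr);

    if (fork() == 0) {
        abort();                         /* child dumps core */
    }
    /* parent falls through and takes whatever recovery action it can */
}

#define NONFATAL_ASSERT(expr) \
    nonfatal_assert((expr) != 0, #expr, __FILE__, __LINE__)

int main(void)
{
    const char *fsa_our_dc_version = NULL;   /* hypothetical trigger */
    NONFATAL_ASSERT(fsa_our_dc_version != NULL);
    printf("main process still running after the assert fired\n");
    return 0;
}
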
>
> Do you still have the core file?
> I'd be interested to know the result of:
>    print *op
> from frame #4
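
For reference, getting at that with gdb would look roughly like the
following (the crmd binary path and core-file name are assumptions
about your install, so adjust as needed):

   gdb /usr/lib/heartbeat/crmd core
   (gdb) frame 4
   (gdb) print *op
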
>
> In the meantime, I'll look at the logs and see what I can figure out.

There is also a script Alan wrote to easily extract test data:
   /usr/lib/heartbeat/cts/extracttests.py

Can you tell me what test was being performed at the time you hit the assert?
Do you have logs from further back?

>
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > Exiting untracked process process 19654 dumped core
> > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
> > mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> >
> > The cluster looks like this, unchanged for several hours:
> >
> > ============
> > Last updated: Thu Apr 20 04:43:47 2006
> > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > 3 Nodes configured.
> > 3 Resources configured.
> > ============
> >
> > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> >
> > Resource Group: group_1
> >     IPaddr_1    (heartbeat::ocf:IPaddr):        Started sapcl03
> >     LVM_2       (heartbeat::ocf:LVM):   Stopped
> >     Filesystem_3        (heartbeat::ocf:Filesystem):    Stopped
> > Resource Group: group_2
> >     IPaddr_2    (heartbeat::ocf:IPaddr):        Started sapcl02
> >     LVM_3       (heartbeat::ocf:LVM):   Started sapcl02
> >     Filesystem_4        (heartbeat::ocf:Filesystem):    Started sapcl02
> > Resource Group: group_3
> >     IPaddr_3    (heartbeat::ocf:IPaddr):        Started sapcl03
> >     LVM_4       (heartbeat::ocf:LVM):   Started sapcl03
> >     Filesystem_5        (heartbeat::ocf:Filesystem):    Started sapcl03
> >
> > And:
> >
> > sapcl01# crmadmin -S sapcl01
> > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> >
> > All processes are still running on this node, but heartbeat seems
> > to be in some kind of limbo.
> >
> > Cheers,
> >
> > Dejan
> >
> >
