On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > Running CTS with HEAD hung the cluster after crmd dumped core
> > > (abort).  It happened after 53 tests with this curious message:
> > >
> > > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > > lrm.c:349: fsa_our_dc_version != NULL
> >
> > We have two kinds of asserts... neither is supposed to happen, and
> > both create a core file so that we can diagnose how we got there.
> > However, the non-fatal ones call fork() first (so the main process
> > doesn't die) and then take some recovery action.
> >
> > Sometimes the non-fatal varieties are used in new pieces of code to
> > make sure they work as we expect, and that is what has happened here.
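
For the curious, here's roughly how a non-fatal assert like that can
work.  This is just a sketch to illustrate the mechanism, not the
actual crmd code, and the macro name is made up:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Illustrative only: an assert that still produces a core file
     * without killing the daemon.  If the condition fails we fork;
     * the child calls abort() and dumps core for later analysis,
     * while the parent logs the problem and carries on with whatever
     * recovery action is appropriate. */
    #define NON_FATAL_ASSERT(expr) do {                              \
            if (!(expr)) {                                           \
                if (fork() == 0) {                                   \
                    abort();   /* child dumps core and exits */      \
                }                                                    \
                fprintf(stderr,                                      \
                        "Triggered non-fatal assert at %s:%d: %s\n", \
                        __FILE__, __LINE__, #expr);                  \
                /* parent: recovery action goes here */              \
            }                                                        \
        } while (0)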
> >
> > Do you still have the core file?
> > I'd be interested to know the result of:
> >    print *op
> > from frame #4
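
If you're not familiar with gdb, something along these lines should do
it.  The paths are just examples and will differ on your system; the
core file name in particular is a guess:

    # gdb /usr/lib/heartbeat/crmd core.17937
    (gdb) bt
    (gdb) frame 4
    (gdb) print *op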
> >
> > In the meantime, I'll look at the logs and see what I can figure out.
>
> There is also a script Alan wrote to easily extract test data:
>    /usr/lib/heartbeat/cts/extracttests.py
>
> Can you tell me what test was being performed at the time you hit the assert?
> Do you have logs from further back?

ok, i "found" the logs... my log reader was trying to be helpful :-/

Two problems here:
1) One of the resource actions took longer than one of our internal timers.
2) As a result of 1), the assert went off.

To address 2), I've taken a slightly different approach to that part of
the code; it will be in CVS shortly.

We appear to recover OK from 1), so I'm leaving the timer there but
doubling its interval.  This timer is not supposed to go off in the
first place, so increasing it should be safe.
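
In case it helps to picture it, the kind of change involved is roughly
the following -- purely illustrative, with made-up names and values
rather than the actual crmd code:

    #include <glib.h>

    /* Illustrative only: a "safety net" timer that is never expected
     * to fire during normal operation.  Doubling the interval just
     * gives slow resource actions more headroom before it pops. */
    #define ACTION_TIMEOUT_MS (2 * 60000)    /* previously 60000 */

    static gboolean
    action_timer_popped(gpointer user_data)
    {
        g_warning("Internal action timer popped - recovering");
        /* recovery action would go here */
        return FALSE;    /* one-shot timer */
    }

    static void
    start_action_timer(void)
    {
        g_timeout_add(ACTION_TIMEOUT_MS, action_timer_popped, NULL);
    }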

> > > sapcl01# crmadmin -S sapcl01
> > > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

The only node I see exiting in the logs is sapcl02, which was stopped by CTS.

> > > All processes are still running on this node, but heartbeat seems
> > > to be in some kind of limbo.

I see this in the logs:

Apr 19 17:48:01 sapcl01 crmd: [17937]: info:
mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE
-> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=do_msg_route
]

So to me it looks like everything is back on track, no?
