Hi,

On Thu, Apr 20, 2006 at 10:58:05AM +0200, Andrew Beekhof wrote:
> On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > > On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > Hello,
> > > >
> > > > Running CTS with HEAD hanged the cluster after crmd dumped core
> > > > (abort).  It happened after 53 tests with this curious message:
> > > >
> > > > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > > > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > > > lrm.c:349: fsa_our_dc_version != NULL
> > >
> > > We have two kinds of asserts... neither is supposed to happen, and
> > > both create a core file so that we can diagnose how we got there.
> > > However, non-fatal ones call fork first (so the main process doesn't
> > > die) and then take some recovery action.
> > >
> > > Sometimes the non-fatal varieties are used in new pieces of code to
> > > make sure they work as we expect and that is what has happened here.
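
OK, so if I read that right, the non-fatal kind boils down to
something like this (just my sketch of the idea; the function name is
made up and this is not the actual crm code):

    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* fork so that the child can abort() and leave a core,
     * while the parent carries on with the recovery action */
    static void
    non_fatal_assert_failed(void)
    {
        pid_t pid = fork();

        if (pid == 0) {
            abort();                /* child dumps core */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);  /* parent reaps the child... */
        }
        /* ...and then continues with whatever recovery applies */
    }

i.e. the core is there for diagnosis, but the crmd itself keeps
running.
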
> > >
> > > Do you still have the core file?
> > > I'd be interested to know the result of:
> > >    print *op
> > > from frame #4
> > >
> > > In the meantime, I'll look at the logs and see what I can figure out.
> >
> > There is also a script Alan wrote to easily extract test data:
> >    /usr/lib/heartbeat/cts/extracttests.py

Ha! This is cool. I was thinking about writing one myself :)
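
As for the core: if it's still around, getting *op from frame #4
should be just something like this (assuming the crmd binary lives
under /usr/lib/heartbeat/; adjust the path if not):

    # gdb /usr/lib/heartbeat/crmd core
    (gdb) frame 4
    (gdb) print *op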

> Two problems here:
> 1) one of the resource actions took longer than one of our internal timers

Hmm. All resources are sort of light: an IP address, a volume
on the failover storage, and the corresponding journaled
filesystem. Strange that the timer went off.

> 2) as a result of 1) the assert went off
> 
> To address 2), I've taken a slightly different approach to that part
> of the code; it will be in CVS shortly.
> 
> We appear to recover OK from 1), so I'm leaving the timer there but
> doubling its interval.  This timer is not supposed to go off in the
> first place, so increasing it should be safe.
> 
> > > > sapcl01# crmadmin -S sapcl01
> > > > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> 
> The only node I see exiting in the logs is sapcl02, which was stopped by CTS.

Yes, the test was:

Apr 19 17:42:38 lingws CTS: Running test NearQuorumPoint (sapcl02) [53]
Apr 19 17:42:38 lingws CTS: start nodes:['sapcl01', 'sapcl03']
Apr 19 17:42:38 lingws CTS: stop nodes:['sapcl02']

However, the DC (sapcl01) went berserk.

> > > > All processes are still running on this node, but heartbeat seems
> > > > to be in some kind of limbo.
> 
> I see this in the logs:
> 
> Apr 19 17:48:01 sapcl01 crmd: [17937]: info:
> mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE
> -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=do_msg_route
> ]
> 
> So to me it looks like everything is back on track, no?

No. The status shown by crm_mon in one of my previous messages
remained unchanged for many hours (10 or so). The cluster was
basically stalled.

However, after shutting everything down and starting CTS from
scratch, the very same HEAD code ran perfectly OK:

Apr 20 05:18:31 >>>>>>>>>>>>>>>> BEGINNING 200 TESTS
...
Apr 20 11:01:50 Overall Results:{'failure': 0, 'success': 200, 'BadNews': 1876}

Apart from some pengine problems before the first test (it looks like
some resources were left running after the shutdown, so they
eventually appeared to be running on two nodes), six monitor
operation failures with exit code OCF_NOT_RUNNING (why one per hour? ;-),
and tons of spurious messages of this kind from the LVM RA:

Apr 20 05:21:40 BadNews: Apr 20 05:20:22 sapcl02 LVM[7867]: [7967]: ERROR: LVM 
Volume /dev/data03vg is offline

everything else went fine.

Obviously, the core dump was triggered by some unusual
circumstances.

Cheers,

Dejan
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
