Re: [Linux-HA] the return code of failing start action

Dejan Muhamedagic Tue, 09 Oct 2007 03:45:38 -0700

Hi,

On Tue, Oct 09, 2007 at 08:53:15AM +0200, Andrew Beekhof wrote:
> dejan, can you take a look at this pls?
> rc for an operation seems to be changing in the lrmd somehow


The last time a start of prmDummy occurred in lrmd was here:

tengine[24482]: 2007/10/09_13:42:59 WARN: action_timer_callback: Timer popped (abort_level=0, complete=false)
tengine[24482]: 2007/10/09_13:42:59 WARN: print_elem: Action missed its timeout[Action 4]: In-flight (id: prmDummy_start_0, loc: prec370e, priority: 0)
lrmd[24141]: 2007/10/09_13:42:59 WARN: prmDummy:start process (PID 24564) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd[24141]: 2007/10/09_13:42:59 info: RA output: (prmDummy:start:stderr) Terminated
Dummy[24564][24582]: 2007/10/09_13:42:59 INFO: They use TERM to bring us down. No such luck.
lrmd[24141]: 2007/10/09_13:43:04 WARN: prmDummy:start process (PID 24564) timed out (try 2).  Killing with signal SIGKILL (9).
lrmd[24141]: 2007/10/09_13:43:04 WARN: Exiting prmDummy:start process 24564 killed by signal 9 [SIGKILL - Kill, unblockable].
lrmd[24141]: 2007/10/09_13:43:04 WARN: operation start[3] on ocf::Dummy::prmDummy for client 24144, its parameters: CRM_meta_id=[opDummyStart] delay=[1] CRM_meta_timeout=[10000] crm_feature_set=[2.0] CRM_meta_name=[start] : pid [24564] timed out
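
For readers who want to reproduce the pattern visible in the log (timeout, SIGTERM, then SIGKILL five seconds later when the process ignores TERM), here is a minimal sketch. It is not lrmd's code; the function name run_with_escalation and its parameters are made up for illustration, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_with_escalation(cmd, timeout, grace):
    """Run cmd and escalate signals if it overruns (hypothetical helper).

    Waits up to `timeout` seconds, then sends SIGTERM, waits up to
    `grace` more seconds, then sends SIGKILL.
    Returns (returncode, timed_out).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout), False
    except subprocess.TimeoutExpired:
        pass  # first timeout: try the polite signal

    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace), True
    except subprocess.TimeoutExpired:
        pass  # the process ignored SIGTERM, as Dummy does in the log

    proc.kill()  # SIGKILL cannot be caught or ignored
    return proc.wait(), True


# A child that ignores SIGTERM, standing in for the Dummy RA's start action.
STUBBORN_CHILD = [
    sys.executable,
    "-c",
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(30)",
]
```

Calling run_with_escalation(STUBBORN_CHILD, timeout=2, grace=1) should report a timeout and a process killed by signal 9. Python reports death by signal as a negative return code, which is unrelated to the -1/-2 values discussed in this thread.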

Thanks,

Dejan

> On 10/9/07, Junko IKEDA <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > when I tried the following case,
> > > > the return code of start action was something strange.
> > > >
> > > > 1) There are two node; active and standby node
> > > > 2) one resource is running on the active node
> > > > 3) SplitBrain came up!
> > >
> > > you created a split brain or it occurred on its own?
> >
> > I created it on purpose.
> 
> ok
> 
> >
> > > > 4) the resource would be going to start on the both node,
> > >
> > > you dont have stonith configured right?
> > >
> > > because this is exactly the reason why two-node clusters, particularly
> > > ones without stonith configured are a seriously bad idea.
> > >
> > > at least configure pingd so that only one side will try and run the resources
> >
> > There is no stonith configuration for now.
> > This might sound strange, but we are testing some worst cases without
> > stonith.
> 
> just a little... it reminds me of the old joke:
> 
> patient: doctor, doctor, it hurts when i do this!
> doctor: well, dont do that then
> 
> 
> is the concern that some part of the stonith setup will fail and you
> want to see how the cluster behaves without it?
> 
> otherwise i confess i dont see the point.
> 
> > It's sure that stonith can help this situation if it's configured.
> >
> > > >    I drive it into failure on purpose on the standby node.
> > > >    so, the return code of start action would be -1 on standby.
> > > >    (it worked well)
> > >
> > > -1 means "timed out"... thats not a good value to return from an RA
> >
> > sorry for the lacking of talk...
> > I created it on purpose, too.
> > I wanted to know how heartbeat would work if an RA went into "timed out".
> 
> ah
> 
> > > the whole concept of trying to handle this is in a resource's start
> > > action is a horrible substitute for a correctly configured cluster.
> > > continuing down this path will only lead to pain.
> > >
> > > > 5) after recovering SplitBrain, the return code on standby node was "-2"...
> > > >    and crm_mon on the active node also showed it as -2.
> > > >
> > > > Why is it incremented?
> > >
> > > i'm not sure i follow this anymore... which return code are you talking about?
> > > if you're talking about the one from the start action, it is never
> > > modified in any way
> >
> > the return code for "timed out" (maybe) became -2 after recovering from
> > SplitBrain.
> > It was -1 first.
> 
> how odd
> 
> > I tried to gather the log files with hb_report and attached it.
> >
> > build_operation_update() said like this;
> >
> > debug: build_operation_update: Calculated digest
> > e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> > (2:-1;4:3:22520a1d-c026-4941-a403-717fc054c2c3)
> >
> > ...
> >
> > debug: build_operation_update: Calculated digest
> > e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> > (2:-2;4:3:22520a1d-c026-4941-a403-717fc054c2c3)
> 
> in that case we're just using the value supplied by the lrm
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
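
The two build_operation_update strings quoted above can be compared field by field to confirm that only one number changed. The sketch below assumes the layout is "status:rc;action:transition:uuid". That layout and the field names are my reading of the string, not something the log itself states, so treat them as hypothetical labels.

```python
from typing import NamedTuple


class OpMagic(NamedTuple):
    # Field names are assumptions made for illustration only.
    status: int
    rc: int
    action: int
    transition: int
    uuid: str


def parse_magic(magic: str) -> OpMagic:
    """Split a string such as '(2:-1;4:3:<uuid>)' into its parts."""
    result, key = magic.strip("()").split(";", 1)
    status, rc = result.split(":")
    action, transition, uuid = key.split(":", 2)
    return OpMagic(int(status), int(rc), int(action), int(transition), uuid)


def changed_fields(a: OpMagic, b: OpMagic) -> list:
    """Return the names of the fields whose values differ between a and b."""
    return [name for name in OpMagic._fields if getattr(a, name) != getattr(b, name)]


before = parse_magic("(2:-1;4:3:22520a1d-c026-4941-a403-717fc054c2c3)")
after = parse_magic("(2:-2;4:3:22520a1d-c026-4941-a403-717fc054c2c3)")
```

With the two strings from the log, changed_fields(before, after) lists only the second field, the one that went from -1 to -2. Everything else, including the UUID, is identical.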
