Hi,

On Tue, Oct 09, 2007 at 08:53:15AM +0200, Andrew Beekhof wrote:
> dejan, can you take a look at this pls?
> rc for an operation seems to be changing in the lrmd somehow

The last start of prmDummy in lrmd occurred here:

tengine[24482]: 2007/10/09_13:42:59 WARN: action_timer_callback: Timer popped (abort_level=0, complete=false)
tengine[24482]: 2007/10/09_13:42:59 WARN: print_elem: Action missed its timeout[Action 4]: In-flight (id: prmDummy_start_0, loc: prec370e, priority: 0)
lrmd[24141]: 2007/10/09_13:42:59 WARN: prmDummy:start process (PID 24564) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[24141]: 2007/10/09_13:42:59 info: RA output: (prmDummy:start:stderr) Terminated
Dummy[24564][24582]: 2007/10/09_13:42:59 INFO: They use TERM to bring us down. No such luck.
lrmd[24141]: 2007/10/09_13:43:04 WARN: prmDummy:start process (PID 24564) timed out (try 2). Killing with signal SIGKILL (9).
lrmd[24141]: 2007/10/09_13:43:04 WARN: Exiting prmDummy:start process 24564 killed by signal 9 [SIGKILL - Kill, unblockable].
lrmd[24141]: 2007/10/09_13:43:04 WARN: operation start[3] on ocf::Dummy::prmDummy for client 24144, its parameters: CRM_meta_id=[opDummyStart] delay=[1] CRM_meta_timeout=[10000] crm_feature_set=[2.0] CRM_meta_name=[start] : pid [24564] timed out

Thanks,

Dejan

> On 10/9/07, Junko IKEDA <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > when I tried the following case,
> > > > the return code of start action was something strange.
> > > >
> > > > 1) There are two node; active and standby node
> > > > 2) one resource is running on the active node
> > > > 3) SplitBrain came up!
> > >
> > > you created a split brain or it occurred on its own?
> >
> > I created it on purpose.
>
> ok
>
> > > > 4) the resource would be going to start on the both node,
> > >
> > > you dont have stonith configured right?
> > >
> > > because this is exactly the reason why two-node clusters, particularly
> > > ones without stonith configured are a seriously bad idea.
> > >
> > > at least configure pingd so that only one side will try and run the resources
> >
> > There is no stonith configuration for now.
> > This might sound strange, but we are testing some worst cases without
> > stonith.
>
> just a little... it reminds me of the old joke:
>
> patient: doctor, doctor, it hurts when i do this!
> doctor: well, dont do that then
>
> is the concern that some part of the stonith setup will fail and you
> want to see how the cluster behaves without it?
>
> otherwise i confess i dont see the point.
>
> > It's sure that stonith can help this situation if it's configured.
> >
> > > > I drive it into failure on purpose on the standby node.
> > > > so, the return code of start action would be -1 on standby.
> > > > (it worked well)
> > >
> > > -1 means "timed out"... thats not a good value to return from an RA
> >
> > sorry for the lacking of talk...
> > I created it on purpose, too.
> > I wanted to know how heartbeat would work if an RA went into "timed out".
>
> ah
>
> > > the whole concept of trying to handle this is in a resource's start
> > > action is a horrible substitute for a correctly configured cluster.
> > > continuing down this path will only lead to pain.
> > >
> > > > 5) after recovering SplitBrain, the return code on standby node was "-2"...
> > > > and crm_mon on the active node also showed it as -2.
> > > >
> > > > Why is it incremented?
> > >
> > > i'm not sure i follow this anymore... which return code are you talking about?
> > > if you're talking about the one from the start action, it is never
> > > modified in any way
> >
> > the return code for "timed out" (maybe) became -2 after recovering from
> > SplitBrain.
> > It was -1 first.
>
> how odd
>
> > I tried to gather the log files with hb_report and attached it.
> >
> > build_operation_update() said like this;
> >
> > debug: build_operation_update: Calculated digest
> > e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> > (2:-1;4:3:22520a1d-c026-4941-a403-717fc054c2c3)
> >
> > ...
> >
> > debug: build_operation_update: Calculated digest
> > e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> > (2:-2;4:3:22520a1d-c026-4941-a403-717fc054c2c3)
>
> in that case we're just using the value supplied by the lrm

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
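
For readers following the lrmd log excerpt in this message: the start operation hit its 10000 ms timeout, the RA process was sent SIGTERM, it ignored TERM, and five seconds later it was sent SIGKILL. The Python sketch below illustrates that general escalation pattern only. It is not lrmd's code; the names run_with_escalation and STUBBORN_CHILD, and the timeout and grace values, are assumptions made up for illustration.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for an RA that ignores TERM, like the Dummy RA in the log.
STUBBORN_CHILD = [
    sys.executable, "-c",
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(30)",
]


def run_with_escalation(cmd, timeout, grace):
    """Run cmd and return (returncode, list of signal names sent).

    If cmd is still running after `timeout` seconds, send SIGTERM.
    If it is still running `grace` seconds after that, send SIGKILL.
    """
    proc = subprocess.Popen(cmd)
    sent = []
    try:
        proc.wait(timeout=timeout)
        return proc.returncode, sent
    except subprocess.TimeoutExpired:
        pass

    # First attempt: ask politely.
    sent.append("SIGTERM")
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Second attempt: SIGKILL cannot be caught or ignored.
        sent.append("SIGKILL")
        proc.send_signal(signal.SIGKILL)
        proc.wait()
    return proc.returncode, sent
```

Calling run_with_escalation(STUBBORN_CHILD, timeout=2.0, grace=0.5) on a POSIX system should send both signals, and the returned code is the negative signal number that ended the child. How lrmd maps such an outcome to the rc it reports (the -1 and -2 discussed in the thread) is not shown here.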
