dejan, can you take a look at this pls? rc for an operation seems to be changing in the lrmd somehow
On 10/9/07, Junko IKEDA <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > when I tried the following case,
> > > the return code of start action was something strange.
> > >
> > > 1) There are two node; active and standby node
> > > 2) one resource is running on the active node
> > > 3) SplitBrain came up!
> >
> > you created a split brain or it occurred on its own?
>
> I created it on purpose.

ok

> > > 4) the resource would be going to start on the both node,
> >
> > you dont have stonith configured right?
> >
> > because this is exactly the reason why two-node clusters, particularly
> > ones without stonith configured are a seriously bad idea.
> >
> > at least configure pingd so that only one side will try and run the
> > resources
>
> There is no stonith configuration for now.
> This might sound strange, but we are testing some worst cases without
> stonith.

just a little... it reminds me of the old joke:

  patient: doctor, doctor, it hurts when i do this!
  doctor: well, dont do that then

is the concern that some part of the stonith setup will fail and you
want to see how the cluster behaves without it?
otherwise i confess i dont see the point.

> It's sure that stonith can help this situation if it's configured.
>
> > > I drive it into failure on purpose on the standby node.
> > > so, the return code of start action would be -1 on standby.
> > > (it worked well)
> >
> > -1 means "timed out"... thats not a good value to return from an RA
>
> sorry for the lacking of talk...
> I created it on purpose, too.
> I wanted to know how heartbeat would work if an RA went into "timed out".

ah

> > the whole concept of trying to handle this is in a resource's start
> > action is a horrible substitute for a correctly configured cluster.
> > continuing down this path will only lead to pain.
> >
> > > 5) after recovering SplitBrain, the return code on standby node was
> > > "-2"...
> > > and crm_mon on the active node also showed it as -2.
> > >
> > > Why is it incremented?
> >
> > i'm not sure i follow this anymore... which return code are you talking
> > about?
> > if you're talking about the one from the start action, it is never
> > modified in any way
>
> the return code for "timed out" (maybe) became -2 after recovering from
> SplitBrain.
> It was -1 first.

how odd

> I tried to gather the log files with hb_report and attached it.
>
> build_operation_update() said like this;
>
> debug: build_operation_update: Calculated digest
> e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> (2:-1;4:3:22520a1d-c026-4941-a403-717fc054c2c3)
>
> ...
>
> debug: build_operation_update: Calculated digest
> e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> (2:-2;4:3:22520a1d-c026-4941-a403-717fc054c2c3)

in that case we're just using the value supplied by the lrm

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
