Re: [Linux-HA] standby does not take over on multiple power failure

Andrew Beekhof Tue, 05 Jun 2007 03:46:59 -0700

according to these logs it happened even earlier...

can i suggest running:


crm_failcount -D -r rsc_lim3 -U ha-9
crm_failcount -D -r rsc_lim8 -U ha-9

and re-testing?  I'd be very surprised if it didn't start working as a result.
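
If the test is going to be repeated a few times, the reset step could be scripted. The following is only a rough Python sketch: the helper names `failcount_cmd` and `reset_failcounts` are made up for illustration, and the only `crm_failcount` flags used are the ones already quoted in this thread (`-G`, `-D`, `-U`, `-r`). It defaults to a dry run, so nothing is executed unless you ask for it.

```python
import subprocess

def failcount_cmd(action, resource, node):
    """Build a crm_failcount argument list from the flags quoted in this thread."""
    flags = {"query": "-G", "delete": "-D"}  # hypothetical action names
    return ["crm_failcount", flags[action], "-r", resource, "-U", node]

def reset_failcounts(resources, node, dry_run=True):
    """Return the delete commands; run them only when dry_run is False."""
    cmds = [failcount_cmd("delete", rsc, node) for rsc in resources]
    if not dry_run:
        for cmd in cmds:
            # check=False: a missing attribute should not abort the loop
            subprocess.run(cmd, check=False)
    return cmds

if __name__ == "__main__":
    # Dry run: just print what would be executed on the cluster node.
    for cmd in reset_failcounts(["rsc_lim3", "rsc_lim8"], "ha-9"):
        print(" ".join(cmd))
```

Run with `dry_run=False` on a cluster node only after checking that the printed commands match what you intend to clear.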

On 6/5/07, Thomas Åkerblom (HF/EBC) <[EMAIL PROTECTED]> wrote:

OK, that would explain that part.
Here is the logfile covering that day.

BR.
/Thomas

 *** Thomas
This communication is confidential and intended solely for the addressee(s). 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
believe this message has been sent to you in error, please notify the sender by 
replying to this transmission and delete the message without disclosing it. 
Thank you.
E-mail including attachments is susceptible to data corruption, interruption, 
unauthorized amendment, tampering and viruses, and we only send and receive 
e-mails on the basis that we are not liable for any such corruption, 
interception, amendment, tampering or viruses or any consequences thereof.


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Beekhof
Sent: 5 June 2007 09:49
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] standby does not take over on multiple power failure

On 6/5/07, Thomas Åkerblom (HF/EBC) <[EMAIL PROTECTED]> wrote:
> Hi Andrew, and thank you for prompt response.
> My objection with the failure stickiness is that in my lab the rscX will get
> back to nodeY after running on the standby.
>
> Ex:
> I pull the power cord to ha-8
>         rsc_lim8 will start to run on the standby ha-9
> I insert the power cord to ha-8
>         rsc_lim8 will stop on ha-9 and start to run on ha-8
> This can be repeated.
>
> In my case the standby will get in a state where it will not start rsc_lim8
> when ha-8 goes down, ever. That happens after the procedure I described, where
> I first pull the power cord of one server, reinsert it, and pull the cord of
> another server before the first has started properly. That behavior ceased
> when I removed the failure stickiness.



resource_failure_stickiness shouldn't apply here because the entire
node failed rather than just the resource

according to the logs you attached, something other than a full node
failure also happened which caused the failcount for rsc_lim3 and
rsc_lim8 to be set to 1.

pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state:
fail-count-rsc_lim3: 1
pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state: Setting
failure stickiness for rsc_lim3 on ha-9: -1000000
pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state:
fail-count-rsc_lim8: 1
pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state: Setting
failure stickiness for rsc_lim8 on ha-9: -1000000
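
For a longer log it may be easier to pull these entries out with a small script. This is a minimal Python sketch, assuming each pengine message sits on a single line in the real log file (the mail client wrapped them above) and that the wording matches the excerpt; the function name `summarize` and both regular expressions are illustrative only.

```python
import re

# Patterns modelled on the two message shapes in the excerpt above.
FAILCOUNT_RE = re.compile(r"fail-count-(?P<rsc>\S+):\s*(?P<count>\d+)")
STICKINESS_RE = re.compile(
    r"Setting failure stickiness for (?P<rsc>\S+) on (?P<node>\S+):\s*(?P<score>-?\d+)"
)

def summarize(lines):
    """Collect fail counts per resource and stickiness per (resource, node)."""
    failcounts = {}
    stickiness = {}
    for line in lines:
        m = FAILCOUNT_RE.search(line)
        if m:
            failcounts[m.group("rsc")] = int(m.group("count"))
        m = STICKINESS_RE.search(line)
        if m:
            stickiness[(m.group("rsc"), m.group("node"))] = int(m.group("score"))
    return failcounts, stickiness

# The four messages from the excerpt, rejoined onto single lines.
SAMPLE = [
    "pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state: fail-count-rsc_lim3: 1",
    "pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state: Setting failure stickiness for rsc_lim3 on ha-9: -1000000",
    "pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state: fail-count-rsc_lim8: 1",
    "pengine[24452]: 2007/06/04_08:52:57 debug: process_rsc_state: Setting failure stickiness for rsc_lim8 on ha-9: -1000000",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```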

unfortunately the logs don't go back far enough to know what or when
that event was
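
To illustrate why a failcount of 1 is enough to keep both resources off ha-9, here is a simplified Python model. It is not the policy engine's actual code. It assumes the penalty is failcount times the failure stickiness, that scores saturate at plus or minus 1000000 (the value shown as -1000000 in the log), and that a node whose score reaches the negative limit is excluded; the base score of 100 is a made-up number.

```python
INFINITY = 1_000_000  # assumed cap, matching the -1000000 seen in the log

def clamp(score):
    """Keep a score within [-INFINITY, INFINITY]."""
    return max(-INFINITY, min(INFINITY, score))

def failure_penalty(failcount, failure_stickiness):
    """Assumed model: each recorded failure applies the stickiness once."""
    return clamp(failcount * failure_stickiness)

def add_scores(a, b):
    """Saturating addition in which -INFINITY always wins."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    return clamp(a + b)

def can_run(base_score, failcount, failure_stickiness):
    """A node is treated as excluded once its score hits -INFINITY."""
    total = add_scores(base_score, failure_penalty(failcount, failure_stickiness))
    return total > -INFINITY

if __name__ == "__main__":
    # failcount 1 with -INFINITY stickiness: node excluded
    print(can_run(100, 1, -INFINITY))
    # after the failcount is cleared: node eligible again
    print(can_run(100, 0, -INFINITY))
```

Under these assumptions, clearing the failcount is what makes the node eligible again, which is why the reset is worth trying before anything else.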

>
> BR.
> /Thomas
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Beekhof
> Sent: 4 June 2007 13:20
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] standby does not take over on multiple power failure
>
> On 6/4/07, Thomas Åkerblom (HF/EBC) <[EMAIL PROTECTED]> wrote:
> > Hi Andrew.
> > I'm using 2.0.8-0.15, but I have seen the same behavior in 2.0.7.
> > In this case ha-9 is DC and also the standby server.
> > ha-8 has no power, but the standby server does not take over.
> > The logs begin right before I pulled the power cord.
> >
> > Actually I do know how to get around this problem now, but I also have some
> > new questions.
> > If I remove the line:
> > <nvpair id="default_resource_failure_stickiness" name="default_resource_failure_stickiness" value="-INFINITY"/>
> > from the cib file, the problem disappears.
> > I wouldn't expect that parameter to have this effect, rather the opposite.
> > Is this a known/expected correlation?
>
> not so much "correlation" as "that's what it's designed to do".
>
> setting default_resource_failure_stickiness=-INFINITY means that if
> heartbeat finds the rscX as failed on nodeY, then never ever consider
> nodeY as a valid place to run rscX ever again... at least not until
> the admin "clears" the error by resetting the failcount.
>
> in the future we'll expire the failures after "a period of time" but
> that is not yet implemented as the lrm doesn't provide the information
> to do so.
>
> > I would like to set that parameter in order to be able to use the failure
> > counters.
> > Furthermore I am not able to read and reset the counters using:
> >
> > crm_failcount -G -U ha-8 -r rsc_lim8
> >         The result is always 0
> >
> > crm_failcount -D -U ha-8 -r rsc_lim8
> >         Error performing operation: The object/attribute does not exist.
>
> later versions return 0 instead of "The object/attribute does not exist."
>
> updated packages for most distros/platforms are available at:
>    http://software.opensuse.org/download/server:/ha-clustering/
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
