On Nov 13, 2007, at 5:17 PM, Anders Brownworth wrote:

Thanks for the quick response, Andrew.

'crm_resource -C -r OpenSer' seems to work but I do get an error about last-lrm-refresh not being able to be set:

Nov 13 14:00:12 box01 crm_resource: [11391]: ERROR: cib_native_perform_op: Call failed: The object/attribute does not exist
Nov 13 14:00:12 box01 crm_resource: [11391]: ERROR: update_attr: Error setting last-lrm-refresh=1194962406 (section=crm_config, set=cib-bootstrap-options): The object/attribute does not exist

This shouldn't be important.
What version are you running?

The resource does, however, fail back when I do that AND set the fail-count to 0 on the primary and backup.

But the resource won't fail back unless fail-count is defined on the backup. The fail-count is initially undefined:

(box01:~) # crm_failcount -G -r OpenSer -U box02
name=fail-count-OpenSer value=(null)
Error performing operation: The object/attribute does not exist

Because the service previously failed to start on the primary (box01), the fail-count is defined there. Once I define the fail-count on the backup (box02)

(box01:~) # crm_failcount -v 0 -r OpenSer -U box02
(box01:~) # crm_failcount -G -r OpenSer -U box02
name=fail-count-OpenSer value=0

it migrates back as expected.

That's really weird (and looks like a bug).
Can you try with a later version?

Unless it's not important what the update contains and it just matters that there is one... so the TE gets triggered and does the migration.

That's what the "last-lrm-refresh" code above is supposed to be doing. If that isn't working, it could cause this kind of behavior.

I suppose I should add a "set fail-count to 0" step for both box01 and box02 to my startup scripts, so that merely doing a 'crm_resource -C -r OpenSer' migrates the service back after the initial failure.
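A minimal sketch of such a startup hook, using the node names (box01, box02) and resource name (OpenSer) from this thread; the crm_failcount and crm_resource invocations are the same ones used above, but the script itself is only an illustration, not a tested snippet:

```shell
#!/bin/sh
# Hypothetical startup hook: clear stale failure state for the OpenSer
# resource on both nodes, so a plain cleanup is enough to fail back.
RESOURCE=OpenSer

for NODE in box01 box02; do
    # Explicitly set the fail-count to 0 (defining it if it was
    # previously undefined), matching the workaround described above.
    crm_failcount -v 0 -r "$RESOURCE" -U "$NODE"
done

# Re-probe the resource so the CRM notices the clean state.
crm_resource -C -r "$RESOURCE"
```

This only runs usefully on a node with a live Heartbeat CRM stack, so it is shown as a fragment rather than something to execute standalone.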

Is there a better way to be doing this?

-Anders

Andrew Beekhof wrote:
prior to the latest interim build, start failures were always fatal and required the use of crm_resource -C to make the node eligible again.

as of the last interim release, just make sure start-failure-is-fatal=false and use crm_failcount as you have below for "normal" failures.
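For reference, a cluster option like this would typically be set in the crm_config section of the CIB; one way to do that from the shell is crm_attribute, though whether this exact invocation applies to your interim build is an assumption:

```shell
# Hypothetical: set the cluster-wide option Andrew mentions.
# -t selects the crm_config section, -n the option name, -v its value.
crm_attribute -t crm_config -n start-failure-is-fatal -v false
```

This is a configuration fragment against a running cluster, so it is not runnable outside that context.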

Additionally, I followed the advice under "Resetting Failure Counts" in the V2 FAQ ( http://linux-ha.org/v2/faq ) where it suggests:

crm_failcount -D -U nodeA -r my_rsc

Rather than resetting the failure count to 0, this deletes the attribute entirely, so you can't even read it with the query command given in the next step of the same example. I found that statically setting the count back to 0 with:

crm_failcount -v 0 -U box01 -r OpenSer

worked much better and allowed me to push resources back and forth just by moving the fail count up and down.
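Since the thread reports that resources can be pushed back and forth just by moving the fail count, here is a hedged sketch of what that looks like; the specific value used to push the resource away is an arbitrary illustration, and how a high fail-count interacts with failure stickiness depends on your cluster configuration:

```shell
# Hypothetical illustration of "moving the fail count up and down".
# Raise the fail-count on box01 to make the resource move away...
crm_failcount -v 1000000 -r OpenSer -U box01

# ...then reset it to 0 to allow the resource back onto box01.
crm_failcount -v 0 -r OpenSer -U box01
```

As with the other fragments, these commands only make sense against a live Heartbeat CRM cluster.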

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
