On Nov 13, 2007, at 12:15 AM, Anders Brownworth wrote:
Hi,
I have a primary / backup v2.0.8 setup monitoring OpenSer and 2 IP
addresses.
If I make a mistake in a config file for a resource that is being
controlled by Linux-HA (OpenSer) and for whatever reason the
resource dies and a restart is attempted, the restart will fail and
the resource will migrate to the backup node as expected. However,
once I fix the problem so the resource could start again on the
primary, I can never get Linux-HA to migrate the resource back.
I don't think this has anything to do with scoring, because when I
don't break my config files and manually kill the service 13 times
on box01 (the reason for 13 is in my included cib.xml), the resources
migrate from box01 to box02 as expected. Setting the fail count
back below 13 causes the service to migrate back, also as expected.
However, when I try to fail back to a system whose previously broken
OpenSer config files have now been fixed, I can't get the resources to
come back no matter how low I set the fail count. Is there another
variable or INFINITY constraint somewhere that gets set when a
resource fails to start that makes the resources stay away? What can
I do when I want Linux-HA to retry migration of the service back to
a recently hand-fixed primary?
Prior to the latest interim build, start failures were always fatal and
required the use of crm_resource -C to make the node eligible again.
As of the last interim release, just make sure
start-failure-is-fatal=false is set, then use crm_failcount as you have
below for "normal" failures.
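As a rough illustration of the setting mentioned above, the option would
presumably sit alongside the other cluster properties in crm_config. This
is only a sketch: it assumes the interim build reads start-failure-is-fatal
as a cluster-wide nvpair, and the nvpair id is made up for the example,
following the naming pattern of the CIB quoted below.

```xml
<!-- Sketch only: assumes start-failure-is-fatal is read as a cluster property. -->
<crm_config>
  <cluster_property_set id="cluster-property-set">
    <attributes>
      <!-- Hypothetical id; the name and value come from the advice above. -->
      <nvpair id="start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```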
Additionally, I followed the advice under "Resetting Failure Counts"
in the V2 FAQ ( http://linux-ha.org/v2/faq ) where it suggests:
crm_failcount -D -U nodeA -r my_rsc
Rather than reset the failure count, this just torches it in such a
way that you can't even read it with the query command given in the
next step of the same example. I found that statically setting the count
back to 0 with:
crm_failcount -v 0 -U box01 -r OpenSer
worked much better and allowed me to push resources back and forth
just by moving the fail count up and down.
Thanks.
-Anders
<cib admin_epoch="1" have_quorum="true" num_peers="1"
     cib_feature_revision="1.3" ignore_dtd="false" ccm_transition="3"
     generated="true" dc_uuid="9052abe5-87ee-4400-a008-c5f13205e94b"
     epoch="15" num_updates="606"
     cib-last-written="Mon Nov 12 22:37:10 2007">
  <configuration>
    <crm_config>
      <cluster_property_set id="cluster-property-set">
        <attributes>
          <nvpair id="short_resource_names"
                  name="short_resource_names" value="true"/>
          <nvpair id="pe-input-series-max"
                  name="pe-input-series-max" value="-1"/>
          <nvpair id="default-resource-stickiness"
                  name="default-resource-stickiness" value="10"/>
          <nvpair id="default-resource-failure-stickiness"
                  name="default-resource-failure-stickiness" value="-10"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="9052abe5-87ee-4400-a008-c5f13205e94b" uname="box01"
            type="normal"/>
      <node id="47658455-4da2-48d4-a8da-419b2f93f039" uname="box02"
            type="normal"/>
    </nodes>
    <resources>
      <group id="IPaddr2_OpenSer_group">
        <primitive id="IPaddr2-10.1.53.235" class="ocf"
                   type="IPaddr2" provider="heartbeat">
          <operations>
            <op id="ipaddr2-10.1.53.235-monitor" name="monitor"
                interval="5s" timeout="3s"/>
          </operations>
          <instance_attributes id="IPaddr2-10.1.53.235-attributes">
            <attributes>
              <nvpair id="ipaddr2-10.1.53.235-ip" name="ip"
                      value="10.1.53.235"/>
              <nvpair id="ipaddr2-10.1.53.235-broadcast"
                      name="broadcast" value="10.1.53.255"/>
              <nvpair id="ipaddr2-10.1.53.235-cidr_netmask"
                      name="cidr_netmask" value="24"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="IPaddr2-10.1.53.236" class="ocf"
                   type="IPaddr2" provider="heartbeat">
          <operations>
            <op id="ipaddr2-10.1.53.236-monitor" name="monitor"
                interval="5s" timeout="3s"/>
          </operations>
          <instance_attributes id="IPaddr2-10.1.53.236-attributes">
            <attributes>
              <nvpair id="ipaddr2-10.1.53.236-ip" name="ip"
                      value="10.1.53.236"/>
              <nvpair id="ipaddr2-10.1.53.236-broadcast"
                      name="broadcast" value="10.1.53.255"/>
              <nvpair id="ipaddr2-10.1.53.236-cidr_netmask"
                      name="cidr_netmask" value="24"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="OpenSer" class="ocf" type="OpenSer"
                   provider="bandwidth.com">
          <operations>
            <op id="openser-start" name="start" timeout="5s"/>
            <op id="openser-stop" name="stop" timeout="3s"/>
            <op id="openser-monitor" name="monitor" interval="10s"
                timeout="3s">
              <instance_attributes id="monitor_10s">
                <attributes>
                  <nvpair id="openser-monitor-ip" name="ip"
                          value="127.0.0.1"/>
                </attributes>
              </instance_attributes>
            </op>
          </operations>
        </primitive>
      </group>
    </resources>
    <constraints>
      <rsc_location id="OpenSer_resource_location" rsc="OpenSer">
        <rule id="rule_box01" score="100">
          <expression id="expression_uname_eq_box01"
                      attribute="#uname" operation="eq" value="box01"/>
        </rule>
        <rule id="rule_box02" score="10">
          <expression id="expression_uname_eq_box02"
                      attribute="#uname" operation="eq" value="box02"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
</cib>
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems