Re: [Linux-HA] Failure to start resource makes it impossible to fail back

Andrew Beekhof Mon, 12 Nov 2007 23:37:31 -0800


On Nov 13, 2007, at 12:15 AM, Anders Brownworth wrote:

Hi,
I have a primary / backup v2.0.8 setup monitoring OpenSer and 2 IP addresses.
If I make a mistake in a config file for a resource that is being controlled by Linux-HA (OpenSer) and for whatever reason the resource dies and a restart is attempted, the restart will fail and the resource will migrate to the backup node as expected. However, once I fix the problem so the resource could start again on the primary, I can never get Linux-HA to migrate the resource back.
I don't think this has anything to do with scoring because when I don't break my config files and manually kill the service 13 times on box01 (the reason for 13 is in my included cib.xml) the resources migrate from box01 to box02 as expected. Setting the fail count back below 13 causes the service to migrate back, also as expected.
However, trying to fail back to a system that previously had broken OpenSer config files that have now been fixed, I can't get them to come back no matter how low I set the fail count. Is there another variable or INFINITY constraint somewhere that gets set when a resource fails to start that makes the resources stay away? What can I do when I want Linux-HA to re-try migration of the service back to a recently hand-fixed primary?

prior to the latest interim build, starts were always fatal and required the use of crm_resource -C to make the node eligible again.

as of the last interim release, just make sure start-failure-is-fatal=false and use crm_failcount as you have below for "normal" failures.
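
For illustration, a minimal sketch of how that option might be set. It assumes start-failure-is-fatal is a cluster-wide property stored in crm_config next to the other options in the quoted cib.xml, and the nvpair id is only chosen to match the naming used there. Builds older than the interim release mentioned above would presumably not honor it.

```xml
<crm_config>
  <cluster_property_set id="cluster-property-set">
    <attributes>
      <!-- Assumption: set as a cluster property, like the existing options in the quoted CIB. -->
      <!-- The id below is hypothetical; it mirrors the id/name convention of that CIB. -->
      <nvpair id="start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```

With this set to false, a failed start should be counted like any other failure, so the crm_failcount commands quoted below remain the way to make the node eligible again.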

Additionally, I followed the advice under "Resetting Failure Counts" in the V2 FAQ ( http://linux-ha.org/v2/faq ) where it suggests:
crm_failcount -D -U nodeA -r my_rsc
Rather than reset the failure count, this just torches it in such a way that you can't even read it with the query command given in the next step of the same example. I found statically setting the count back to 0 with:
crm_failcount -v 0 -U box01 -r OpenSer
worked much better and allowed me to push resources back and forth just by moving the fail count up and down.
Thanks.

-Anders
<cib admin_epoch="1" have_quorum="true" num_peers="1" cib_feature_revision="1.3" ignore_dtd="false" ccm_transition="3" generated="true" dc_uuid="9052abe5-87ee-4400-a008-c5f13205e94b" epoch="15" num_updates="606" cib-last-written="Mon Nov 12 22:37:10 2007">
 <configuration>
   <crm_config>
     <cluster_property_set id="cluster-property-set">
       <attributes>
         <nvpair id="short_resource_names" name="short_resource_names" value="true"/>
         <nvpair id="pe-input-series-max" name="pe-input-series-max" value="-1"/>
         <nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="10"/>
         <nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-10"/>
       </attributes>
     </cluster_property_set>
   </crm_config>
   <nodes>
     <node id="9052abe5-87ee-4400-a008-c5f13205e94b" uname="box01" type="normal"/>
     <node id="47658455-4da2-48d4-a8da-419b2f93f039" uname="box02" type="normal"/>
   </nodes>
   <resources>
     <group id="IPaddr2_OpenSer_group">
       <primitive id="IPaddr2-10.1.53.235" class="ocf" type="IPaddr2" provider="heartbeat">
         <operations>
           <op id="ipaddr2-10.1.53.235-monitor" name="monitor" interval="5s" timeout="3s"/>
         </operations>
         <instance_attributes id="IPaddr2-10.1.53.235-attributes">
           <attributes>
             <nvpair id="ipaddr2-10.1.53.235-ip" name="ip" value="10.1.53.235"/>
             <nvpair id="ipaddr2-10.1.53.235-broadcast" name="broadcast" value="10.1.53.255"/>
             <nvpair id="ipaddr2-10.1.53.235-cidr_netmask" name="cidr_netmask" value="24"/>
           </attributes>
         </instance_attributes>
       </primitive>
       <primitive id="IPaddr2-10.1.53.236" class="ocf" type="IPaddr2" provider="heartbeat">
         <operations>
           <op id="ipaddr2-10.1.53.236-monitor" name="monitor" interval="5s" timeout="3s"/>
         </operations>
         <instance_attributes id="IPaddr2-10.1.53.236-attributes">
           <attributes>
             <nvpair id="ipaddr2-10.1.53.236-ip" name="ip" value="10.1.53.236"/>
             <nvpair id="ipaddr2-10.1.53.236-broadcast" name="broadcast" value="10.1.53.255"/>
             <nvpair id="ipaddr2-10.1.53.236-cidr_netmask" name="cidr_netmask" value="24"/>
           </attributes>
         </instance_attributes>
       </primitive>
       <primitive id="OpenSer" class="ocf" type="OpenSer" provider="bandwidth.com">
         <operations>
           <op id="openser-start" name="start" timeout="5s"/>
           <op id="openser-stop" name="stop" timeout="3s"/>
           <op id="openser-monitor" name="monitor" interval="10s" timeout="3s">
             <instance_attributes id="monitor_10s">
               <attributes>
                 <nvpair id="openser-monitor-ip" name="ip" value="127.0.0.1"/>
               </attributes>
             </instance_attributes>
           </op>
         </operations>
       </primitive>
     </group>
   </resources>
   <constraints>
     <rsc_location id="OpenSer_resource_location" rsc="OpenSer">
       <rule id="rule_box01" score="100">
         <expression id="expression_uname_eq_box01" attribute="#uname" operation="eq" value="box01"/>
       </rule>
       <rule id="rule_box02" score="10">
         <expression id="expression_uname_eq_box02" attribute="#uname" operation="eq" value="box02"/>
       </rule>
     </rsc_location>
   </constraints>
 </configuration>
</cib>

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


