Dejan Muhamedagic wrote:
> Hi,
>
> On Sun, Jan 04, 2009 at 10:04:58AM +0000, Stephen Nelson-Smith wrote:
>> Hi,
>>
>> I am running Heartbeat 2.3 on CentOS 5.2. I have 2 nodes - both
>> apache servers. All I want to achieve is a simple failover:
>>
>> In the case where one of the two nodes is running httpd, if the
>> running node experiences a failure - httpd is stopped, or the machine
>> stops responding (i.e. the network has been lost or the machine is
>> down hard) - fail over to the second node.
>>
>> I seem to have achieved this when starting with a fresh install. I
>> have defined two resources:
>>
>> <resources>
>>   <primitive class="ocf" id="IPaddr_10_0_0_53"
>>       provider="heartbeat" type="IPaddr">
>>     <operations>
>>       <op id="IPaddr_10_0_0_53_mon" interval="5s"
>>           name="monitor" timeout="5s"/>
>>     </operations>
>>     <instance_attributes id="IPaddr_10_0_0_53_inst_attr">
>>       <attributes>
>>         <nvpair id="IPaddr_10_0_0_53_attr_0" name="ip"
>>             value="10.0.0.53"/>
>>       </attributes>
>>     </instance_attributes>
>>   </primitive>
>>   <primitive class="lsb" id="httpd_2" provider="heartbeat"
>>       type="httpd">
>>     <operations>
>>       <op id="httpd_2_mon" interval="20s" name="monitor"
>>           timeout="10s"/>
>>     </operations>
>>   </primitive>
>> </resources>
>>
>> As I understand it, the IP resource (primitive type="IPaddr") has a
>> monitor set to fire every 5 seconds and time out after 5 seconds,
>> and it has one attribute, the IP address itself.
>>
>> The httpd resource (primitive type="httpd") really just refers to
>> the /etc/init.d/httpd script, since it is of class="lsb". It has
>> only a single operation and no attributes - the operation is a
>> monitor which fires every 20 seconds and will time out after 10
>> seconds. For an init script, the monitor just consists of running
>> the script as "/etc/init.d/httpd status" and looking for "running"
>> in the response.
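(As an aside, the "look for running" check described above can be
modelled in a few lines. This is a simplified sketch of the text match
Stephen describes, not the real monitor code - an actual LSB monitor
also honours the init script's exit status, and the sample outputs
below are invented:)

```python
# Simplified model of an "lsb"-class monitor for httpd_2: run
# "/etc/init.d/httpd status" and look for "running" in the output.
# The sample outputs here are illustrative, not captured from a real box.

def lsb_status_looks_running(status_output: str) -> bool:
    """Return True if the status text indicates the daemon is up."""
    return "running" in status_output.lower()

print(lsb_status_looks_running("httpd (pid 1234) is running..."))  # True
print(lsb_status_looks_running("httpd is stopped"))                # False
```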
>>
>> I've defined one constraint:
>>
>> <constraints>
>>   <rsc_colocation id="web_same" from="IPaddr_10_0_0_53"
>>       to="httpd_2" score="INFINITY"/>
>> </constraints>
>>
>> The IP address and the httpd are preferred to run on the same
>> machine, with INFINITE priority - in other words, they MUST run on
>> the same machine.
>>
>> This should have the effect of forcing the migration of both
>> resources together.
>>
>> I've modified default-resource-stickiness and
>> default-resource-failure-stickiness:
>>
>> <nvpair id="cib-bootstrap-options-default-resource-stickiness"
>>     name="default-resource-stickiness" value="1000"/>
>> <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness"
>>     name="default-resource-failure-stickiness" value="-6001"/>
>>
>> AIUI, these two options define how the CRM and the LRM handle
>> failures and failovers.
>>
>> The default-resource-stickiness is the score given to each active
>> resource on the active node, leading to a default score of 2000 for
>> the active node and 0 for the inactive node.
>>
>> When there is a failure, the failure-stickiness score is applied,
>> and since it's negative, it should lower the score on the failed
>> (active) node to below 0, triggering a failover.
>>
>> If the second node fails as well, that node will be taken negative,
>> leaving no nodes capable of running the resources. If a node
>> reboots, it should reset its score to 0, or it can be manually
>> reset by running "crm_failcount -D -r httpd_2" on the
>> previously-failed node.
>>
>> So far so good. Do please correct my understanding if I've gone wrong.
>
> No, everything looks ok. Just don't ask me to calculate the
> stickiness :)
>
>> Live test below:
>>
>> Ok - so taking my cluster, erasing the cib with cibadmin -E, and
>> rebooting both nodes. I've not got httpd starting by default on
>> either machine, so when they come up, I will start httpd on one
>> machine.
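(As an aside, the scoring Stephen describes above can be sketched with
plain arithmetic. This is a back-of-the-envelope model of how the
configured stickiness values combine, not the CRM's actual placement
code:)

```python
# Back-of-the-envelope model of the scores described above: each active
# resource adds default-resource-stickiness (1000) to its current node,
# and each failure adds default-resource-failure-stickiness (-6001).

RESOURCE_STICKINESS = 1000
FAILURE_STICKINESS = -6001

def node_score(active_resources: int, failures: int) -> int:
    return (active_resources * RESOURCE_STICKINESS
            + failures * FAILURE_STICKINESS)

print(node_score(active_resources=2, failures=0))  # 2000: active node
print(node_score(active_resources=0, failures=0))  # 0: standby node
print(node_score(active_resources=2, failures=1))  # -4001: below 0, fail over
```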
>> Interestingly, the result of cibadmin -E seems to have been that
>> cibadmin -Q now times out,
>
> Shouldn't happen.
>
>> so I've hacked around a bit, deleting
>> /var/lib/heartbeat/crm/cib.xml and trying to load it, by making the
>> admin_epoch bigger than that which seemed to be there (though from
>> where I know not).
>
> Fiddling with cib.xml is allowed only when heartbeat/CRM is not
> running. Otherwise, and that's preferred, use the CRM tools
> (crm_resource, cibadmin, etc.).
>
>> $ crm_resource -W -r httpd_2
>>
>> seems to show that httpd_2 is running on node2, and I can confirm
>> this. I don't know how this happened, as I didn't start apache, but
>> it has happened...
>>
>> So - if I shut down httpd on node 2, it should fail over, and it
>> does. So now apache is running on node 1, and node 2 should have a
>> score of -6001, as it failed. This is reflected in the failcount on
>> node 2.
>>
>> I shouldn't be able to move the resource back to node2 - it still
>> has a failure count > 0.
>>
>> However, it seems I can - using crm_resource -M -r httpd_2 -H node2
>
> This inserts a -INFINITY location constraint...
Nope, with -H, it inserts an INFINITY (no minus) location constraint,
which overrides the numeric -6001 (or whatever it had at that point).
This forces httpd_2 to run on node2.

>> Ok - resetting the failcount to 0. The cluster should be in the same
>> state it was before - let's try to kill apache.
>>
>> This time, apache seems to have restarted on node 2, and there was
>> no failover. I don't understand this. The failcount has gone back
>> up to 1, but the resource hasn't moved.
>
> ... which prevents it from ever again starting on this node.
> crm_resource should have printed a warning about it.

See above: node2 now has +INFINITY, so an httpd failure will not have
any effect on the score, as failure stickiness is just a numeric value
(INFINITY - number = INFINITY).

Regards
Dominik

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
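(The INFINITY arithmetic in Dominik's explanation can be sketched
numerically. Heartbeat's CRM represents score="INFINITY" internally as
1000000 and uses saturating addition, so a finite penalty like -6001
cannot pull a node with a +INFINITY constraint down. A simplified
sketch, not the actual CRM code:)

```python
# Sketch of the CRM's saturating score arithmetic: INFINITY is 1000000
# internally, and adding any finite score to +/-INFINITY leaves it pinned.

INFINITY = 1_000_000

def add_scores(a: int, b: int) -> int:
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    return a + b

# node2 with the +INFINITY constraint from "crm_resource -M -H node2":
print(add_scores(INFINITY, -6001))  # 1000000: still INFINITY, no failover
# the same failure penalty against an ordinary stickiness score:
print(add_scores(2000, -6001))      # -4001: drops below 0
```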
