Dejan Muhamedagic wrote:
> Hi,
>
> On Sun, Jan 04, 2009 at 10:04:58AM +0000, Stephen Nelson-Smith wrote:
>> Hi,
>>
>> I am running Heartbeat 2.3 on CentOS 5.2. I have 2 nodes - both
>> apache servers. All I want to achieve is a simple failover:
>>
>> In the case where one of the two nodes is running httpd, if the
>> running node experiences a failure - httpd is stopped, or the machine
>> stops responding (i.e. the network has been lost or the machine is
>> down hard) - fail over to the second node.
>>
>> I seem to have achieved this when starting with a fresh install. I
>> have defined two resources:
>>
>> <resources>
>>   <primitive class="ocf" id="IPaddr_10_0_0_53"
>>       provider="heartbeat" type="IPaddr">
>>     <operations>
>>       <op id="IPaddr_10_0_0_53_mon" interval="5s"
>>           name="monitor" timeout="5s"/>
>>     </operations>
>>     <instance_attributes id="IPaddr_10_0_0_53_inst_attr">
>>       <attributes>
>>         <nvpair id="IPaddr_10_0_0_53_attr_0" name="ip"
>>             value="10.0.0.53"/>
>>       </attributes>
>>     </instance_attributes>
>>   </primitive>
>>   <primitive class="lsb" id="httpd_2" provider="heartbeat"
>>       type="httpd">
>>     <operations>
>>       <op id="httpd_2_mon" interval="20s" name="monitor"
>>           timeout="10s"/>
>>     </operations>
>>   </primitive>
>> </resources>
>>
>> As I understand it, the IP resource (primitive type="IPaddr") has a
>> monitor set to fire every 5 seconds and time out after 5 seconds,
>> and it has one attribute, the IP address itself.
>>
>> The httpd resource (primitive type="httpd") really just refers to
>> the /etc/init.d/httpd script, since it is of class="lsb". It has
>> only a single operation and no attributes - the operation is a
>> monitor which fires every 20 seconds and will time out after 10
>> seconds. For an init script, the monitor just consists of running
>> the script as "/etc/init.d/httpd status" and looking for "running"
>> in the response.
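(As an aside, the "look for running" check described above can be
modelled in a few lines. This is a simplified sketch of the text match
Stephen describes, not the real monitor code - an actual LSB monitor
also honours the init script's exit status, and the sample outputs
below are invented:)

```python
# Simplified model of an "lsb"-class monitor for httpd_2: run
# "/etc/init.d/httpd status" and look for "running" in the output.
# The sample outputs here are illustrative, not captured from a real box.

def lsb_status_looks_running(status_output: str) -> bool:
    """Return True if the status text indicates the daemon is up."""
    return "running" in status_output.lower()

print(lsb_status_looks_running("httpd (pid 1234) is running..."))  # True
print(lsb_status_looks_running("httpd is stopped"))                # False
```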
>>
>> I've defined one constraint:
>>
>> <constraints>
>>   <rsc_colocation id="web_same" from="IPaddr_10_0_0_53"
>>       to="httpd_2" score="INFINITY"/>
>> </constraints>
>>
>> The IP address and the httpd are preferred to run on the same
>> machine, with INFINITE priority - in other words, they MUST run on
>> the same machine.
>>
>> This should have the effect of forcing the migration of both
>> resources together.
>>
>> I've modified default-resource-stickiness and
>> default-resource-failure-stickiness:
>>
>> <nvpair id="cib-bootstrap-options-default-resource-stickiness"
>>     name="default-resource-stickiness" value="1000"/>
>> <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness"
>>     name="default-resource-failure-stickiness" value="-6001"/>
>>
>> AIUI, these two options define how the CRM and the LRM handle
>> failures and failovers.
>>
>> The default-resource-stickiness is the score given to each active
>> resource on the active node, leading to a default score of 2000 for
>> the active node and 0 for the inactive node.
>>
>> When there is a failure, the failure-stickiness score is applied,
>> and since it's negative, it should lower the score on the failed
>> (active) node to below 0, triggering a failover.
>>
>> If the second node fails as well, that node will be taken negative,
>> leaving no nodes capable of running the resources. If a node
>> reboots, it should reset its score to 0, or it can be manually
>> reset by running "crm_failcount -D -r httpd_2" on the
>> previously-failed node.
>>
>> So far so good. Do please correct my understanding if I've gone wrong.
>
> No, everything looks ok. Just don't ask me to calculate the
> stickiness :)
>
>> Live test below:
>>
>> Ok - so taking my cluster, erasing the cib with cibadmin -E, and
>> rebooting both nodes. I've not got httpd starting by default on
>> either machine, so when they come up, I will start httpd on one
>> machine.
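(As an aside, the scoring Stephen describes above can be sketched with
plain arithmetic. This is a back-of-the-envelope model of how the
configured stickiness values combine, not the CRM's actual placement
code:)

```python
# Back-of-the-envelope model of the scores described above: each active
# resource adds default-resource-stickiness (1000) to its current node,
# and each failure adds default-resource-failure-stickiness (-6001).

RESOURCE_STICKINESS = 1000
FAILURE_STICKINESS = -6001

def node_score(active_resources: int, failures: int) -> int:
    return (active_resources * RESOURCE_STICKINESS
            + failures * FAILURE_STICKINESS)

print(node_score(active_resources=2, failures=0))  # 2000: active node
print(node_score(active_resources=0, failures=0))  # 0: standby node
print(node_score(active_resources=2, failures=1))  # -4001: below 0, fail over
```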
>> Interestingly, the result of cibadmin -E seems to have been that
>> cibadmin -Q now times out,
>
> Shouldn't happen.
>
>> so I've hacked around a bit, deleting
>> /var/lib/heartbeat/crm/cib.xml and trying to load it, by making the
>> admin_epoch bigger than that which seemed to be there (though from
>> where I know not).
>
> Fiddling with cib.xml is allowed only when heartbeat/CRM is not
> running. Otherwise, and that's preferred, use the CRM tools
> (crm_resource, cibadmin, etc.).
>
>> $ crm_resource -W -r httpd_2
>>
>> seems to show that httpd_2 is running on node2, and I can confirm
>> this. I don't know how this happened, as I didn't start apache, but
>> it has happened...
>>
>> So - if I shut down httpd on node 2, it should fail over, and it
>> does. So now apache is running on node 1, and node 2 should have a
>> score of -6001, as it failed. This is reflected in the failcount on
>> node 2.
>>
>> I shouldn't be able to move the resource back to node2 - it still
>> has a failure count > 0.
>>
>> However, it seems I can - using crm_resource -M -r httpd_2 -H node2
>
> This inserts a -INFINITY location constraint...
Nope, with -H, it inserts an INFINITY (no minus) location constraint,
which overrides the numeric -6001 (or whatever it had at that point).
This forces httpd_2 to run on node2.

>> Ok - resetting the failcount to 0. The cluster should be in the same
>> state it was before - let's try to kill apache.
>>
>> This time, apache seems to have restarted on node 2, and there was
>> no failover. I don't understand this. The failcount has gone back
>> up to 1, but the resource hasn't moved.
>
> ... which prevents it from ever again starting on this node.
> crm_resource should have printed a warning about it.

See above: node2 now has +INFINITY, so an httpd failure will not have
any effect on the score, as failure stickiness is just a numeric value
(INFINITY - number = INFINITY).

Regards
Dominik

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
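(The INFINITY arithmetic in Dominik's explanation can be sketched
numerically. Heartbeat's CRM represents score="INFINITY" internally as
1000000 and uses saturating addition, so a finite penalty like -6001
cannot pull a node with a +INFINITY constraint down. A simplified
sketch, not the actual CRM code:)

```python
# Sketch of the CRM's saturating score arithmetic: INFINITY is 1000000
# internally, and adding any finite score to +/-INFINITY leaves it pinned.

INFINITY = 1_000_000

def add_scores(a: int, b: int) -> int:
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    return a + b

# node2 with the +INFINITY constraint from "crm_resource -M -H node2":
print(add_scores(INFINITY, -6001))  # 1000000: still INFINITY, no failover
# the same failure penalty against an ordinary stickiness score:
print(add_scores(2000, -6001))      # -4001: drops below 0
```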
