Hi,

On Tue, Apr 03, 2012 at 01:53:41PM +0200, David Gubler wrote:
> Hi list,
>
> I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
> (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
> with the apache resource agent.
>
> But first things first, my test setup:
>
> root@node0:~# crm configure show
> node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
> node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
> primitive apache ocf:heartbeat:apache \
>     params testconffile="/etc/ha.d/doodletest.pm" testname="doodle"\
>     op monitor interval="30" timeout="120" \
>     meta is-managed="false"
> primitive site0ip ocf:heartbeat:IPaddr \
>     params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
> primitive site1ip ocf:heartbeat:IPaddr \
>     params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
> clone apacheClone apache
> colocation bothips -100: site0ip site1ip
> colocation site0 inf: site0ip apacheClone
> colocation site1 inf: site1ip apacheClone
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>     cluster-infrastructure="Heartbeat" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     last-lrm-refresh="1333391544" \
>     cluster-recheck-interval="15min"
>
> One of the tests I did was simulate a messed up apache (e.g. connection
> limit reached):
> $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
>
> Of course, this should produce a monitor timeout, which should mark the
> apache as failed, and that's what happened.
>
> However, recovery didn't work after I did
> $ iptables -F
>
> The problem, according to what I could figure out:
> The apache resource agent
> /usr/lib/ocf/resource.d/heartbeat/apache
> does not have a timeout set for curl/wget. Curl has a default timeout of
> about 3 minutes, wget may even retry up to 20 times and thus may
> potentially take ages to time out.
>
> Thus, the monitor operation did time out instead of wget (thus,
> pacemaker thinks that the monitor itself has failed instead of the
> service it is monitoring, which is semantically just plain wrong, IMHO).

The timeout is a timeout, wherever it happens.

> Since the resource agent let the (still waiting) wget process hang
> around practically forever, it also didn't notice when apache had
> recovered (after iptables -F).

So, you want the resource agent to notice while running monitor
that it can now talk to the server?

>
> Bottom line:
> I think the apache resource agent badly needs a timeout parameter which
> is supplied to wget/curl and the documentation should make clear that
> the current monitor timeout provided by pacemaker is not a substitute
> for that (it cannot really be used to detect non-responsive web
> servers). I only figured that out after extensive testing and finally
> looking at the source, which took an awful lot of time.
>
> After implementing a workaround:
> WGETOPTS="-O- -q -L -T 5 -t 1 --no-proxy --bind-address=127.0.0.1"
> (added -T 5 -t 1) pacemaker and the apache resource behaved as expected
> even when doing the iptables test above, and apache quickly recovers
> after I do iptables -F.

Indeed in this case specifying a short timeout for the client
would speed things up. It should loop indefinitely in the monitor
op. We may accept a patch :)

> On a side note:
> The apache resource agent allows to supply a config file, where one can
> override the parameters for curl/wget. But the implementation here is
> bogus, because even if you supply this file, it always does a default
> test with default parameters first, so this is useless in this case...
> (I consider this behavior to be a bug).

If you use a config test file, you'd need to define a monitor
with depth 10. The depth 0 monitor (default) is always testing
the statusurl.

> Side note II:
> I did play a lot with on-fail=..., failure-timeout=,
> cluster-recheck-interval=... Changing these values did not help, but in
> some cases produced new weird behavior, e.g. in some cases pacemaker
> didn't even notice that apache was unreachable...
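
For reference, the client-side timeout from the workaround quoted above could be wrapped roughly as follows. This is only a sketch and not the resource agent's actual code: the helper names build_wget_opts and probe_status and the default status URL are hypothetical, chosen for illustration. The wget flags are the ones from David's WGETOPTS line (-T sets the timeout in seconds, -t the number of tries).

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of the apache resource agent.

# Build the wget option string with an explicit timeout (-T, seconds)
# and a fixed number of tries (-t), as in the workaround quoted above.
build_wget_opts() {
    _timeout="${1:-5}"
    _tries="${2:-1}"
    echo "-O- -q -L -T $_timeout -t $_tries --no-proxy --bind-address=127.0.0.1"
}

# Fetch the status page once, discarding the body.
# Returns wget's exit status (0 on success, non-zero on failure).
probe_status() {
    _url="${1:-http://localhost:80/server-status}"   # assumed URL
    # Word splitting of the option string is intentional here.
    wget $(build_wget_opts "$2" "$3") "$_url" >/dev/null
}
```

A call such as `probe_status http://localhost:80/server-status 5 1` would then give up after at most one 5-second attempt, well inside the 120-second monitor timeout configured above, so the failure is attributed to the web server rather than to the monitor operation.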
Cheers,

Dejan

> Best regards,
>
> David
>
> --
> David Gubler
> Senior Software & Operations Engineer
> MeetMe: http://doodle.com/david
> E-Mail: [email protected]
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
