Hi list,

I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10 
(Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue 
with the apache resource agent.

But first things first, my test setup:

root@node0:~# crm configure show
node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
primitive apache ocf:heartbeat:apache \
         params testconffile="/etc/ha.d/doodletest.pm" testname="doodle"\
         op monitor interval="30" timeout="120" \
         meta is-managed="false"
primitive site0ip ocf:heartbeat:IPaddr \
         params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
primitive site1ip ocf:heartbeat:IPaddr \
         params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
clone apacheClone apache
colocation bothips -100: site0ip site1ip
colocation site0 inf: site0ip apacheClone
colocation site1 inf: site1ip apacheClone
property $id="cib-bootstrap-options" \
         dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
         cluster-infrastructure="Heartbeat" \
         no-quorum-policy="ignore" \
         stonith-enabled="false" \
         last-lrm-refresh="1333391544" \
         cluster-recheck-interval="15min"



One of the tests I did was to simulate a messed up apache (e.g. connection 
limit reached):
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

As expected, this produced a monitor timeout, and the apache resource 
was marked as failed.

However, the resource did not recover after I removed the rule again:
$ iptables -F
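
For anyone who wants to repeat the test, here is a minimal sketch of the
two steps as shell functions. The function names and the DRY_RUN switch
are hypothetical helpers made up for this illustration; only the iptables
rule itself comes from the test above. The sketch uses iptables -D, which
deletes just the one rule, instead of -F, which flushes every rule in the
chain.

```shell
#!/bin/sh
# Sketch of the failure-injection test. DRY_RUN is a hypothetical switch
# (not an iptables feature): when set to 1, commands are printed, not run.
RULE="INPUT -p tcp --dport 80 -i lo -j DROP"

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drop loopback traffic to port 80 so local HTTP checks hang (needs root).
block_local_http() {
    # $RULE is left unquoted on purpose so it splits into separate arguments.
    run iptables -I $RULE
}

# Delete only the rule added above, leaving all other rules in place.
unblock_local_http() {
    run iptables -D $RULE
}
```

Run block_local_http as root, wait for the monitor to fail, then run
unblock_local_http and watch whether the resource recovers.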

The problem, according to what I could figure out:
The apache resource agent
/usr/lib/ocf/resource.d/heartbeat/apache
does not set a timeout for curl/wget. Curl has a default timeout of 
about 3 minutes; wget may even retry up to 20 times and can thus 
take ages to time out.

As a result, the monitor operation timed out rather than wget. Pacemaker 
therefore thinks that the monitor itself has failed rather than the 
service it is monitoring, which is semantically just plain wrong, IMHO. 
Since the resource agent left the (still waiting) wget process hanging 
around practically forever, it also didn't notice when apache had 
recovered (after iptables -F).


Bottom line:
I think the apache resource agent badly needs a timeout parameter that 
is passed to wget/curl, and the documentation should make clear that 
the current monitor timeout provided by pacemaker is not a substitute 
for that (it cannot really be used to detect non-responsive web 
servers). I only figured that out after extensive testing and finally 
looking at the source, which took an awful lot of time.
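
The distinction can be sketched as follows: the probe gets its own
deadline, well below the monitor timeout, and a probe that hits the
deadline is reported as a failed service instead of a hung monitor.
This is only an illustration under assumptions, not how the agent is
implemented: probe_with_deadline is a hypothetical name, and it assumes
that timeout(1) from GNU coreutils is installed. The two return codes
are the standard OCF values.

```shell
#!/bin/sh
# Standard OCF return codes used by resource agents.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical wrapper: run a probe command with its own deadline,
# chosen well below the Pacemaker monitor timeout.
# Usage: probe_with_deadline SECONDS COMMAND [ARGS...]
probe_with_deadline() {
    deadline=$1
    shift
    # coreutils timeout(1) kills the command after $deadline seconds
    # and then exits with status 124.
    timeout "$deadline" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        return $OCF_SUCCESS
    fi
    # Timed out (124) or failed for any other reason: report the service
    # as failed instead of letting the monitor operation itself hang.
    return $OCF_ERR_GENERIC
}
```

A call could look like
probe_with_deadline 10 wget -O- -q -L --no-proxy http://localhost/
where the URL and the 10 seconds are example values only.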

After implementing a workaround:
WGETOPTS="-O- -q -L -T 5 -t 1 --no-proxy --bind-address=127.0.0.1"
(added -T 5 -t 1), pacemaker and the apache resource behaved as expected 
even during the iptables test above, and apache quickly recovered 
after I did iptables -F.
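
For systems where the agent ends up using curl instead of wget, a
comparable bound might look like the sketch below. The helper name
bounded_opts is hypothetical, and the curl option string is an untested
assumption on my side; only the wget string is the workaround shown
above. Note that wget's -T applies to DNS lookup, connect and read
separately, while curl's --max-time caps the whole transfer, so the two
are similar but not identical.

```shell
#!/bin/sh
# Hypothetical helper: print time-bounded options for the chosen client.
# Usage: bounded_opts wget|curl [SECONDS]
bounded_opts() {
    client=$1
    limit=${2:-5}   # seconds; 5 mirrors the wget workaround
    case "$client" in
        wget)
            # -T: DNS/connect/read timeout, -t 1: one attempt, no retries
            echo "-O- -q -L -T $limit -t 1 --no-proxy --bind-address=127.0.0.1"
            ;;
        curl)
            # --connect-timeout bounds the connect phase,
            # --max-time bounds the entire operation
            echo "-s -L --connect-timeout $limit --max-time $limit"
            ;;
        *)
            echo "unsupported client: $client" >&2
            return 1
            ;;
    esac
}
```

For example, bounded_opts curl 3 prints the curl options with a
3 second limit.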



On a side note:
The apache resource agent allows you to supply a config file in which 
you can override the parameters for curl/wget. But the implementation 
is bogus: even if you supply this file, the agent always runs a default 
test with default parameters first, so the file is useless in this 
case... (I consider this behavior to be a bug).

Side note II:
I played a lot with on-fail=..., failure-timeout=... and 
cluster-recheck-interval=... Changing these values did not help, and 
sometimes produced new weird behavior, e.g. pacemaker didn't even 
notice that apache was unreachable...

Best regards,

David

-- 
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: [email protected]
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
