Re: [Linux-HA] ocf:heartbeat:apache resource agent and timeouts

Dejan Muhamedagic Wed, 04 Apr 2012 08:56:31 -0700

Hi,

On Tue, Apr 03, 2012 at 01:53:41PM +0200, David Gubler wrote:
> Hi list,
> 
> I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10 
> (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue 
> with the apache resource agent.
> 
> But first things first, my test setup:
> 
> root@node0:~# crm configure show
> node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
> node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
> primitive apache ocf:heartbeat:apache \
>          params testconffile="/etc/ha.d/doodletest.pm" testname="doodle"\
>          op monitor interval="30" timeout="120" \
>          meta is-managed="false"
> primitive site0ip ocf:heartbeat:IPaddr \
>          params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
> primitive site1ip ocf:heartbeat:IPaddr \
>          params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
> clone apacheClone apache
> colocation bothips -100: site0ip site1ip
> colocation site0 inf: site0ip apacheClone
> colocation site1 inf: site1ip apacheClone
> property $id="cib-bootstrap-options" \
>          dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>          cluster-infrastructure="Heartbeat" \
>          no-quorum-policy="ignore" \
>          stonith-enabled="false" \
>          last-lrm-refresh="1333391544" \
>          cluster-recheck-interval="15min"
> 
> 
> 
> One of the tests I did was to simulate a messed-up apache (e.g. connection
> limit reached):
> $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
> 
> Of course, this should produce a monitor timeout, which should mark the 
> apache as failed, and that's what happened.
> 
> However, recovery didn't work after I did
> $ iptables -F
> 
> The problem, according to what I could figure out:
> The apache resource agent
> /usr/lib/ocf/resource.d/heartbeat/apache
> does not have a timeout set for curl/wget. Curl has a default timeout of 
> about 3 minutes, wget may even retry up to 20 times and thus may 
> potentially take ages to time out.
> 
> As a result, the monitor operation timed out rather than wget (so
> pacemaker thinks that the monitor itself has failed instead of the
> service it is monitoring, which is semantically just plain wrong, IMHO).


The timeout is a timeout, wherever it happens.

> Since the resource agent let the (still waiting) wget process hang
> around practically forever, it also didn't notice when apache had
> recovered (after iptables -F).

So, you want the resource agent to notice while running monitor
that it can now talk to the server?

> 
> Bottom line:
> I think the apache resource agent badly needs a timeout parameter which 
> is supplied to wget/curl and the documentation should make clear that 
> the current monitor timeout provided by pacemaker is not a substitute 
> for that (it cannot really be used to detect non-responsive web 
> servers). I only figured that out after extensive testing and finally 
> looking at the source, which took an awful lot of time.
> 
> After implementing a workaround:
> WGETOPTS="-O- -q -L -T 5 -t 1 --no-proxy --bind-address=127.0.0.1"
> (added -T 5 -t 1) pacemaker and the apache resource behaved as expected 
> even when doing the iptables test above, and apache quickly recovers 
> after I do iptables -F.

Indeed in this case specifying a short timeout for the client
would speed things up. It should loop indefinitely in the
monitor op. We may accept a patch :)
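
A minimal sketch of what such a patch could look like, assuming a
hypothetical agent parameter exposed as OCF_RESKEY_client_timeout (this
name is an assumption, not a parameter of the shipped agent). The wget
options are the ones from the workaround quoted above; the curl options
are real curl flags, but their use here is only illustrative:

```shell
#!/bin/sh
# Sketch only: bound the HTTP client's own timeout so that it gives up
# well before the monitor op timeout configured in Pacemaker.
# OCF_RESKEY_client_timeout is an assumed (hypothetical) parameter name.
client_timeout="${OCF_RESKEY_client_timeout:-5}"

# wget: -T sets the network timeout in seconds, -t the number of tries.
wget_opts() {
    echo "-O- -q -L -T $1 -t 1 --no-proxy --bind-address=127.0.0.1"
}

# curl: --connect-timeout bounds the connect phase,
# --max-time bounds the whole transfer (both in seconds).
curl_opts() {
    echo "-o - -Ss -L --connect-timeout $1 --max-time $1"
}

echo "wget options: $(wget_opts "$client_timeout")"
echo "curl options: $(curl_opts "$client_timeout")"
```

With the client timeout kept well below the monitor op timeout, a
blocked port makes the client fail quickly, so the agent can report
the failure itself instead of being killed when the op timeout expires.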

> On a side note:
> The apache resource agent lets you supply a config file in which you can
> override the parameters for curl/wget. But the implementation here is
> bogus, because even if you supply this file, it always does a default 
> test with default parameters first, so this is useless in this case... 
> (I consider this behavior to be a bug).

If you use a config test file, you'd need to define a monitor
with depth 10 (OCF_CHECK_LEVEL=10). The depth 0 monitor (the
default) always tests the statusurl.
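
As a rough illustration, based on the primitive quoted at the top of
this message, a second monitor op at depth 10 could be added in the crm
shell along these lines. The interval and timeout of the second op are
assumed values, chosen only so that the two monitor ops have different
intervals:

```
primitive apache ocf:heartbeat:apache \
         params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
         op monitor interval="30" timeout="120" \
         op monitor interval="60" timeout="120" OCF_CHECK_LEVEL="10" \
         meta is-managed="false"
```

The first op stays the basic statusurl check; the second is the one
that would run the tests from the config file.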

> Side note II:
> I did play a lot with on-fail=..., failure-timeout=, 
> cluster-recheck-interval=... Changing these values did not help, but in 
> some cases produced new weird behavior, e.g. in some cases pacemaker 
> didn't even notice that apache was unreachable...

Cheers,

Dejan

> Best regards,
> 
> David
> 
> -- 
> David Gubler
> Senior Software & Operations Engineer
> MeetMe: http://doodle.com/david
> E-Mail: [email protected]
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
