On 05.04.2012 17:14, Dejan Muhamedagic wrote:
> Hmm, the process running the monitor operation should be removed
> (killed) by lrmd on timeout. If that doesn't happen, then you
> just hit a jackpot bug!
Ok, that's crucial information I've been missing, and thus I
misinterpreted my test results. Back to square one...
TEST 1: *Unpatched* Apache resource agent with this configuration:
root@node2:/etc/ha.d# crm configure show
node $id="aa9dea56-ae1e-42a9-a37b-f7c9f5dc5860" node1
node $id="aec6cf09-e141-415d-8957-a7b94e09df7f" node2
primitive apache ocf:heartbeat:apache \
params statusurl="http://localhost/server-status" \
op monitor interval="15s" timeout="5s" \
meta is-managed="false"
clone apacheClone apache
property $id="cib-bootstrap-options" \
dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1333886776"
crm_mon shows
Clone Set: apacheClone [apache]
apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged)
Thus all is well.
Now I do
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
After a few seconds, crm_mon shows
Clone Set: apacheClone [apache]
apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged) FAILED
Failed actions:
apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
Using ps aux, I can see that the monitor and wget are started every 15s
and run up to the timeout, and are then killed, just as you said. So far
so good.
Now I remove the iptables rule:
$ iptables -F
But no matter how long I wait, Pacemaker *doesn't* notice that Apache is
back! Even though the monitor is definitely executed (I can see the
request in Apache's log file). Also, crm_mon keeps saying
Failed actions:
apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
The counters don't change (!)
If I manually do
$ crm resource cleanup apacheClone
then everything is fine again.
TEST 2: *Patched* Apache resource agent with the same configuration.
root@node1:/usr/lib/ocf/resource.d/heartbeat# diff apache apache.orig
66c66
< WGETOPTS="-O- -q -L --no-proxy -T 3 -t 1 --bind-address=127.0.0.1"
---
> WGETOPTS="-O- -q -L --no-proxy --bind-address=127.0.0.1"
So all I did was add two options to wget's command line: -T 3 (give up
after 3 seconds) and -t 1 (try only once).
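To make the effect of the two added options concrete, here is a minimal sketch of a status check that limits itself. It is not the actual code of the apache resource agent: check_status_url is a hypothetical helper name, and the return values 0 and 1 are only meant to mirror OCF_SUCCESS and OCF_ERR_GENERIC. The idea is that wget gives up on the blocked connection after about 3 seconds, which is below the 5s operation timeout configured above.

```shell
#!/bin/sh
# Sketch only, not the real ocf:heartbeat:apache code.
#   -T 3  give up after 3 seconds per network operation
#   -t 1  try once, no retries
WGETOPTS="-O- -q -L --no-proxy -T 3 -t 1 --bind-address=127.0.0.1"

# check_status_url is a hypothetical helper name.
# Returns 0 if the URL could be fetched, 1 otherwise.
check_status_url() {
    # $WGETOPTS is intentionally unquoted so it splits into separate options.
    if wget $WGETOPTS "$1" >/dev/null 2>&1; then
        return 0    # would map to OCF_SUCCESS
    fi
    return 1        # would map to OCF_ERR_GENERIC
}
```

Called as check_status_url "http://localhost/server-status", the function returns on its own while the port is blocked, so the monitor can report an ordinary failure instead of being killed.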
Again, crm_mon shows that all is well.
Again I do
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
Now crm_mon shows
Clone Set: apacheClone [apache]
apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged) FAILED
Failed actions:
apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete): unknown error
NOTE: The "Failed actions" are different from the test before!
Now I remove the iptables rule:
$ iptables -F
After a few seconds, the clone set is back to working state.
Thus, what I'm seeing here:
It does make a difference to Pacemaker whether the monitor operation
returns failure or times out.
Monitor times out:
* apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
* Monitor operation and wget both get killed when the timeout happens
(just as they should)
* Monitor operation keeps getting executed (and presumably returns
success), but this is ignored (!) by Pacemaker
Monitor returns failure (due to wget's timeout):
* apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete): unknown error
* Monitor operation and wget don't need to be killed, because they time
out and complete before the whole monitor operation times out
* Monitor operation keeps getting executed, and on the first success
Pacemaker notices and puts apache back into working state
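The two cases above can be imitated outside the cluster with a small sketch. It assumes GNU coreutils timeout as a stand-in for lrmd's operation timeout (this says nothing about how lrmd works internally), and run_with_limit is a hypothetical helper name. GNU timeout exits with 124 when it had to kill the command, which plays the role of "status=Timed Out" here; any other exit code means the check finished by itself, like "status=complete".

```shell
#!/bin/sh
# Stand-in for the operation timeout: GNU coreutils `timeout` (assumption).
# run_with_limit is a hypothetical helper; it prints how the check ended.
run_with_limit() {
    limit=$1; shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out"          # killed from outside
    else
        echo "complete rc=$rc"    # check returned on its own
    fi
}

# Case 1: the check hangs longer than the outer limit (like unpatched wget).
run_with_limit 1 sleep 3                    # prints "timed out"

# Case 2: the check limits itself and reports failure (like wget -T 3 -t 1).
run_with_limit 5 sh -c 'sleep 1; exit 1'    # prints "complete rc=1"
```

This only shows that the two outcomes are distinguishable to the caller; it does not explain why Pacemaker treats them differently afterwards.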
The big question here is: Is this a bug in Pacemaker or by design?
> Hmm, I thought we were past this... and I still don't see the
> patch :)
I'm still not sure what the actual problem is. Currently I feel like
it's a bug in Pacemaker, and my "fix" for the apache resource agent is
just fighting symptoms.
Sorry for the confusion - This Heartbeat/Pacemaker thing is very hard to
understand.
Best regards,
David