On 05.04.2012 17:14, Dejan Muhamedagic wrote:
> Hmm, the process running the monitor operation should be removed
> (killed) by lrmd on timeout. If that doesn't happen, then you
> just hit a jackpot bug!

Ok, that's crucial information I've been missing, and thus I 
misinterpreted my test results. Back to square one...

TEST 1: *Unpatched* Apache resource agent with this configuration:

root@node2:/etc/ha.d# crm configure show
node $id="aa9dea56-ae1e-42a9-a37b-f7c9f5dc5860" node1
node $id="aec6cf09-e141-415d-8957-a7b94e09df7f" node2
primitive apache ocf:heartbeat:apache \
     params statusurl="http://localhost/server-status" \
     op monitor interval="15s" timeout="5s" \
     meta is-managed="false"
clone apacheClone apache
property $id="cib-bootstrap-options" \
     dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
     cluster-infrastructure="Heartbeat" \
     stonith-enabled="false" \
     no-quorum-policy="ignore" \
     last-lrm-refresh="1333886776"


crm_mon shows
  Clone Set: apacheClone [apache]
      apache:0   (ocf::heartbeat:apache):        Started node2 (unmanaged)
      apache:1   (ocf::heartbeat:apache):        Started node1 (unmanaged)
Thus all is well.

Now I do
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

After a few seconds, crm_mon shows
  Clone Set: apacheClone [apache]
      apache:0   (ocf::heartbeat:apache):        Started node2 (unmanaged)
      apache:1   (ocf::heartbeat:apache):        Started node1 (unmanaged) FAILED
Failed actions:
     apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
Using ps aux, I can see that the monitor and wget are started every 15s,
run up to the timeout, and are then killed, just as you said. So far so
good.

Now I remove the iptables rule:
$ iptables -F

But no matter how long I wait, Pacemaker *doesn't* notice that Apache is
back, even though the monitor is definitely executed (I can see the
request in Apache's log file). Also, crm_mon keeps saying
Failed actions:
     apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
The counters don't change (!)

If I manually do
$ crm resource cleanup apacheClone
then everything is fine again.



TEST 2: *Patched* Apache resource agent with the same configuration.
root@node1:/usr/lib/ocf/resource.d/heartbeat# diff apache apache.orig
66c66
< WGETOPTS="-O- -q -L --no-proxy -T 3 -t 1 --bind-address=127.0.0.1"
---
> WGETOPTS="-O- -q -L --no-proxy --bind-address=127.0.0.1"
So all I did was add two options to wget's command line: -T 3 (a
3-second timeout) and -t 1 (a single try, no retries).
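
To make the intent of adding -T 3 -t 1 concrete, here is a minimal
sketch of the timing budget. It assumes wget's worst case is roughly
timeout times tries; that ignores waits between retries and the fact
that -T applies separately to DNS, connect and read. The helper name
fits_in_op_timeout is hypothetical and not part of the resource agent.

```shell
#!/bin/sh
# Hypothetical helper (not part of the apache resource agent):
# succeeds if wget's rough worst case (timeout * tries) is shorter than
# the monitor operation's timeout, so wget fails before lrmd has to kill it.
fits_in_op_timeout() {
    wget_timeout=$1   # value passed to wget -T, in seconds
    wget_tries=$2     # value passed to wget -t
    op_timeout=$3     # "timeout" of the monitor op, in seconds
    [ $((wget_timeout * wget_tries)) -lt "$op_timeout" ]
}

# Values from the patched agent (-T 3 -t 1) and the config (timeout="5s").
if fits_in_op_timeout 3 1 5; then
    echo "wget gives up first: the monitor returns an error itself"
else
    echo "op timeout fires first: lrmd has to kill the monitor"
fi
```

With 3s * 1 try against a 5s operation timeout, the first branch is
taken, which matches what I see in TEST 2 below.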

Again, crm_mon shows that all is well.
Again I do
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

Now crm_mon shows
Clone Set: apacheClone [apache]
      apache:0   (ocf::heartbeat:apache):        Started node2 (unmanaged)
      apache:1   (ocf::heartbeat:apache):        Started node1 (unmanaged) FAILED
Failed actions:
     apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete): unknown error
NOTE: The "Failed actions" are different from the test before!

Now I remove the iptables rule:
$ iptables -F

After a few seconds, the clone set is back to working state.



Thus, what I'm seeing here:

It does make a difference to Pacemaker whether the monitor operation 
returns failure or times out.

Monitor times out:
* apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out): unknown exec error
* Monitor operation and wget both get killed when the timeout happens 
(just as they should)
* Monitor operation keeps getting executed (and presumably returns 
success), but this is ignored (!) by Pacemaker

Monitor returns failure (due to wget's timeout):
* apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete): unknown error
* Monitor operation and wget don't need to be killed, because they time 
out and complete before the whole monitor operation times out
* Monitor operation keeps getting executed, and on first success 
Pacemaker notices and puts apache back into working state
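
The two exit paths above can be reproduced without a cluster. This is
only a sketch: coreutils timeout stands in for lrmd, sleep 10 stands in
for a wget that hangs on the DROPped port, and the 2s/1s values are
scaled-down assumptions rather than the 5s/3s from my setup. It shows
how the check ends in each case, not what Pacemaker does with the result.

```shell
#!/bin/sh
op_timeout=2   # plays the role of the monitor operation timeout

# Case 1: the check hangs, so the supervisor has to kill it.
timeout "$op_timeout" sleep 10
case1_rc=$?

# Case 2: the check bounds itself (1s) and reports failure on its own,
# well before the supervisor's timeout is reached.
timeout "$op_timeout" sh -c 'timeout 1 sleep 10 || exit 1'
case2_rc=$?

echo "killed by supervisor: rc=$case1_rc"
echo "failed by itself:     rc=$case2_rc"
```

In the first case the exit status is 124, which is what coreutils
timeout reports when it had to kill the command. In the second case it
is the check's own exit status 1.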


The big question here is: Is this a bug in Pacemaker or by design?


> Hmm, I thought we were past this... and I still don't see the
> patch :)
I'm still not sure what the actual problem is. Currently I feel like 
it's a bug in Pacemaker, and my "fix" for the apache resource agent is 
just fighting symptoms.

Sorry for the confusion. This Heartbeat/Pacemaker thing is very hard to
understand.

Best regards,

David
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
