Hi,
I had a strange problem with one of my clusters last night. As far as
I can see, pingd stops doing successful pings after a certain period.
Here is part of the logs from one node of my 2-node cluster:
Feb 12 03:37:21 slave1 pingd: [4744]: info: stand_alone_ping: Node
10.0.1.9 is alive
Feb 12 03:37:21 slave1 pingd: [4744]: info: send_update: 1 active ping nodes
Feb 12 03:37:31 slave1 pingd: [4744]: info: stand_alone_ping: Node
10.0.1.9 is alive
Feb 12 03:37:31 slave1 pingd: [4744]: info: send_update: 1 active ping nodes
Feb 12 03:37:41 slave1 pingd: [4744]: info: ping_read: Retrying...
Feb 12 03:37:47 slave1 pingd: [4744]: info: ping_read: Retrying...
Feb 12 03:37:53 slave1 pingd: [4744]: info: ping_read: Retrying...
Feb 12 03:37:59 slave1 pingd: [4744]: info: ping_read: Retrying...
Feb 12 03:38:05 slave1 pingd: [4744]: info: ping_read: Retrying...
Feb 12 03:38:11 slave1 pingd: [4744]: info: send_update: 0 active ping nodes
Feb 12 03:38:16 slave1 attrd: [4423]: info: attrd_trigger_update:
Sending flush op to all hosts for: pingd
Feb 12 03:38:16 slave1 attrd: [4423]: info: attrd_ha_callback: flush
message from slave1.ab.ns.dns.be
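For reference, these lines were filtered out of syslog (on openSUSE the
default target is /var/log/messages); a sketch of the filter, with two
sample lines inlined here so it stands alone:

```shell
# Filter pingd-related lines from syslog. In production the input is
# /var/log/messages (openSUSE default); a two-line sample is inlined
# here so the snippet runs stand-alone.
sample='Feb 12 03:37:41 slave1 pingd: [4744]: info: ping_read: Retrying...
Feb 12 03:38:11 slave1 pingd: [4744]: info: send_update: 0 active ping nodes'
printf '%s\n' "$sample" | grep -cE 'Retrying'   # -> 1 retry line in the sample
```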
In this case the second node was still running and took over the IP
resources as defined by the constraints of my setup.
But then at 04:53:40 the second node started behaving in the same way.
Now the IP resources just get shut down and my cluster is dead. There
are 2 strange things about this. First, I was still able to ping the
host I defined in the pingd config (it's the gateway to the internet)
using the standard ping command. Second, the time difference between
the two nodes stopping is close to the time difference between
starting heartbeat on both nodes (about 4 hours).
These 2 things give me the impression that something was wrong with
pingd and not with the network, and that pingd stops working after a
certain period. I use the heartbeat and pacemaker packages from the
openSUSE build service; these are the versions installed:
heartbeat-common-2.99.2-6.1
heartbeat-resources-2.99.2-6.1
libheartbeat2-2.99.2-6.1
heartbeat-2.99.2-6.1
libpacemaker3-1.0.1-3.1
pacemaker-1.0.1-3.1
This is my cib.xml:
<cib admin_epoch="0" epoch="104" num_updates="0"
validate-with="pacemaker-1.0" have-quorum="true" crm_feature_set="3.0"
dc-uuid="f39c6a4f-816e-419e-8499-ddb61e8c3515"
cib-last-written="Wed Feb  4 10:17:15 2009">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-stonith-enabled"
name="stonith-enabled" value="0"/>
<nvpair id="cib-bootstrap-options-dc-version"
name="dc-version" value="1.0.1-node:
6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
</cluster_property_set>
</crm_config>
<nodes>
</nodes>
<resources>
<clone id="clone_named1">
<meta_attributes id="clone1_meta_attrs">
<nvpair id="clone1_metaattr_clone_max" name="clone-max" value="2"/>
<nvpair id="clone1_metaattr_clone_node_max"
name="clone-node-max" value="1"/>
</meta_attributes>
<primitive id="resource_named1" class="ocf" type="named" provider="dns">
<operations>
<op id="resource_named1_monitor" name="monitor" interval="60s"/>
<op id="resource_named1_start" name="start" interval="0s"
timeout="120s"/>
<op id="resource_named1_stop" name="stop" interval="0s"
timeout="120s"/>
</operations>
<meta_attributes id="primitive-resource_named1.meta"/>
<instance_attributes id="resource_named1_instance_attrs">
<nvpair id="resource_named1_ip" name="ip"
value="10.0.1.130,10.0.1.131"/>
</instance_attributes>
</primitive>
</clone>
<clone id="clone_ip200">
<meta_attributes id="clone2_meta_attrs">
<nvpair id="clone2_metaattr_clone_max" name="clone-max" value="2"/>
<nvpair id="clone2_metaattr_clone_node_max"
name="clone-node-max" value="2"/>
<nvpair id="clone_ip200_metaattr_target_role"
name="target-role" value="started"/>
<nvpair id="clone_ip200-meta_attributes-resource-stickiness"
name="resource-stickiness" value="0"/>
</meta_attributes>
<primitive id="resource_ip200" class="ocf" type="IPaddr2"
provider="heartbeat">
<operations>
<op id="resource_ip200_start" name="start" interval="0s"
timeout="30s"/>
<op id="resource_ip200_stop" name="stop" interval="0s"
timeout="30s"/>
</operations>
<instance_attributes id="resource_ip200_instance_attrs">
<nvpair id="resource_ip200_ip" name="ip" value="10.0.1.130"/>
<nvpair id="resource_ip200_hash" name="clusterip_hash"
value="sourceip-sourceport"/>
</instance_attributes>
<meta_attributes id="resource_ip200_meta_attrs"/>
</primitive>
</clone>
<clone id="clone_ip201">
<meta_attributes id="clone201_meta_attrs">
<nvpair id="clone201_metaattr_clone_max" name="clone-max" value="2"/>
<nvpair id="clone201_metaattr_clone_node_max"
name="clone-node-max" value="2"/>
<nvpair id="clone_ip201_metaattr_target_role"
name="target-role" value="started"/>
<nvpair id="clone_ip201-meta_attributes-resource-stickiness"
name="resource-stickiness" value="0"/>
</meta_attributes>
<primitive id="resource_ip201" class="ocf" type="IPaddr2"
provider="heartbeat">
<operations>
<op id="resource_ip201_start" name="start" interval="0s"
timeout="30s"/>
<op id="resource_ip201_stop" name="stop" interval="0s"
timeout="30s"/>
</operations>
<instance_attributes id="resource_ip201_instance_attrs">
<nvpair id="resource_ip201_ip" name="ip" value="10.0.1.131"/>
<nvpair id="resource_ip201_hash" name="clusterip_hash"
value="sourceip-sourceport"/>
</instance_attributes>
<meta_attributes id="resource_ip201_meta_attrs"/>
</primitive>
</clone>
<clone id="pingd-clone">
<primitive id="pingd" provider="pacemaker" class="ocf" type="pingd">
<instance_attributes id="pingd-attrs">
<nvpair id="pingd-dampen" name="dampen" value="5s"/>
<nvpair id="pingd-multiplier" name="multiplier" value="1000"/>
<nvpair id="pingd-hosts" name="host_list" value="10.0.1.9"/>
</instance_attributes>
</primitive>
</clone>
</resources>
<constraints>
<rsc_order id="named_before_ip200" first="clone_named1"
then="clone_ip200" then-action="start" first-action="start"/>
<rsc_order id="ip200_before_ip201" first="clone_ip200"
then="clone_ip201" then-action="start" first-action="start"/>
<rsc_location id="clone_ip200-connectivity" rsc="clone_ip200">
<rule id="pingd-200-exclude-rule-not-defined"
score="-INFINITY" boolean-op="or">
<expression id="pingd-200-exclude-not-defined"
attribute="pingd" operation="not_defined"/>
</rule>
<rule id="pingd-200-exclude-rule-less-then-1"
score="-INFINITY" boolean-op="or">
<expression id="pingd-200-exclude-less-then-1"
attribute="pingd" operation="lt" value="1"/>
</rule>
</rsc_location>
<rsc_location id="clone_ip201-connectivity" rsc="clone_ip201">
<rule id="pingd-201-exclude-rule-not-defined"
score="-INFINITY" boolean-op="or">
<expression id="pingd-201-exclude-not-defined"
attribute="pingd" operation="not_defined"/>
</rule>
<rule id="pingd-201-exclude-rule-less-then-1"
score="-INFINITY" boolean-op="or">
<expression id="pingd-201-exclude-less-then-1"
attribute="pingd" operation="lt" value="1"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>
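For reference, the pingd clone and the connectivity rules above
correspond roughly to the following crm shell configuration (a sketch
written from memory, so the exact syntax may vary between crm shell
versions; note that, as in the XML, no monitor operation is defined on
the pingd primitive):

```
primitive pingd ocf:pacemaker:pingd \
        params dampen="5s" multiplier="1000" host_list="10.0.1.9"
clone pingd-clone pingd
location clone_ip200-connectivity clone_ip200 \
        rule -inf: not_defined pingd or pingd lt 1
location clone_ip201-connectivity clone_ip201 \
        rule -inf: not_defined pingd or pingd lt 1
```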
Has anyone seen this before? I have a second cluster setup ready that
needs to be shipped out soon. I can use that one for testing if
required. On the production cluster that failed I have currently
removed the pingd resource and the relevant constraints.
Thanks for your help,
Tim
P.S. Is there any way to make pingd less verbose? I don't really need
to know about each successful ping; a log entry is only needed when
things change and/or fail.
--
Tim Verhoeven - [email protected] - 0479 / 88 11 83
Hoping the problem magically goes away by ignoring it is the
"microsoft approach to programming" and should never be allowed.
(Linus Torvalds)
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems