You hit a pingd bug: the counter wraps around and pingd wasn't able to handle it. Both this and the logging verbosity are fixed for the next version.
On Thu, Feb 12, 2009 at 11:33, Tim Verhoeven <[email protected]> wrote:
> Hi,
>
> I had a strange problem with one of my clusters last night. As far as
> I can see, it seems that pingd fails to do successful pings after a
> certain period. Here is part of the logs from one node of my two-node
> cluster:
>
> Feb 12 03:37:21 slave1 pingd: [4744]: info: stand_alone_ping: Node 10.0.1.9 is alive
> Feb 12 03:37:21 slave1 pingd: [4744]: info: send_update: 1 active ping nodes
> Feb 12 03:37:31 slave1 pingd: [4744]: info: stand_alone_ping: Node 10.0.1.9 is alive
> Feb 12 03:37:31 slave1 pingd: [4744]: info: send_update: 1 active ping nodes
> Feb 12 03:37:41 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:37:47 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:37:53 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:37:59 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:38:05 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:38:11 slave1 pingd: [4744]: info: send_update: 0 active ping nodes
> Feb 12 03:38:16 slave1 attrd: [4423]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd
> Feb 12 03:38:16 slave1 attrd: [4423]: info: attrd_ha_callback: flush message from slave1.ab.ns.dns.be
>
> In this case the second node was still running and took over the IP
> resources as defined by the constraints of my setup.
>
> But then at 04:53:40 the second node started behaving in the same way.
> Now the IP resources just get shut down and my cluster is dead. There
> are two strange things about this. First, I was still able to ping the
> host I defined in the pingd config (it's the gateway to the internet)
> using the standard ping command. Second, the time difference between
> both nodes stopping is close to the time difference between starting
> heartbeat on both nodes (about 4 hours difference).
>
> These two things give me the impression that there was something wrong
> with pingd and not with the network, and that pingd stops working
> after a certain period. I use the heartbeat and pacemaker packages
> from the openSUSE build service; these are the versions installed:
>
> heartbeat-common-2.99.2-6.1
> heartbeat-resources-2.99.2-6.1
> libheartbeat2-2.99.2-6.1
> heartbeat-2.99.2-6.1
> libpacemaker3-1.0.1-3.1
> pacemaker-1.0.1-3.1
>
> This is my cib.xml:
>
> <cib admin_epoch="0" epoch="104" num_updates="0" validate-with="pacemaker-1.0"
>      have-quorum="true" crm_feature_set="3.0"
>      dc-uuid="f39c6a4f-816e-419e-8499-ddb61e8c3515"
>      cib-last-written="Wed Feb 4 10:17:15 2009">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="0"/>
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>     </nodes>
>     <resources>
>       <clone id="clone_named1">
>         <meta_attributes id="clone1_meta_attrs">
>           <nvpair id="clone1_metaattr_clone_max" name="clone-max" value="2"/>
>           <nvpair id="clone1_metaattr_clone_node_max" name="clone-node-max" value="1"/>
>         </meta_attributes>
>         <primitive id="resource_named1" class="ocf" type="named" provider="dns">
>           <operations>
>             <op id="resource_named1_monitor" name="monitor" interval="60s"/>
>             <op id="resource_named1_start" name="start" interval="0s" timeout="120s"/>
>             <op id="resource_named1_stop" name="stop" interval="0s" timeout="120s"/>
>           </operations>
>           <meta_attributes id="primitive-resource_named1.meta"/>
>           <instance_attributes id="resource_named1_instance_attrs">
>             <nvpair id="resource_named1_ip" name="ip" value="10.0.1.130,10.0.1.131"/>
>           </instance_attributes>
>         </primitive>
>       </clone>
>       <clone id="clone_ip200">
>         <meta_attributes id="clone2_meta_attrs">
>           <nvpair id="clone2_metaattr_clone_max" name="clone-max" value="2"/>
>           <nvpair id="clone2_metaattr_clone_node_max" name="clone-node-max" value="2"/>
>           <nvpair id="clone_ip200_metaattr_target_role" name="target-role" value="started"/>
>           <nvpair id="clone_ip200-meta_attributes-resource-stickiness" name="resource-stickiness" value="0"/>
>         </meta_attributes>
>         <primitive id="resource_ip200" class="ocf" type="IPaddr2" provider="heartbeat">
>           <operations>
>             <op id="resource_ip200_start" name="start" interval="0s" timeout="30s"/>
>             <op id="resource_ip200_stop" name="stop" interval="0s" timeout="30s"/>
>           </operations>
>           <instance_attributes id="resource_ip200_instance_attrs">
>             <nvpair id="resource_ip200_ip" name="ip" value="10.0.1.130"/>
>             <nvpair id="resource_ip200_hash" name="clusterip_hash" value="sourceip-sourceport"/>
>           </instance_attributes>
>           <meta_attributes id="resource_ip200_meta_attrs"/>
>         </primitive>
>       </clone>
>       <clone id="clone_ip201">
>         <meta_attributes id="clone201_meta_attrs">
>           <nvpair id="clone201_metaattr_clone_max" name="clone-max" value="2"/>
>           <nvpair id="clone201_metaattr_clone_node_max" name="clone-node-max" value="2"/>
>           <nvpair id="clone_ip201_metaattr_target_role" name="target-role" value="started"/>
>           <nvpair id="clone_ip201-meta_attributes-resource-stickiness" name="resource-stickiness" value="0"/>
>         </meta_attributes>
>         <primitive id="resource_ip201" class="ocf" type="IPaddr2" provider="heartbeat">
>           <operations>
>             <op id="resource_ip201_start" name="start" interval="0s" timeout="30s"/>
>             <op id="resource_ip201_stop" name="stop" interval="0s" timeout="30s"/>
>           </operations>
>           <instance_attributes id="resource_ip201_instance_attrs">
>             <nvpair id="resource_ip201_ip" name="ip" value="10.0.1.131"/>
>             <nvpair id="resource_ip201_hash" name="clusterip_hash" value="sourceip-sourceport"/>
>           </instance_attributes>
>           <meta_attributes id="resource_ip201_meta_attrs"/>
>         </primitive>
>       </clone>
>       <clone id="pingd-clone">
>         <primitive id="pingd" provider="pacemaker" class="ocf" type="pingd">
>           <instance_attributes id="pingd-attrs">
>             <nvpair id="pingd-dampen" name="dampen" value="5s"/>
>             <nvpair id="pingd-multiplier" name="multiplier" value="1000"/>
>             <nvpair id="pingd-hosts" name="host_list" value="10.0.1.9"/>
>           </instance_attributes>
>         </primitive>
>       </clone>
>     </resources>
>     <constraints>
>       <rsc_order id="named_before_ip200" first="clone_named1" then="clone_ip200" then-action="start" first-action="start"/>
>       <rsc_order id="ip200_before_ip201" first="clone_ip200" then="clone_ip201" then-action="start" first-action="start"/>
>       <rsc_location id="clone_ip200-connectivity" rsc="clone_ip200">
>         <rule id="pingd-200-exclude-rule-not-defined" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-200-exclude-not-defined" attribute="pingd" operation="not_defined"/>
>         </rule>
>         <rule id="pingd-200-exclude-rule-less-then-1" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-200-exclude-less-then-1" attribute="pingd" operation="lt" value="1"/>
>         </rule>
>       </rsc_location>
>       <rsc_location id="clone_ip201-connectivity" rsc="clone_ip201">
>         <rule id="pingd-201-exclude-rule-not-defined" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-201-exclude-not-defined" attribute="pingd" operation="not_defined"/>
>         </rule>
>         <rule id="pingd-201-exclude-rule-less-then-1" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-201-exclude-less-then-1" attribute="pingd" operation="lt" value="1"/>
>         </rule>
>       </rsc_location>
>     </constraints>
>   </configuration>
> </cib>
>
> Has anyone seen this before? I have a second cluster setup ready that
> needs to be shipped out soon. I can use that one for testing if
> required. On the production cluster that failed I have currently
> removed the pingd resource and the relevant constraints.
>
> Thanks for your help,
> Tim
>
> P.S. Is there any way to make pingd less verbose? I don't really need
> to know each time it did a successful ping. A log entry is only
> required when things change and/or fail.
>
> --
> Tim Verhoeven - [email protected] - 0479 / 88 11 83
>
> Hoping the problem magically goes away by ignoring it is the
> "microsoft approach to programming" and should never be allowed.
> (Linus Torvalds)

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
