You hit a pingd bug: the counter wraps around and pingd wasn't able to handle it. Both this and the logging verbosity are fixed for the next version.
On Thu, Feb 12, 2009 at 11:33, Tim Verhoeven <[email protected]> wrote:
> Hi,
>
> I had a strange problem with one of my clusters last night. As far as
> I can see, it seems that pingd fails to do successful pings after a
> certain period. Here is part of the logs from one node of my two-node
> cluster:
>
> Feb 12 03:37:21 slave1 pingd: [4744]: info: stand_alone_ping: Node 10.0.1.9 is alive
> Feb 12 03:37:21 slave1 pingd: [4744]: info: send_update: 1 active ping nodes
> Feb 12 03:37:31 slave1 pingd: [4744]: info: stand_alone_ping: Node 10.0.1.9 is alive
> Feb 12 03:37:31 slave1 pingd: [4744]: info: send_update: 1 active ping nodes
> Feb 12 03:37:41 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:37:47 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:37:53 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:37:59 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:38:05 slave1 pingd: [4744]: info: ping_read: Retrying...
> Feb 12 03:38:11 slave1 pingd: [4744]: info: send_update: 0 active ping nodes
> Feb 12 03:38:16 slave1 attrd: [4423]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd
> Feb 12 03:38:16 slave1 attrd: [4423]: info: attrd_ha_callback: flush message from slave1.ab.ns.dns.be
>
> In this case the second node was still running and took over the IP
> resources as defined by the constraints of my setup.
>
> But then at 04:53:40 the second node started behaving in the same way.
> Now the IP resources just get shut down and my cluster is dead. There
> are two strange things about this. First, I was still able to ping the
> host I defined in the pingd config (it's the gateway to the internet)
> using the standard ping command. Second, the time difference between
> both nodes stopping is close to the time difference between starting
> heartbeat on both nodes (about 4 hours difference).
>
> These two things give me the impression that there was something wrong
> with pingd and not with the network, and that pingd stops working
> after a certain period. I use the heartbeat and pacemaker packages
> from the openSUSE build service; these are the versions installed:
>
> heartbeat-common-2.99.2-6.1
> heartbeat-resources-2.99.2-6.1
> libheartbeat2-2.99.2-6.1
> heartbeat-2.99.2-6.1
> libpacemaker3-1.0.1-3.1
> pacemaker-1.0.1-3.1
>
> This is my cib.xml:
>
> <cib admin_epoch="0" epoch="104" num_updates="0" validate-with="pacemaker-1.0"
>      have-quorum="true" crm_feature_set="3.0"
>      dc-uuid="f39c6a4f-816e-419e-8499-ddb61e8c3515"
>      cib-last-written="Wed Feb 4 10:17:15 2009">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="0"/>
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>     </nodes>
>     <resources>
>       <clone id="clone_named1">
>         <meta_attributes id="clone1_meta_attrs">
>           <nvpair id="clone1_metaattr_clone_max" name="clone-max" value="2"/>
>           <nvpair id="clone1_metaattr_clone_node_max" name="clone-node-max" value="1"/>
>         </meta_attributes>
>         <primitive id="resource_named1" class="ocf" type="named" provider="dns">
>           <operations>
>             <op id="resource_named1_monitor" name="monitor" interval="60s"/>
>             <op id="resource_named1_start" name="start" interval="0s" timeout="120s"/>
>             <op id="resource_named1_stop" name="stop" interval="0s" timeout="120s"/>
>           </operations>
>           <meta_attributes id="primitive-resource_named1.meta"/>
>           <instance_attributes id="resource_named1_instance_attrs">
>             <nvpair id="resource_named1_ip" name="ip" value="10.0.1.130,10.0.1.131"/>
>           </instance_attributes>
>         </primitive>
>       </clone>
>       <clone id="clone_ip200">
>         <meta_attributes id="clone2_meta_attrs">
>           <nvpair id="clone2_metaattr_clone_max" name="clone-max" value="2"/>
>           <nvpair id="clone2_metaattr_clone_node_max" name="clone-node-max" value="2"/>
>           <nvpair id="clone_ip200_metaattr_target_role" name="target-role" value="started"/>
>           <nvpair id="clone_ip200-meta_attributes-resource-stickiness" name="resource-stickiness" value="0"/>
>         </meta_attributes>
>         <primitive id="resource_ip200" class="ocf" type="IPaddr2" provider="heartbeat">
>           <operations>
>             <op id="resource_ip200_start" name="start" interval="0s" timeout="30s"/>
>             <op id="resource_ip200_stop" name="stop" interval="0s" timeout="30s"/>
>           </operations>
>           <instance_attributes id="resource_ip200_instance_attrs">
>             <nvpair id="resource_ip200_ip" name="ip" value="10.0.1.130"/>
>             <nvpair id="resource_ip200_hash" name="clusterip_hash" value="sourceip-sourceport"/>
>           </instance_attributes>
>           <meta_attributes id="resource_ip200_meta_attrs"/>
>         </primitive>
>       </clone>
>       <clone id="clone_ip201">
>         <meta_attributes id="clone201_meta_attrs">
>           <nvpair id="clone201_metaattr_clone_max" name="clone-max" value="2"/>
>           <nvpair id="clone201_metaattr_clone_node_max" name="clone-node-max" value="2"/>
>           <nvpair id="clone_ip201_metaattr_target_role" name="target-role" value="started"/>
>           <nvpair id="clone_ip201-meta_attributes-resource-stickiness" name="resource-stickiness" value="0"/>
>         </meta_attributes>
>         <primitive id="resource_ip201" class="ocf" type="IPaddr2" provider="heartbeat">
>           <operations>
>             <op id="resource_ip201_start" name="start" interval="0s" timeout="30s"/>
>             <op id="resource_ip201_stop" name="stop" interval="0s" timeout="30s"/>
>           </operations>
>           <instance_attributes id="resource_ip201_instance_attrs">
>             <nvpair id="resource_ip201_ip" name="ip" value="10.0.1.131"/>
>             <nvpair id="resource_ip201_hash" name="clusterip_hash" value="sourceip-sourceport"/>
>           </instance_attributes>
>           <meta_attributes id="resource_ip201_meta_attrs"/>
>         </primitive>
>       </clone>
>       <clone id="pingd-clone">
>         <primitive id="pingd" provider="pacemaker" class="ocf" type="pingd">
>           <instance_attributes id="pingd-attrs">
>             <nvpair id="pingd-dampen" name="dampen" value="5s"/>
>             <nvpair id="pingd-multiplier" name="multiplier" value="1000"/>
>             <nvpair id="pingd-hosts" name="host_list" value="10.0.1.9"/>
>           </instance_attributes>
>         </primitive>
>       </clone>
>     </resources>
>     <constraints>
>       <rsc_order id="named_before_ip200" first="clone_named1" then="clone_ip200" then-action="start" first-action="start"/>
>       <rsc_order id="ip200_before_ip201" first="clone_ip200" then="clone_ip201" then-action="start" first-action="start"/>
>       <rsc_location id="clone_ip200-connectivity" rsc="clone_ip200">
>         <rule id="pingd-200-exclude-rule-not-defined" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-200-exclude-not-defined" attribute="pingd" operation="not_defined"/>
>         </rule>
>         <rule id="pingd-200-exclude-rule-less-then-1" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-200-exclude-less-then-1" attribute="pingd" operation="lt" value="1"/>
>         </rule>
>       </rsc_location>
>       <rsc_location id="clone_ip201-connectivity" rsc="clone_ip201">
>         <rule id="pingd-201-exclude-rule-not-defined" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-201-exclude-not-defined" attribute="pingd" operation="not_defined"/>
>         </rule>
>         <rule id="pingd-201-exclude-rule-less-then-1" score="-INFINITY" boolean-op="or">
>           <expression id="pingd-201-exclude-less-then-1" attribute="pingd" operation="lt" value="1"/>
>         </rule>
>       </rsc_location>
>     </constraints>
>   </configuration>
> </cib>
>
> Has anyone seen this before? I have a second cluster setup ready that
> needs to be shipped out soon. I can use that one for testing if
> required. On the production cluster that failed I have currently
> removed the pingd resource and the relevant constraints.
>
> Thanks for your help,
> Tim
>
> P.S. Is there any way to make pingd less verbose? I don't really need
> to know each time it did a successful ping. A log entry is only
> required when things change and/or fail.
>
> --
> Tim Verhoeven - [email protected] - 0479 / 88 11 83
>
> Hoping the problem magically goes away by ignoring it is the
> "microsoft approach to programming" and should never be allowed.
> (Linus Torvalds)

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
