Re: [Linux-HA] Network fails on primary node but secondary does not take over

Dejan Muhamedagic Fri, 05 Oct 2007 07:20:56 -0700

Hi,

On Fri, Oct 05, 2007 at 01:50:20PM -0000, Karl Pálsson wrote:
> Hi,
> 
> I have two nodes connected in a heartbeat cluster. They have
> eth0 intended for normal work and eth1 for heartbeat. I
> simulate network failure on the primary node (unplug network
> cable on eth0) and expect heartbeat to failover to the
> secondary node. This does not happen. The primary stays primary
> and the secondary stays ... secondary. The network router
> (which heartbeat is configured to ping) is on the same network
> as eth0.


The ping directive in ha.cf is not enough on its own. You also
have to configure pingd and a location constraint for the
failover to work. See: http://linux-ha.org/pingd or the mailing
list archives.
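
For illustration, a constraint in the style of the examples on the
pingd page might look like the sketch below. It assumes pingd is
running and uses its default attribute name "pingd". The ids and the
output path are made up, and group_1 is taken from your cib.xml. The
script only writes the XML to a file; loading it into the CIB is left
as a comment because the exact command depends on your version.

```shell
#!/bin/sh
# Sketch only: write a pingd-based location constraint for group_1.
# Assumption: pingd is started from ha.cf with a line along the lines of
#   respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
# (see the pingd page for the exact options in your version).

# Hypothetical output path, chosen for this example.
OUT="${TMPDIR:-/tmp}/pingd-constraint.xml"

# The ids below are invented; "pingd" is the assumed attribute name.
cat > "$OUT" <<'EOF'
<rsc_location id="group_1_connected" rsc="group_1">
  <rule id="group_1_connected_rule" score="-INFINITY" boolean_op="or">
    <expression id="group_1_connected_undefined"
                attribute="pingd" operation="not_defined"/>
    <expression id="group_1_connected_zero"
                attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
EOF

echo "wrote $OUT"

# To load it, something like the following should work (not run here):
#   cibadmin -C -o constraints -x "$OUT"
```

The -INFINITY rule keeps group_1 off any node where the pingd
attribute is undefined or 0, that is, a node that cannot reach the
ping node.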

> /var/log/messages contains:
> Oct  5 13:14:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.

I suspect this is because your RA eddy prints nothing to stdout
on monitor/status. It should print the resource state there. See
http://www.linux-ha.org/HeartbeatResourceAgent
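
As a rough illustration (not taken from your eddy script, which I
have not seen), the status handler of a heartbeat-class RA could look
like the sketch below. The pid file path and the function names are
made up. The only point is that status prints "running" (or "OK")
when the resource is active and something else when it is not.

```shell
#!/bin/sh
# Hypothetical skeleton for the status part of a heartbeat-class RA.
# PIDFILE is an assumed location; a real agent would use its own check.
PIDFILE="${PIDFILE:-/var/run/eddy.pid}"

# Placeholder liveness test: the pid file exists and the process is alive.
is_running() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

# Always print one line to stdout so the caller has something to read.
ra_status() {
    if is_running; then
        echo "running"
    else
        echo "stopped"
    fi
}

case "${1:-}" in
    status) ra_status ;;
esac
```

Running the script with the status argument by hand on both nodes is
a quick way to see whether it prints anything at all.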

Thanks,

Dejan

> Oct  5 13:15:17 amhs-1 ntpd[2232]: synchronized to LOCAL(0), stratum 10
> Oct  5 13:15:17 amhs-1 ntpd[2232]: kernel time sync enabled 0001
> Oct  5 13:16:11 amhs-1 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Down
> Oct  5 13:16:16 amhs-1 kernel: drbd0: PingAck did not arrive in time.
> Oct  5 13:16:16 amhs-1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct  5 13:16:16 amhs-1 kernel: drbd0: Creating new current UUID
> Oct  5 13:16:16 amhs-1 kernel: drbd0: asender terminated
> Oct  5 13:16:16 amhs-1 kernel: drbd0: short read expecting header on sock: r=-512
> Oct  5 13:16:16 amhs-1 kernel: drbd0: tl_clear()
> Oct  5 13:16:16 amhs-1 kernel: drbd0: Connection closed
> Oct  5 13:16:16 amhs-1 kernel: drbd0: Writing meta data super block now.
> Oct  5 13:16:16 amhs-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
> Oct  5 13:16:16 amhs-1 kernel: drbd0: receiver terminated
> Oct  5 13:16:16 amhs-1 kernel: drbd0: receiver (re)started
> Oct  5 13:16:16 amhs-1 kernel: drbd0: conn( Unconnected -> WFConnection )
> Oct  5 13:16:21 amhs-1 heartbeat: [2337]: WARN: node 10.10.10.8: is dead
> Oct  5 13:16:21 amhs-1 heartbeat: [2337]: info: Link 10.10.10.8:10.10.10.8 dead.
> Oct  5 13:16:21 amhs-1 crmd: [2510]: notice: crmd_ha_status_callback: Status update: Node 10.10.10.8 now has status [dead]
> Oct  5 13:16:21 amhs-1 crmd: [2510]: WARN: get_uuid: Could not calculate UUID for 10.10.10.8
> Oct  5 13:16:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
> Oct  5 13:18:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
> Oct  5 13:20:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
> Oct  5 13:22:28 amhs-1 cib: [2506]: info: cib_stats: Processed 71 operations (422.00us average, 0% utilization) in the last 10min
> Oct  5 13:22:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
> 
> 
> 
> /etc/ha.d/ha.cf contains:
> keepalive 1                          # How long between heartbeats
> deadtime 10                          # How long-to-declare-host-dead?
> warntime 5                           # How long before issuing "late heartbeat" warning?
> initdead 40                          # Very first dead time (initdead)
> udpport 694                          # Portnumber to use
> auto_failback off                    # Remain on the node until that node fails
> #watchdog /dev/watchdog               # If it does not beat for a minute the machine will reboot
> node amhs-1.tern.is                  # Host, member of the cluster, must match uname -n
> node amhs-2.tern.is                  # Host, member of the cluster, must match uname -n
> bcast eth1                           # Broadcast heartbeats on eth1 interface
> ping 10.10.10.8                      # Ping our router to monitor ethernet connectivity
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
> use_logd yes
> crm yes                              #Enable version 2 functionality supporting clusters with  > 2 nodes
> 
> "ps ax" reveals that dopd is running.
> 
> Heartbeat is of version 2.1.2. 
> 
> The OS is Centos release 5.
> 
> Cib.xml is attached.
> 
> -- 
> Best regards / Bestu kveðjur
> Karl Palsson
> 

Content-Description: cib.xml
>  <cib admin_epoch="0" generated="false" have_quorum="false" 
> ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" epoch="12" 
> num_updates="2" cib-last-written="Fri Oct  5 13:12:28 2007">
>    <configuration>
>      <crm_config>
>        <cluster_property_set id="cib-bootstrap-options">
>          <attributes>
>            <nvpair id="cib-bootstrap-options-symmetric-cluster" 
> name="symmetric-cluster" value="true"/>
>            <nvpair id="cib-bootstrap-options-no-quorum-policy" 
> name="no-quorum-policy" value="stop"/>
>            <nvpair id="cib-bootstrap-options-default-resource-stickiness" 
> name="default-resource-stickiness" value="0"/>
>            <nvpair 
> id="cib-bootstrap-options-default-resource-failure-stickiness" 
> name="default-resource-failure-stickiness" value="0"/>
>            <nvpair id="cib-bootstrap-options-stonith-enabled" 
> name="stonith-enabled" value="false"/>
>            <nvpair id="cib-bootstrap-options-stonith-action" 
> name="stonith-action" value="reboot"/>
>            <nvpair id="cib-bootstrap-options-stop-orphan-resources" 
> name="stop-orphan-resources" value="true"/>
>            <nvpair id="cib-bootstrap-options-stop-orphan-actions" 
> name="stop-orphan-actions" value="true"/>
>            <nvpair id="cib-bootstrap-options-remove-after-stop" 
> name="remove-after-stop" value="false"/>
>            <nvpair id="cib-bootstrap-options-short-resource-names" 
> name="short-resource-names" value="true"/>
>            <nvpair id="cib-bootstrap-options-transition-idle-timeout" 
> name="transition-idle-timeout" value="5min"/>
>            <nvpair id="cib-bootstrap-options-default-action-timeout" 
> name="default-action-timeout" value="15s"/>
>            <nvpair id="cib-bootstrap-options-is-managed-default" 
> name="is-managed-default" value="true"/>
>            <nvpair id="cib-bootstrap-options-last-lrm-refresh" 
> name="last-lrm-refresh" value="1189175050"/>
>          </attributes>
>        </cluster_property_set>
>      </crm_config>
>      <nodes>
>        <node id="08c1a4c6-97fd-4528-8918-73ce91323664" uname="amhs-1.tern.is" 
> type="normal"/>
>        <node id="5884276e-dd43-44d9-88d6-cc7f7cd84ea4" uname="amhs-2.tern.is" 
> type="normal"/>
>      </nodes>
>      <resources>
>        <group id="group_1">
>          <primitive class="ocf" id="IPaddr_10_10_10_220" provider="heartbeat" 
> type="IPaddr">
>            <operations>
>              <op id="IPaddr_10_10_10_220_mon" interval="5s" name="monitor" 
> timeout="5s"/>
>            </operations>
>            <instance_attributes id="IPaddr_10_10_10_220_inst_attr">
>              <attributes>
>                <nvpair id="IPaddr_10_10_10_220_attr_0" name="ip" 
> value="10.10.10.220"/>
>                <nvpair id="IPaddr_10_10_10_220_attr_1" name="netmask" 
> value="24"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>          <primitive class="heartbeat" id="drbddisk_2" provider="heartbeat" 
> type="drbddisk">
>            <operations>
>              <op id="drbddisk_2_mon" interval="120s" name="monitor" 
> timeout="60s"/>
>            </operations>
>            <instance_attributes id="drbddisk_2_inst_attr">
>              <attributes>
>                <nvpair id="drbddisk_2_attr_1" name="1" value="IsodeResource"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>          <primitive class="ocf" id="Filesystem_3" provider="heartbeat" 
> type="Filesystem">
>            <operations>
>              <op id="Filesystem_3_mon" interval="120s" name="monitor" 
> timeout="60s"/>
>            </operations>
>            <instance_attributes id="Filesystem_3_inst_attr">
>              <attributes>
>                <nvpair id="Filesystem_3_attr_0" name="device" 
> value="/dev/drbd0"/>
>                <nvpair id="Filesystem_3_attr_1" name="directory" 
> value="/Isode"/>
>                <nvpair id="Filesystem_3_attr_2" name="fstype" value="ext3"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>          <primitive class="heartbeat" id="eddy_4" provider="heartbeat" 
> type="eddy">
>            <operations>
>              <op id="eddy_4_mon" interval="120s" name="monitor" 
> timeout="60s"/>
>            </operations>
>            <instance_attributes id="eddy_4_instance_attrs">
>              <attributes>
>                <nvpair id="eddy_4_target_role" name="target_role" 
> value="started"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>          <primitive class="heartbeat" id="pumice_5" provider="heartbeat" 
> type="pumice">
>            <operations>
>              <op id="pumice_5_mon" interval="120s" name="monitor" 
> timeout="60s"/>
>            </operations>
>            <instance_attributes id="pumice_5_instance_attrs">
>              <attributes>
>                <nvpair id="pumice_5_target_role" name="target_role" 
> value="started"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>          <primitive class="heartbeat" id="pp_6" provider="heartbeat" 
> type="pp">
>            <operations>
>              <op id="pp_6_mon" interval="120s" name="monitor" timeout="60s"/>
>            </operations>
>            <instance_attributes id="pp_6_instance_attrs">
>              <attributes>
>                <nvpair id="pp_6_target_role" name="target_role" 
> value="started"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>        </group>
>      </resources>
>      <constraints>
>        <rsc_location id="rsc_location_group_1" rsc="group_1">
>          <rule id="prefered_location_group_1" score="100">
>            <expression attribute="#uname" id="prefered_location_group_1_expr" 
> operation="eq" value="amhs-1.tern.is"/>
>          </rule>
>        </rsc_location>
>      </constraints>
>    </configuration>
>  </cib>

> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
