Hi,
I have two nodes in a Heartbeat cluster. On each node, eth0 carries normal
traffic and eth1 carries the heartbeat. I simulate a network failure on the
primary node (by unplugging the network cable on eth0) and expect Heartbeat to
fail over to the secondary node. This does not happen: the primary stays
primary and the secondary stays secondary. The network router (which Heartbeat
is configured to ping) is on the same network as eth0.
/var/log/messages contains:
Oct 5 13:14:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct 5 13:15:17 amhs-1 ntpd[2232]: synchronized to LOCAL(0), stratum 10
Oct 5 13:15:17 amhs-1 ntpd[2232]: kernel time sync enabled 0001
Oct 5 13:16:11 amhs-1 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Down
Oct 5 13:16:16 amhs-1 kernel: drbd0: PingAck did not arrive in time.
Oct 5 13:16:16 amhs-1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 5 13:16:16 amhs-1 kernel: drbd0: Creating new current UUID
Oct 5 13:16:16 amhs-1 kernel: drbd0: asender terminated
Oct 5 13:16:16 amhs-1 kernel: drbd0: short read expecting header on sock: r=-512
Oct 5 13:16:16 amhs-1 kernel: drbd0: tl_clear()
Oct 5 13:16:16 amhs-1 kernel: drbd0: Connection closed
Oct 5 13:16:16 amhs-1 kernel: drbd0: Writing meta data super block now.
Oct 5 13:16:16 amhs-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Oct 5 13:16:16 amhs-1 kernel: drbd0: receiver terminated
Oct 5 13:16:16 amhs-1 kernel: drbd0: receiver (re)started
Oct 5 13:16:16 amhs-1 kernel: drbd0: conn( Unconnected -> WFConnection )
Oct 5 13:16:21 amhs-1 heartbeat: [2337]: WARN: node 10.10.10.8: is dead
Oct 5 13:16:21 amhs-1 heartbeat: [2337]: info: Link 10.10.10.8:10.10.10.8 dead.
Oct 5 13:16:21 amhs-1 crmd: [2510]: notice: crmd_ha_status_callback: Status update: Node 10.10.10.8 now has status [dead]
Oct 5 13:16:21 amhs-1 crmd: [2510]: WARN: get_uuid: Could not calculate UUID for 10.10.10.8
Oct 5 13:16:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct 5 13:18:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct 5 13:20:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct 5 13:22:28 amhs-1 cib: [2506]: info: cib_stats: Processed 71 operations (422.00us average, 0% utilization) in the last 10min
Oct 5 13:22:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
/etc/ha.d/ha.cf contains:
keepalive 1 # How long between heartbeats
deadtime 10 # How long-to-declare-host-dead?
warntime 5 # How long before issuing "late heartbeat" warning?
initdead 40 # Very first dead time (initdead)
udpport 694 # Portnumber to use
auto_failback off # Remain on the node until that node fails
#watchdog /dev/watchdog # If it does not beat for a minute the machine will reboot
node amhs-1.tern.is # Host, member of the cluster, must match uname -n
node amhs-2.tern.is # Host, member of the cluster, must match uname -n
bcast eth1 # Broadcast heartbeats on eth1 interface
ping 10.10.10.8 # Ping our router to monitor ethernet connectivity
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
use_logd yes
crm yes # Enable version 2 functionality supporting clusters with > 2 nodes
"ps ax" reveals that dopd is running.
The Heartbeat version is 2.1.2.
The OS is CentOS release 5.
cib.xml is attached.
--
Best regards / Bestu kveðjur
Karl Palsson
<cib admin_epoch="0" generated="false" have_quorum="false" ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" epoch="12" num_updates="2" cib-last-written="Fri Oct 5 13:12:28 2007">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="true"/>
          <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
          <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="0"/>
          <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="0"/>
          <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
          <nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action" value="reboot"/>
          <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
          <nvpair id="cib-bootstrap-options-stop-orphan-actions" name="stop-orphan-actions" value="true"/>
          <nvpair id="cib-bootstrap-options-remove-after-stop" name="remove-after-stop" value="false"/>
          <nvpair id="cib-bootstrap-options-short-resource-names" name="short-resource-names" value="true"/>
          <nvpair id="cib-bootstrap-options-transition-idle-timeout" name="transition-idle-timeout" value="5min"/>
          <nvpair id="cib-bootstrap-options-default-action-timeout" name="default-action-timeout" value="15s"/>
          <nvpair id="cib-bootstrap-options-is-managed-default" name="is-managed-default" value="true"/>
          <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1189175050"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="08c1a4c6-97fd-4528-8918-73ce91323664" uname="amhs-1.tern.is" type="normal"/>
      <node id="5884276e-dd43-44d9-88d6-cc7f7cd84ea4" uname="amhs-2.tern.is" type="normal"/>
    </nodes>
    <resources>
      <group id="group_1">
        <primitive class="ocf" id="IPaddr_10_10_10_220" provider="heartbeat" type="IPaddr">
          <operations>
            <op id="IPaddr_10_10_10_220_mon" interval="5s" name="monitor" timeout="5s"/>
          </operations>
          <instance_attributes id="IPaddr_10_10_10_220_inst_attr">
            <attributes>
              <nvpair id="IPaddr_10_10_10_220_attr_0" name="ip" value="10.10.10.220"/>
              <nvpair id="IPaddr_10_10_10_220_attr_1" name="netmask" value="24"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="heartbeat" id="drbddisk_2" provider="heartbeat" type="drbddisk">
          <operations>
            <op id="drbddisk_2_mon" interval="120s" name="monitor" timeout="60s"/>
          </operations>
          <instance_attributes id="drbddisk_2_inst_attr">
            <attributes>
              <nvpair id="drbddisk_2_attr_1" name="1" value="IsodeResource"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="ocf" id="Filesystem_3" provider="heartbeat" type="Filesystem">
          <operations>
            <op id="Filesystem_3_mon" interval="120s" name="monitor" timeout="60s"/>
          </operations>
          <instance_attributes id="Filesystem_3_inst_attr">
            <attributes>
              <nvpair id="Filesystem_3_attr_0" name="device" value="/dev/drbd0"/>
              <nvpair id="Filesystem_3_attr_1" name="directory" value="/Isode"/>
              <nvpair id="Filesystem_3_attr_2" name="fstype" value="ext3"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="heartbeat" id="eddy_4" provider="heartbeat" type="eddy">
          <operations>
            <op id="eddy_4_mon" interval="120s" name="monitor" timeout="60s"/>
          </operations>
          <instance_attributes id="eddy_4_instance_attrs">
            <attributes>
              <nvpair id="eddy_4_target_role" name="target_role" value="started"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="heartbeat" id="pumice_5" provider="heartbeat" type="pumice">
          <operations>
            <op id="pumice_5_mon" interval="120s" name="monitor" timeout="60s"/>
          </operations>
          <instance_attributes id="pumice_5_instance_attrs">
            <attributes>
              <nvpair id="pumice_5_target_role" name="target_role" value="started"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive class="heartbeat" id="pp_6" provider="heartbeat" type="pp">
          <operations>
            <op id="pp_6_mon" interval="120s" name="monitor" timeout="60s"/>
          </operations>
          <instance_attributes id="pp_6_instance_attrs">
            <attributes>
              <nvpair id="pp_6_target_role" name="target_role" value="started"/>
            </attributes>
          </instance_attributes>
        </primitive>
      </group>
    </resources>
    <constraints>
      <rsc_location id="rsc_location_group_1" rsc="group_1">
        <rule id="prefered_location_group_1" score="100">
          <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="amhs-1.tern.is"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
</cib>
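
The only constraint in the cib.xml above is the location preference for
amhs-1.tern.is; nothing in it refers to the ping node. For comparison, here is
a minimal sketch of a connectivity-based location rule of the kind used with
"crm yes". It is not part of my configuration. It assumes that the pingd
daemon is running on both nodes (for example started by a respawn line in
ha.cf) and that it publishes a node attribute named "pingd"; the ids are
hypothetical names chosen for illustration.

```xml
<!-- Sketch only: not part of the attached cib.xml. -->
<!-- Assumes pingd is running and sets a node attribute named "pingd". -->
<!-- All ids below are hypothetical. -->
<rsc_location id="group_1_connectivity" rsc="group_1">
  <!-- Keep group_1 off any node that cannot reach the ping node. -->
  <rule id="group_1_connectivity_rule" score="-INFINITY" boolean_op="or">
    <expression id="group_1_connectivity_undefined" attribute="pingd" operation="not_defined"/>
    <expression id="group_1_connectivity_zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```

If this assumption holds, such a rule would sit inside <constraints> next to
rsc_location_group_1, and the -INFINITY score would outweigh the score of 100
that currently prefers amhs-1.tern.is.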
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems