Hi,

On Fri, Oct 05, 2007 at 01:50:20PM -0000, Karl Pálsson wrote:
> Hi,
>
> I have two nodes connected in a heartbeat cluster. They have
> eth0 intended for normal work and eth1 for heartbeat. I
> simulate network failure on the primary node (unplug network
> cable on eth0) and expect heartbeat to failover to the
> secondary node. This does not happen. The primary stays primary
> and the secondary stays ... secondary. The network router
> (which heartbeat is configured to ping) is on the same network
> as eth0.

You also have to configure pingd and a location constraint for
the failover to work. See:

http://linux-ha.org/pingd

or the mailing list archives.

> /var/log/messages contains:
> Oct 5 13:14:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the
> first line isn't read in. Maybe the heartbeat does not ouput string correctly
> for status operation. Or the code (myself) is wrong.

I suspect that this is due to your RA eddy not printing anything
to stdout on monitor/status. It should. Check

http://www.linux-ha.org/HeartbeatResourceAgent

Thanks,

Dejan

> Oct 5 13:15:17 amhs-1 ntpd[2232]: synchronized to LOCAL(0), stratum 10
> Oct 5 13:15:17 amhs-1 ntpd[2232]: kernel time sync enabled 0001
> Oct 5 13:16:11 amhs-1 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is
> Down
> Oct 5 13:16:16 amhs-1 kernel: drbd0: PingAck did not arrive in time.
> Oct 5 13:16:16 amhs-1 kernel: drbd0: peer( Secondary -> Unknown ) conn(
> Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 5 13:16:16 amhs-1 kernel: drbd0: Creating new current UUID
> Oct 5 13:16:16 amhs-1 kernel: drbd0: asender terminated
> Oct 5 13:16:16 amhs-1 kernel: drbd0: short read expecting header on sock:
> r=-512
> Oct 5 13:16:16 amhs-1 kernel: drbd0: tl_clear()
> Oct 5 13:16:16 amhs-1 kernel: drbd0: Connection closed
> Oct 5 13:16:16 amhs-1 kernel: drbd0: Writing meta data super block now.
> Oct 5 13:16:16 amhs-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
> Oct 5 13:16:16 amhs-1 kernel: drbd0: receiver terminated
> Oct 5 13:16:16 amhs-1 kernel: drbd0: receiver (re)started
> Oct 5 13:16:16 amhs-1 kernel: drbd0: conn( Unconnected -> WFConnection )
> Oct 5 13:16:21 amhs-1 heartbeat: [2337]: WARN: node 10.10.10.8: is dead
> Oct 5 13:16:21 amhs-1 heartbeat: [2337]: info: Link 10.10.10.8:10.10.10.8
> dead.
> Oct 5 13:16:21 amhs-1 crmd: [2510]: notice: crmd_ha_status_callback: Status
> update: Node 10.10.10.8 now has status [dead]
> Oct 5 13:16:21 amhs-1 crmd: [2510]: WARN: get_uuid: Could not calculate UUID
> for 10.10.10.8
> Oct 5 13:16:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the
> first line isn't read in. Maybe the heartbeat does not ouput string correctly
> for status operation. Or the code (myself) is wrong.
> Oct 5 13:18:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the
> first line isn't read in. Maybe the heartbeat does not ouput string correctly
> for status operation. Or the code (myself) is wrong.
> Oct 5 13:20:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the
> first line isn't read in. Maybe the heartbeat does not ouput string correctly
> for status operation. Or the code (myself) is wrong.
> Oct 5 13:22:28 amhs-1 cib: [2506]: info: cib_stats: Processed 71 operations
> (422.00us average, 0% utilization) in the last 10min
> Oct 5 13:22:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the
> first line isn't read in. Maybe the heartbeat does not ouput string correctly
> for status operation. Or the code (myself) is wrong.
>
>
>
> /etc/ha.d/ha.cf contains:
> keepalive 1 # How long between heartbeats
> deadtime 10 # How long-to-declare-host-dead?
> warntime 5 # How long before issuing "late
> heartbeat" warning?
> initdead 40 # Very first dead time (initdead)
> udpport 694 # Portnumber to use
> auto_failback off # Remain on the node until that node
> fails
> #watchdog /dev/watchdog # If it does not beat for a minute the
> machine will reboot
> node amhs-1.tern.is # Host, member of the cluster, must
> match uname -n
> node amhs-2.tern.is # Host, member of the cluster, must
> match uname -n
> bcast eth1 # Broadcast heartbeats on eth1 interface
> ping 10.10.10.8 # Ping our router to monitor ethernet
> connectivity
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
> use_logd yes
> crm yes #Enable version 2 functionality
> supporting clusters with > 2 nodes
>
> "ps ax" reveals that dopd is running.
>
> Heartbeat is of version 2.1.2.
>
> The OS is Centos release 5.
>
> Cib.xml is attached.
>
> --
> Best regards / Bestu kveðjur
> Karl Palsson
> Content-Description: cib.xml
> <cib admin_epoch="0" generated="false" have_quorum="false"
>     ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" epoch="12"
>     num_updates="2" cib-last-written="Fri Oct 5 13:12:28 2007">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <attributes>
>           <nvpair id="cib-bootstrap-options-symmetric-cluster"
>               name="symmetric-cluster" value="true"/>
>           <nvpair id="cib-bootstrap-options-no-quorum-policy"
>               name="no-quorum-policy" value="stop"/>
>           <nvpair id="cib-bootstrap-options-default-resource-stickiness"
>               name="default-resource-stickiness" value="0"/>
>           <nvpair
>               id="cib-bootstrap-options-default-resource-failure-stickiness"
>               name="default-resource-failure-stickiness" value="0"/>
>           <nvpair id="cib-bootstrap-options-stonith-enabled"
>               name="stonith-enabled" value="false"/>
>           <nvpair id="cib-bootstrap-options-stonith-action"
>               name="stonith-action" value="reboot"/>
>           <nvpair id="cib-bootstrap-options-stop-orphan-resources"
>               name="stop-orphan-resources" value="true"/>
>           <nvpair id="cib-bootstrap-options-stop-orphan-actions"
>               name="stop-orphan-actions" value="true"/>
>           <nvpair id="cib-bootstrap-options-remove-after-stop"
>               name="remove-after-stop" value="false"/>
>           <nvpair id="cib-bootstrap-options-short-resource-names"
>               name="short-resource-names" value="true"/>
>           <nvpair id="cib-bootstrap-options-transition-idle-timeout"
>               name="transition-idle-timeout" value="5min"/>
>           <nvpair id="cib-bootstrap-options-default-action-timeout"
>               name="default-action-timeout" value="15s"/>
>           <nvpair id="cib-bootstrap-options-is-managed-default"
>               name="is-managed-default" value="true"/>
>           <nvpair id="cib-bootstrap-options-last-lrm-refresh"
>               name="last-lrm-refresh" value="1189175050"/>
>         </attributes>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="08c1a4c6-97fd-4528-8918-73ce91323664" uname="amhs-1.tern.is"
>           type="normal"/>
>       <node id="5884276e-dd43-44d9-88d6-cc7f7cd84ea4" uname="amhs-2.tern.is"
>           type="normal"/>
>     </nodes>
>     <resources>
>       <group id="group_1">
>         <primitive class="ocf" id="IPaddr_10_10_10_220" provider="heartbeat"
>             type="IPaddr">
>           <operations>
>             <op id="IPaddr_10_10_10_220_mon" interval="5s" name="monitor"
>                 timeout="5s"/>
>           </operations>
>           <instance_attributes id="IPaddr_10_10_10_220_inst_attr">
>             <attributes>
>               <nvpair id="IPaddr_10_10_10_220_attr_0" name="ip"
>                   value="10.10.10.220"/>
>               <nvpair id="IPaddr_10_10_10_220_attr_1" name="netmask"
>                   value="24"/>
>             </attributes>
>           </instance_attributes>
>         </primitive>
>         <primitive class="heartbeat" id="drbddisk_2" provider="heartbeat"
>             type="drbddisk">
>           <operations>
>             <op id="drbddisk_2_mon" interval="120s" name="monitor"
>                 timeout="60s"/>
>           </operations>
>           <instance_attributes id="drbddisk_2_inst_attr">
>             <attributes>
>               <nvpair id="drbddisk_2_attr_1" name="1" value="IsodeResource"/>
>             </attributes>
>           </instance_attributes>
>         </primitive>
>         <primitive class="ocf" id="Filesystem_3" provider="heartbeat"
>             type="Filesystem">
>           <operations>
>             <op id="Filesystem_3_mon" interval="120s" name="monitor"
>                 timeout="60s"/>
>           </operations>
>           <instance_attributes id="Filesystem_3_inst_attr">
>             <attributes>
>               <nvpair id="Filesystem_3_attr_0" name="device"
>                   value="/dev/drbd0"/>
>               <nvpair id="Filesystem_3_attr_1" name="directory"
>                   value="/Isode"/>
>               <nvpair id="Filesystem_3_attr_2" name="fstype" value="ext3"/>
>             </attributes>
>           </instance_attributes>
>         </primitive>
>         <primitive class="heartbeat" id="eddy_4" provider="heartbeat"
>             type="eddy">
>           <operations>
>             <op id="eddy_4_mon" interval="120s" name="monitor"
>                 timeout="60s"/>
>           </operations>
>           <instance_attributes id="eddy_4_instance_attrs">
>             <attributes>
>               <nvpair id="eddy_4_target_role" name="target_role"
>                   value="started"/>
>             </attributes>
>           </instance_attributes>
>         </primitive>
>         <primitive class="heartbeat" id="pumice_5" provider="heartbeat"
>             type="pumice">
>           <operations>
>             <op id="pumice_5_mon" interval="120s" name="monitor"
>                 timeout="60s"/>
>           </operations>
>           <instance_attributes id="pumice_5_instance_attrs">
>             <attributes>
>               <nvpair id="pumice_5_target_role" name="target_role"
>                   value="started"/>
>             </attributes>
>           </instance_attributes>
>         </primitive>
>         <primitive class="heartbeat" id="pp_6" provider="heartbeat"
>             type="pp">
>           <operations>
>             <op id="pp_6_mon" interval="120s" name="monitor" timeout="60s"/>
>           </operations>
>           <instance_attributes id="pp_6_instance_attrs">
>             <attributes>
>               <nvpair id="pp_6_target_role" name="target_role"
>                   value="started"/>
>             </attributes>
>           </instance_attributes>
>         </primitive>
>       </group>
>     </resources>
>     <constraints>
>       <rsc_location id="rsc_location_group_1" rsc="group_1">
>         <rule id="prefered_location_group_1" score="100">
>           <expression attribute="#uname" id="prefered_location_group_1_expr"
>               operation="eq" value="amhs-1.tern.is"/>
>         </rule>
>       </rsc_location>
>     </constraints>
>   </configuration>
> </cib>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also:
http://linux-ha.org/ReportingProblems
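
For reference, the pingd setup recommended in the reply might look roughly like the fragment below. This is a minimal sketch, not a tested configuration. The constraint, rule and expression ids are made up for illustration, the `-m 100 -d 5s` pingd options are assumed values, and the pingd path is assumed to sit next to the dopd binary from the posted ha.cf. Only `group_1` and the existing `ping 10.10.10.8` directive come from the original post.

```xml
<!--
  Assumed addition to /etc/ha.d/ha.cf so that pingd runs and publishes
  a "pingd" node attribute (path and option values are assumptions):

    respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s
-->

<!-- Hypothetical constraint, to go inside the existing <constraints>
     section of the CIB next to rsc_location_group_1. -->
<rsc_location id="group_1_connected" rsc="group_1">
  <!-- Keep group_1 off any node where the pingd attribute is missing
       or shows that no ping node is reachable. -->
  <rule id="group_1_connected_rule" score="-INFINITY" boolean_op="or">
    <expression id="group_1_connected_undefined"
        attribute="pingd" operation="not_defined"/>
    <expression id="group_1_connected_zero"
        attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```

With a rule like this, a node whose `pingd` attribute is undefined or 0 gets a score of -INFINITY for group_1. That outweighs the existing preference of 100 for amhs-1.tern.is, so the group would be expected to move to a node that can still reach the router.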
