[Linux-HA] Network fails on primary node but secondary does not take over

Karl Pálsson Fri, 05 Oct 2007 07:02:46 -0700

Hi,

I have two nodes connected in a heartbeat cluster. On each node, eth0 carries 
normal traffic and eth1 carries the heartbeat. I simulate a network failure on 
the primary node (by unplugging the network cable on eth0) and expect heartbeat 
to fail over to the secondary node. This does not happen: the primary stays 
primary and the secondary stays secondary. The network router (which heartbeat 
is configured to ping) is on the same network as eth0.


/var/log/messages contains:
Oct  5 13:14:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:15:17 amhs-1 ntpd[2232]: synchronized to LOCAL(0), stratum 10
Oct  5 13:15:17 amhs-1 ntpd[2232]: kernel time sync enabled 0001
Oct  5 13:16:11 amhs-1 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Down
Oct  5 13:16:16 amhs-1 kernel: drbd0: PingAck did not arrive in time.
Oct  5 13:16:16 amhs-1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct  5 13:16:16 amhs-1 kernel: drbd0: Creating new current UUID
Oct  5 13:16:16 amhs-1 kernel: drbd0: asender terminated
Oct  5 13:16:16 amhs-1 kernel: drbd0: short read expecting header on sock: r=-512
Oct  5 13:16:16 amhs-1 kernel: drbd0: tl_clear()
Oct  5 13:16:16 amhs-1 kernel: drbd0: Connection closed
Oct  5 13:16:16 amhs-1 kernel: drbd0: Writing meta data super block now.
Oct  5 13:16:16 amhs-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Oct  5 13:16:16 amhs-1 kernel: drbd0: receiver terminated
Oct  5 13:16:16 amhs-1 kernel: drbd0: receiver (re)started
Oct  5 13:16:16 amhs-1 kernel: drbd0: conn( Unconnected -> WFConnection )
Oct  5 13:16:21 amhs-1 heartbeat: [2337]: WARN: node 10.10.10.8: is dead
Oct  5 13:16:21 amhs-1 heartbeat: [2337]: info: Link 10.10.10.8:10.10.10.8 dead.
Oct  5 13:16:21 amhs-1 crmd: [2510]: notice: crmd_ha_status_callback: Status update: Node 10.10.10.8 now has status [dead]
Oct  5 13:16:21 amhs-1 crmd: [2510]: WARN: get_uuid: Could not calculate UUID for 10.10.10.8
Oct  5 13:16:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:18:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:20:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:22:28 amhs-1 cib: [2506]: info: cib_stats: Processed 71 operations (422.00us average, 0% utilization) in the last 10min
Oct  5 13:22:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.



/etc/ha.d/ha.cf contains:
keepalive 1                          # How long between heartbeats
deadtime 10                          # How long-to-declare-host-dead?
warntime 5                           # How long before issuing "late heartbeat" warning?
initdead 40                          # Very first dead time (initdead)
udpport 694                          # Portnumber to use
auto_failback off                    # Remain on the node until that node fails
#watchdog /dev/watchdog               # If it does not beat for a minute the machine will reboot
node amhs-1.tern.is                  # Host, member of the cluster, must match uname -n
node amhs-2.tern.is                  # Host, member of the cluster, must match uname -n
bcast eth1                           # Broadcast heartbeats on eth1 interface
ping 10.10.10.8                      # Ping our router to monitor ethernet connectivity
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
use_logd yes
crm yes                              #Enable version 2 functionality supporting clusters with > 2 nodes

"ps ax" reveals that dopd is running.

The Heartbeat version is 2.1.2.

The OS is CentOS release 5.

Cib.xml is attached.

-- 
Best regards / Bestu kveðjur
Karl Palsson

 <cib admin_epoch="0" generated="false" have_quorum="false" ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" epoch="12" num_updates="2" cib-last-written="Fri Oct  5 13:12:28 2007">
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes>
           <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="true"/>
           <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
           <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="0"/>
           <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="0"/>
           <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
           <nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action" value="reboot"/>
           <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
           <nvpair id="cib-bootstrap-options-stop-orphan-actions" name="stop-orphan-actions" value="true"/>
           <nvpair id="cib-bootstrap-options-remove-after-stop" name="remove-after-stop" value="false"/>
           <nvpair id="cib-bootstrap-options-short-resource-names" name="short-resource-names" value="true"/>
           <nvpair id="cib-bootstrap-options-transition-idle-timeout" name="transition-idle-timeout" value="5min"/>
           <nvpair id="cib-bootstrap-options-default-action-timeout" name="default-action-timeout" value="15s"/>
           <nvpair id="cib-bootstrap-options-is-managed-default" name="is-managed-default" value="true"/>
           <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1189175050"/>
         </attributes>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node id="08c1a4c6-97fd-4528-8918-73ce91323664" uname="amhs-1.tern.is" type="normal"/>
       <node id="5884276e-dd43-44d9-88d6-cc7f7cd84ea4" uname="amhs-2.tern.is" type="normal"/>
     </nodes>
     <resources>
       <group id="group_1">
         <primitive class="ocf" id="IPaddr_10_10_10_220" provider="heartbeat" type="IPaddr">
           <operations>
             <op id="IPaddr_10_10_10_220_mon" interval="5s" name="monitor" timeout="5s"/>
           </operations>
           <instance_attributes id="IPaddr_10_10_10_220_inst_attr">
             <attributes>
               <nvpair id="IPaddr_10_10_10_220_attr_0" name="ip" value="10.10.10.220"/>
               <nvpair id="IPaddr_10_10_10_220_attr_1" name="netmask" value="24"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive class="heartbeat" id="drbddisk_2" provider="heartbeat" type="drbddisk">
           <operations>
             <op id="drbddisk_2_mon" interval="120s" name="monitor" timeout="60s"/>
           </operations>
           <instance_attributes id="drbddisk_2_inst_attr">
             <attributes>
               <nvpair id="drbddisk_2_attr_1" name="1" value="IsodeResource"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive class="ocf" id="Filesystem_3" provider="heartbeat" type="Filesystem">
           <operations>
             <op id="Filesystem_3_mon" interval="120s" name="monitor" timeout="60s"/>
           </operations>
           <instance_attributes id="Filesystem_3_inst_attr">
             <attributes>
               <nvpair id="Filesystem_3_attr_0" name="device" value="/dev/drbd0"/>
               <nvpair id="Filesystem_3_attr_1" name="directory" value="/Isode"/>
               <nvpair id="Filesystem_3_attr_2" name="fstype" value="ext3"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive class="heartbeat" id="eddy_4" provider="heartbeat" type="eddy">
           <operations>
             <op id="eddy_4_mon" interval="120s" name="monitor" timeout="60s"/>
           </operations>
           <instance_attributes id="eddy_4_instance_attrs">
             <attributes>
               <nvpair id="eddy_4_target_role" name="target_role" value="started"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive class="heartbeat" id="pumice_5" provider="heartbeat" type="pumice">
           <operations>
             <op id="pumice_5_mon" interval="120s" name="monitor" timeout="60s"/>
           </operations>
           <instance_attributes id="pumice_5_instance_attrs">
             <attributes>
               <nvpair id="pumice_5_target_role" name="target_role" value="started"/>
             </attributes>
           </instance_attributes>
         </primitive>
         <primitive class="heartbeat" id="pp_6" provider="heartbeat" type="pp">
           <operations>
             <op id="pp_6_mon" interval="120s" name="monitor" timeout="60s"/>
           </operations>
           <instance_attributes id="pp_6_instance_attrs">
             <attributes>
               <nvpair id="pp_6_target_role" name="target_role" value="started"/>
             </attributes>
           </instance_attributes>
         </primitive>
       </group>
     </resources>
     <constraints>
       <rsc_location id="rsc_location_group_1" rsc="group_1">
         <rule id="prefered_location_group_1" score="100">
           <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="amhs-1.tern.is"/>
         </rule>
       </rsc_location>
     </constraints>
   </configuration>
 </cib>
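
As a quick cross-check of the CIB above, the following is a minimal sketch that lists which node attributes the location constraints actually test. It assumes that connectivity-driven placement, if configured, would appear as a rule expression on a node attribute such as `pingd`; that attribute name is an assumption for illustration and does not occur anywhere in this configuration. The helper names `rule_attributes` and `uses_connectivity` are likewise hypothetical, and the XML is the constraints section copied from the CIB, trimmed to the relevant part.

```python
import xml.etree.ElementTree as ET

# Constraints section copied from the CIB above, trimmed to the relevant part.
CIB_SNIPPET = """
<cib>
  <configuration>
    <constraints>
      <rsc_location id="rsc_location_group_1" rsc="group_1">
        <rule id="prefered_location_group_1" score="100">
          <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="amhs-1.tern.is"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
</cib>
"""


def rule_attributes(cib_xml):
    """Return (constraint id, attribute) pairs for every rule expression."""
    root = ET.fromstring(cib_xml.strip())
    pairs = []
    for location in root.iter("rsc_location"):
        for expression in location.iter("expression"):
            pairs.append((location.get("id"), expression.get("attribute")))
    return pairs


def uses_connectivity(cib_xml, attr_name="pingd"):
    """True if any location rule tests attr_name (an assumed attribute name)."""
    return any(attr == attr_name for _, attr in rule_attributes(cib_xml))


if __name__ == "__main__":
    for constraint_id, attribute in rule_attributes(CIB_SNIPPET):
        print(constraint_id, "->", attribute)
    print("connectivity rule present:", uses_connectivity(CIB_SNIPPET))
```

Run against this CIB, the only attribute the constraints test is `#uname`, so nothing in the constraints section ties the placement of group_1 to the result of pinging 10.10.10.8.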
