We are running heartbeat 2.1.3 on CentOS 5.4.  Last Monday AM, I 
received a call while getting ready for work.  Our high availability 
server was not responding.  The previous Saturday, our I.T. admins had 
re-configured the network to expand IP address ranges on some subnets.  
For whatever reason, this action caused our main server (in a two-node 
HA configuration) to lose its virtual interface, rendering our 
high-availability server unavailable.

The network worked fine; the nodes could ping each other on their 
normal IPs and they could ping the ping node, but the virtual IP (the 
one we REALLY care about) went unanswered.  Nothing in the logs, no 
errors, nothing.  Just an unresponsive virtual server.  A manual 
fail-over brought it back quickly as the backup took over.  I.T. had 
done their work on Saturday and, had I checked our server on Sunday, I 
would have found it "unreachable" with a normal ping.

When my colleague called me, I asked him what "ifconfig" looked like.  
He described three interfaces: eth0, eth1 and lo.  There was no eth0:0.  
I had him initiate the manual fail-over.

After poring over the logs, unable to find anything that indicated a 
problem, I tried to simulate the problem with "ifconfig eth0:0 down".  
Sure enough, no fail-over, no errors, nothing; just (once again) an 
unresponsive server.  "ifconfig eth0:0 <IP_ADDRESS> up" brought it right 
back (I tried this last Saturday, BTW, when no one was working).  It 
seems that heartbeat (ipfail?) creates this virtual interface when it 
starts, then forgets about it.  I presume that the assumption is that if 
eth0 remains intact, eth0:0 will remain intact, as well.
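
For anyone who wants to script the check my colleague did by eye, here 
is a minimal sketch.  It assumes the iproute "ip" command is installed 
and that aliases created by IPaddr show up in "ip addr" output; the 
function names and the "vipcheck" log tag are made up for this example 
and are not part of heartbeat.

```shell
#!/bin/sh
# Sketch only: report whether the virtual IP is still configured locally.
# VIP is the address from the IPaddr:: entry in haresources; the function
# names below are hypothetical, not heartbeat tools.
VIP="${VIP:-129.196.140.14}"

# Read "ip -o -4 addr show" style output on stdin and succeed only if
# the address in $1 is listed.  The comparison is exact, so
# 129.196.140.14 does not match 129.196.140.140.
vip_in_addr_list() {
    awk -v ip="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "inet") {
                    split($(i + 1), a, "/")
                    if (a[1] == ip) found = 1
                }
        }
        END { exit !found }'
}

# Run the same test against the live interface list.
vip_is_up() {
    ip -o -4 addr show | vip_in_addr_list "$1"
}

# Log to syslog when the address has gone missing.
check_vip() {
    if vip_is_up "$VIP"; then
        return 0
    fi
    logger -t vipcheck "virtual IP $VIP is not configured on $(hostname)"
    return 1
}
```

Calling check_vip from cron on the active node would at least put a 
line in syslog the next time eth0:0 vanishes.  What to do after that 
(alert, or start a fail-over) is deliberately left out.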

Am I missing something in the configuration settings or docs?  I find 
nothing about configuring the backup node to monitor the virtual 
address, just the other node (which has a different IP and kept working 
after the network changes).  I am about to set up a service to monitor 
the virtual IP, but I wanted to check with the list first, to see if 
there is already something built in that I have not configured 
correctly.  I have used main.company.com and backup.company.com as the 
two hostnames of the nodes.  Both systems have these names in an 
/etc/hosts file, along with the hostname and IP of the virtual server 
and the ping node.
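
The monitoring service I have in mind would look roughly like the 
sketch below, run on the backup node.  It is only an outline: the 
function names, the PROBE/MAX_FAILS/INTERVAL variables and the 
"vipwatch" log tag are all invented here, and the recovery step is left 
as a placeholder because I do not yet know the right way to trigger it.

```shell
#!/bin/sh
# Sketch only: watch the virtual IP from the other node.
# All names here are hypothetical; nothing below is a heartbeat feature.
VIP="${VIP:-129.196.140.14}"
MAX_FAILS="${MAX_FAILS:-3}"          # probes to try before giving up
INTERVAL="${INTERVAL:-10}"           # seconds between rounds
PROBE="${PROBE:-ping -c 1 -W 2}"     # one echo request, 2 second timeout

# Probe $1 up to $2 times; succeed as soon as one probe gets a reply.
vip_answers() {
    target=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        if $PROBE "$target" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
    done
    return 1
}

# Loop forever, logging each round in which the VIP stayed silent.
watch_vip() {
    while :; do
        if ! vip_answers "$VIP" "$MAX_FAILS"; then
            logger -t vipwatch "virtual IP $VIP did not answer $MAX_FAILS probes"
            # Placeholder: alert someone or start the manual fail-over here.
        fi
        sleep "$INTERVAL"
    done
}
```

Starting watch_vip in the background from an init script would be 
enough for a first test.  Requiring several failed probes in a row is 
meant to avoid reacting to a single dropped packet.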

My configuration:

/etc/ha.d/ha.cf:

debugfile /var/log/ha-debug
logfile    /var/log/ha-log
logfacility    local0
keepalive 2
deadtime 10
warntime 3
initdead 120
udpport    694
baud    9600
serial    /dev/ttyS0
ucast eth1 10.0.0.1
ucast eth1 10.0.0.2
auto_failback off
node main.company.com backup.company.com
ping 129.196.140.130
respawn hacluster /usr/lib/heartbeat/ipfail
deadping 10

/etc/ha.d/haresources:

main.company.com drbddisk::drbd_resource_0 Filesystem::/dev/drbd0::/usr0::ext3 mysql IPaddr::129.196.140.14 httpd smb MailTo::root




_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems