On 05/21/2013 12:13 AM, Alan Robertson wrote:
> This was some kind of human-induced problem. They don't go away on 
> their own. Doing an ifdown/ifup on the main interface would do it. If 
> you're using DHCP (a really bad idea for an HA server) and it issued a
> new netmask for the IP, then that would probably do it too.
Exactly what happened.  You're right about human-induced problems. This 
server ran for about a year and a half without so much as a reboot 
before its first failover, which was when I accidentally turned off the 
UPS on the main system.  The failover was so smooth and fast that by the 
time I hit refresh on the browser that was already displaying a database 
query, it was already there.  No one in the factory even knew it failed 
over except me.  I really hate to mess with that kind of stability.  
Upgrading at this time is not in the cards.  I already have monitoring
programs that watch all my services.  I naively assumed that watching
the virtual IP was an integral part of the system and that heartbeat
would have handled it, but now I see that the virtual IP is just another
resource that needs to be monitored, and I will add that to my system.
Thanks for the info.
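
A minimal sketch of such a check, assuming a Linux node where the
iproute2 "ip" command is available and Python 3.7 or later. The function
names and the VIRTUAL_IP constant are hypothetical; the address is the
IPaddr entry from the haresources file quoted below. It only reports
whether the address is configured locally, so it is meaningful only on
the node that currently holds the resources.

```python
import subprocess

# Assumption: the virtual IP from the IPaddr:: entry in haresources.
VIRTUAL_IP = "129.196.140.14"


def ipv4_addresses(ip_output):
    """Return the set of IPv4 addresses found in `ip -o -4 addr show` output."""
    found = set()
    for line in ip_output.splitlines():
        fields = line.split()
        # Each one-line record has the form "... inet ADDRESS/PREFIX ...".
        if "inet" in fields:
            idx = fields.index("inet")
            if idx + 1 < len(fields):
                found.add(fields[idx + 1].split("/")[0])
    return found


def virtual_ip_present(vip, ip_output):
    """True if vip is one of the addresses listed in ip_output."""
    return vip in ipv4_addresses(ip_output)


def check_local(vip=VIRTUAL_IP):
    """Run `ip` on this node and report whether vip is configured."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return virtual_ip_present(vip, result.stdout)
```

An existing monitoring program could call check_local() on the active
node every few seconds and raise an alert, or trigger a manual
fail-over, when it returns False.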

> My guess is that someone did the ifdown/ifup to fix the netmask - from 
> what you said that would be necessary. And, that would definitely do 
> it. Pacemaker would have brought it back up again. The haresources 
> configuration doesn't monitor any services - so it doesn't know if 
> they're working or not -- it only monitors servers for up/down status. 
> It does what it does quite well - and it's very simple to set up. But 
> it doesn't do everything that Pacemaker does.
>>>> The network worked fine; the nodes could ping each other based on their
>>>> normal IPs and they could ping the ping node, but the virtual IP (the
>>>> one we REALLY care about) was ignored.  Nothing in the logs, no errors,
>>>> nothing.   Just an unresponsive virtual server.  A manual fail-over
>>>> brought it back quickly as the backup took over.  I.T. had done their
>>>> work on Sat and, had I checked our server on Sunday, I would have found
>>>> it "unreachable" with a normal ping.
>>>>
>>>> When my colleague called me, I asked him what "ifconfig" looked like.
>>>> He described three interfaces; eth0, eth1 and lo; no eth0:0. I had him
>>>> initiate the manual fail-over.
>>>>
>>>> After poring over the logs, unable to find anything that indicated a
>>>> problem, I tried to simulate the problem with "ifconfig eth0:0 down".
>>>> Sure enough, no fail-over, no errors, nothing; just (once again) an
>>>> unresponsive server.  "ifconfig eth0:0 <IP_ADDRESS> up" brought it right
>>>> back (I tried this last Saturday, BTW, when no one was working).  It
>>>> seems that heartbeat (ipfail?) creates this virtual interface when it
>>>> starts, then forgets about it.  I presume that the assumption is that if
>>>> eth0 remains intact, eth0:0 will remain intact, as well.
>>>>
>>>> Am I missing something in the configuration settings or docs?  I find
>>>> nothing about configuring the backup node to monitor the virtual
>>>> address, just the other node (which has a different IP and kept working
>>>> after the network changes).  I am about to set up a service to monitor
>>>> the virtual IP, but I wanted to check with the list, first, to see if
>>>> there's already been something built in that I have not configured
>>>> correctly.  I have used main.company.com and backup.company.com as the
>>>> two hostnames of the nodes.  Both systems have these names in an
>>>> /etc/hosts file, along with the hostname and IP of the virtual server
>>>> and the ping node.
>>>>
>>>> My configuration:
>>>>
>>>> /etc/ha.d/ha.cf:
>>>>
>>>> debugfile /var/log/ha-debug
>>>> logfile    /var/log/ha-log
>>>> logfacility    local0
>>>> keepalive 2
>>>> deadtime 10
>>>> warntime 3
>>>> initdead 120
>>>> udpport    694
>>>> baud    9600
>>>> serial    /dev/ttyS0
>>>> ucast eth1 10.0.0.1
>>>> ucast eth1 10.0.0.2
>>>> auto_failback off
>>>> node main.company.com backup.company.com
>>>> ping 129.196.140.130
>>>> respawn hacluster /usr/lib/heartbeat/ipfail
>>>> deadping 10
>>>>
>>>> /etc/ha.d/haresources
>>>>
>>>> main.company.com drbddisk::drbd_resource_0 Filesystem::/dev/drbd0::/usr0::ext3 mysql IPaddr::129.196.140.14 httpd smb MailTo::root
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Linux-HA mailing list
>>>> [email protected]
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>> See also: http://linux-ha.org/ReportingProblems
>>>>
