On 05/21/2013 12:13 AM, Alan Robertson wrote:
> This was some kind of human-induced problem. They don't go away on
> their own. Doing an ifdown/ifup on the main interface would do it. If
> you're using DHCP (a really bad idea for an HA server) and it issued a
> new netmask for the IP, then that would probably do it too.

Exactly what happened. You're right about human-induced problems. This
server ran for about a year and a half without so much as a reboot
before its first failover, which was when I accidentally turned off the
UPS on the main system. The failover was so smooth and fast that by the
time I hit refresh on a browser that was displaying a database query,
the backup had already taken over. No one in the factory even knew it
failed over except me.

I really hate to mess with that kind of stability. Upgrading at this
time is not in the cards. I already have monitoring programs that watch
all my services. I naively assumed that watching the virtual IP was an
integral part of the system and that heartbeat would have handled it,
but now I see that it is just another service that needs to be
monitored, and I will add that to my system.

Thanks for the info.

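For anyone who wants to add the same kind of check, here is a minimal sketch in Python of one way to watch the virtual IP on the node that currently holds the resources. It is not something heartbeat or ipfail provides. It assumes the iproute2 `ip` command is installed and parses the one-line-per-address output of `ip -o -4 addr show`. The function names are made up for illustration; the address is the one from the haresources line quoted below.

```python
import subprocess

# Virtual IP taken from the haresources line in this thread.
VIP = "129.196.140.14"


def addresses_from_ip_output(output):
    """Collect the IPv4 addresses listed in `ip -o -4 addr show` output."""
    found = set()
    for line in output.splitlines():
        fields = line.split()
        if "inet" in fields:
            idx = fields.index("inet")
            if idx + 1 < len(fields):
                # The field after "inet" looks like "129.196.140.14/24".
                found.add(fields[idx + 1].split("/")[0])
    return found


def vip_is_configured(output, vip=VIP):
    """Return True if the virtual IP appears in the given `ip` output."""
    return vip in addresses_from_ip_output(output)


def check_local_node(vip=VIP):
    """Run `ip` on this machine and report whether the virtual IP is up.

    Hypothetical helper: only meaningful on the node that is supposed to
    be holding the resources at the moment.
    """
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return vip_is_configured(result.stdout, vip)
```

An existing monitoring program could call `check_local_node()` on the active node and raise an alert, or trigger a manual failover, when it returns False. Pinging the virtual IP from the standby node is another option, and it would also catch the case where the address is configured but unreachable.
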
> My guess is that someone did the ifdown/ifup to fix the netmask - from
> what you said that would be necessary. And, that would definitely do
> it. Pacemaker would have brought it back up again. The haresources
> configuration doesn't monitor any services - so it doesn't know if
> they're working or not -- it only monitors servers for up/down status.
> It does what it does quite well - and it's very simple to set up. But
> it doesn't do everything that Pacemaker does.

>>>> The network worked fine; the nodes could ping each other based on their
>>>> normal IPs and they could ping the ping node, but the virtual IP (the
>>>> one we REALLY care about) was ignored. Nothing in the logs, no errors,
>>>> nothing. Just an unresponsive virtual server. A manual fail-over
>>>> brought it back quickly as the backup took over. I.T. had done their
>>>> work on Sat and, had I checked our server on Sunday, I would have found
>>>> it "unreachable" with a normal ping.
>>>>
>>>> When my colleague called me, I asked him what "ifconfig" looked like.
>>>> He described three interfaces: eth0, eth1 and lo; no eth0:0. I had him
>>>> initiate the manual fail-over.
>>>>
>>>> After poring over the logs, unable to find anything that indicated a
>>>> problem, I tried to simulate the problem with "ifconfig eth0:0 down".
>>>> Sure enough, no fail-over, no errors, nothing; just (once again) an
>>>> unresponsive server. "ifconfig eth0:0 <IP_ADDRESS> up" brought it right
>>>> back (I tried this last Saturday, BTW, when no one was working). It
>>>> seems that heartbeat (ipfail?) creates this virtual interface when it
>>>> starts, then forgets about it. I presume that the assumption is that if
>>>> eth0 remains intact, eth0:0 will remain intact, as well.
>>>>
>>>> Am I missing something in the configuration settings or docs?
>>>> I find nothing about configuring the backup node to monitor the virtual
>>>> address, just the other node (which has a different IP and kept working
>>>> after the network changes). I am about to set up a service to monitor
>>>> the virtual IP, but I wanted to check with the list, first, to see if
>>>> there's already been something built in that I have not configured
>>>> correctly. I have used main.company.com and backup.company.com as the
>>>> two hostnames of the nodes. Both systems have these names in an
>>>> /etc/hosts file, along with the hostname and IP of the virtual server
>>>> and the ping node.
>>>>
>>>> My configuration:
>>>>
>>>> /etc/ha.d/ha.cf:
>>>>
>>>> debugfile /var/log/ha-debug
>>>> logfile /var/log/ha-log
>>>> logfacility local0
>>>> keepalive 2
>>>> deadtime 10
>>>> warntime 3
>>>> initdead 120
>>>> udpport 694
>>>> baud 9600
>>>> serial /dev/ttyS0
>>>> ucast eth1 10.0.0.1
>>>> ucast eth1 10.0.0.2
>>>> auto_failback off
>>>> node main.company.com backup.company.com
>>>> ping 129.196.140.130
>>>> respawn hacluster /usr/lib/heartbeat/ipfail
>>>> deadping 10
>>>>
>>>> /etc/ha.d/haresources
>>>>
>>>> main.company.com drbddisk::drbd_resource_0 Filesystem::/dev/drbd0::/usr0::ext3 mysql IPaddr::129.196.140.14 httpd smb MailTo::root
>>>>
>>>> _______________________________________________
>>>> Linux-HA mailing list
>>>> [email protected]
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>> See also: http://linux-ha.org/ReportingProblems
