I've been using heartbeat for years, since the 1.0 days, and I've never seen anything quite like this before. I'm running heartbeat-2.1.3-3.el5.centos (the RPM from the standard CentOS repository) on an x86_64 machine running CentOS 5. I'm not using the v2 features; it's a standard v1 configuration with a shared DRBD partition. The machine is dual-homed, and each side has a heartbeat-managed service address. The system runs a freeradius server and some local daemons that are started out of xinetd (both under heartbeat control). freeradius listens on only one of the service addresses, because otherwise radius responses can come from a different IP address than the one the client sent its request to, which doesn't work. Up until yesterday afternoon, this setup worked very well, with failovers taking only a few seconds and everything coming up properly.
Yesterday, we started getting calls that radius was not working. I tried it and it worked fine. It took a while to figure out, but it turns out that radius was working, but only for clients on the subnet directly connected to the service address. The same was true of pings: I could ping the service address only from the directly connected subnet. So this is not a radius issue.
Sounds like a lost default route, right? Wrong. The routing table looked fine. Even weirder, I could ping the local ("home") address of the same interface from off net, and I could ping www.google.com from the affected host. Only the service address was unreachable from off net, and it still worked fine for hosts on the local subnet.
I screwed around with this for a bit while the users continued to pound on our customer service people, and finally decided to hell with it, let's just fail over to the other machine and get things working again. So I did a "service heartbeat stop" to cause a failover, and it hung on the dreaded:
WARN: Shutdown delayed until current resource activity finishes
It stayed hung until I hit the power button, at which point the other machine took over, and all has been well since.
But obviously I need to find out what happened here. Has anyone else ever seen anything like this, where the service address only works on the directly connected subnet while the "home" address works from anywhere? I've also looked into the warning message, and all I can find are people asking about it and either getting no answer or being told it's a known bug and they need to upgrade heartbeat. Is that the case for me too?
Thanks,
--Greg
