[Linux-HA] heartbeat gets into weird state

Greg Woods Thu, 16 Oct 2008 09:47:41 -0700

I've been using heartbeat for years, since the 1.0 days, and I've never
seen anything quite like this before. I'm running
heartbeat-2.1.3-3.el5.centos (RPM from the CentOS standard repository)
on an x86_64 machine running (obviously) CentOS 5. I'm not using the v2
features, though; it's a standard v1 configuration. I have a shared
partition with DRBD. It is a dual-homed machine and both sides have a
heartbeat-managed service address. The system runs a freeradius server
(which listens on only one of the shared addresses, because otherwise
the radius responses can come from a different IP address than the one
the client sent its request to, which doesn't work) and some local
daemons that are started out of xinetd (both under heartbeat control).
In practice, up until yesterday afternoon, this has worked very well,
with failovers taking only a few seconds and everything coming up
properly.


Yesterday, we started getting calls that radius was not working. I tried
it and it worked fine. It took a while to figure out, but it turns out
that radius was working, but only for clients on the subnet directly
connected to the service address. The same was true of pings; I could
ping the service address only from the directly-connected subnet. So
this is not a radius issue. Sounds like a lost default route, right?
Wrong. The routing table looked fine. And, even weirder, I could ping
the local address of the same interface from off net. I could ping
www.google.com from the affected host. Only the service address was not
reachable from off net, but it worked fine for hosts on the local
subnet. 
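
To make the pattern concrete, here is a minimal Python sketch that splits
test clients into those on the service address's subnet and those beyond
a router, so the two groups can be pinged and compared separately. The
addresses are made up (taken from the documentation ranges) and the
helper name split_by_link is hypothetical; nothing here is specific to
heartbeat.

```python
import ipaddress

def split_by_link(service_cidr, clients):
    """Split client IPs into (on_link, off_link) relative to the service subnet.

    service_cidr is the service address with its prefix length,
    e.g. "192.0.2.10/24" (a made-up example address).
    """
    # ip_interface accepts a host address with a prefix; .network is its subnet.
    net = ipaddress.ip_interface(service_cidr).network
    on_link, off_link = [], []
    for client in clients:
        if ipaddress.ip_address(client) in net:
            on_link.append(client)
        else:
            off_link.append(client)
    return on_link, off_link

if __name__ == "__main__":
    on_link, off_link = split_by_link(
        "192.0.2.10/24", ["192.0.2.55", "198.51.100.7"]
    )
    print("directly connected:", on_link)
    print("reached via a router:", off_link)
```

In the failure described above, every client in the first group could
reach the service address and none in the second group could.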

I screwed around with this for a bit while the users continued to pound
on our customer service people, and finally decided to hell with it,
let's just fail over to the other machine and get things working again.
So I did a "service heartbeat stop" to cause a failover, and it hung on
the dreaded:

WARN: Shutdown delayed until current resource activity finishes

This basically hung forever until I hit the power button, at which point
the other machine took over and all has been well since.

But obviously I need to find out what happened here. Has anyone else
ever seen anything like this, where the service address only works on
the directly-connected subnet whereas the "home" address works from
anywhere?

I've also investigated the warning message, and all I see are people
asking about this and getting no answer, or being told it's a known bug
and they need to upgrade heartbeat. Is that the case for me too?

Thanks,
--Greg

