Dejan,

I raised Bug 2392.
Thanks -

Simon.


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Dejan Muhamedagic
Sent: Friday, April 02, 2010 5:37 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] CRM hung, node wedged, but heartbeats still being sent

Hi,

On Wed, Mar 31, 2010 at 02:49:45PM -0400, Tavanyar, Simon wrote:
> I'm running 2.1.4 (please don't shoot).
> 
> A disk error managed to grind everything to a halt on my primary node.
> No software accessing disk was able to run. 
> 
> Nothing is being written to ha.log.  Resources are no longer
> responding to monitors, but the CRM is hung too, so it won't notice.
> 
> HOWEVER, we are still happily sending heartbeats  - so the other node
> never takes over.
> 
> We have a dead node in every way except one: it keeps telling the
> other heartbeat, "I'm alive"!
> 
> 1)      Has anyone else seen this type of failure?

Yes, there was a similar situation a few months ago, but
with a corosync/openais cluster. The load on the node was
so high that nothing could move, but the corosync process,
running at real-time priority, was happily sending
heartbeats, so there was no failover. Here's the thread:
http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/004739.html

> 2)      What ensures that a heartbeat will not be sent if CRM is
> hung/wedged?

Nothing. Heartbeat can only notice if the process leaves.
Otherwise, it's not monitored in any way. Perhaps Heartbeat
occasionally polling crmd could help in situations such as this
one. Another option is for crmd to also send a kind of heartbeat
to other members in the cluster.
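
To illustrate the first idea, here is a minimal sketch of gating
heartbeats on a liveness check-in from the monitored process. This is
not how Heartbeat or crmd work today; the LivenessGate class, its
method names and the max_silence parameter are all hypothetical and
only show the principle.

```python
import time


class LivenessGate:
    """Hypothetical gate: allow heartbeats only while the monitored
    process (e.g. a resource manager) has checked in recently."""

    def __init__(self, max_silence, clock=time.monotonic):
        # max_silence: seconds of silence tolerated before the
        # monitored process is considered hung (assumed parameter).
        self.max_silence = max_silence
        self.clock = clock
        self.last_checkin = clock()

    def checkin(self):
        # Called by (or on behalf of) the monitored process each time
        # it answers a poll or completes a unit of work.
        self.last_checkin = self.clock()

    def may_send_heartbeat(self):
        # The sender consults this before each heartbeat; once the
        # process has been silent too long, heartbeats stop and the
        # peer node can declare this node dead.
        return (self.clock() - self.last_checkin) <= self.max_silence
```

A sender loop would call may_send_heartbeat() before each heartbeat and
skip sending when it returns False. The check-in itself would have to
exercise whatever is wedged (disk access in the case above), otherwise
a hung node could still pass it.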

You should open a bugzilla for this issue.

Thanks,

Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems