Hi,

I've got an extremely frustrating problem, with apparently no real solution.

Background info:

Two nodes in Active/Passive, running DRBD and IET, with Pacemaker handling
resources and Corosync as the messaging layer.  Resource monitoring is enabled
on all resources.

When it works, this is all fine.

However, I've experienced problems with lock-ups where the failure is not
detected.  The broken node does not respond to ping on its public management
IP, I can't SSH to it, and I can't get any output on a monitor.

But somehow the node is still responding over its crossover interface, which
is used for DRBD replication and Corosync messaging.  The crossover interface
IP can be pinged from the passive partner.

I've got the following sysctl settings applied:

sysctl -w kernel.panic_on_unrecovered_nmi=1 && \
sysctl -w kernel.panic_on_oops=1 && \
sysctl -w kernel.panic=1

When I simulate a kernel panic this works as expected, i.e. the node is
rebooted.  But the lock-ups we appear to be experiencing are not kernel
panics.

I did have pingd running to detect external connectivity, but had to disable
it because the cluster was constantly failing over back and forth whenever one
of our routers experienced high CPU load due to a DoS.  So I'm looking for a
better way to detect node failure, since Pacemaker itself seems incapable of
it :/
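
To illustrate the sort of check that might avoid the flapping described above,
here is a rough sketch only, not what pingd does internally.  It reports a loss
of connectivity only when every one of several targets fails for several
consecutive rounds, so a single overloaded router should not trigger a
failover.  The function names, the PROBE and ROUND_DELAY variables, and the
example addresses are all hypothetical; "ping -c 1 -W 2" assumes the Linux
iputils ping.

```shell
#!/bin/sh
# Sketch under stated assumptions; all names here are illustrative.
# PROBE is the command used to test one host; it can be overridden so the
# logic can be exercised without a network.
PROBE=${PROBE:-"ping -c 1 -W 2"}

# Returns 0 only if EVERY given target fails the probe.
all_targets_down() {
    for host in "$@"; do
        # A single reachable target means connectivity is still intact.
        if $PROBE "$host" >/dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# Returns 0 only if all targets are down for N consecutive rounds.
# Usage: connectivity_lost ROUNDS HOST [HOST...]
connectivity_lost() {
    rounds=$1
    shift
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Any round with a reachable target aborts the failure verdict.
        all_targets_down "$@" || return 1
        i=$((i + 1))
        if [ "$i" -lt "$rounds" ]; then
            sleep "${ROUND_DELAY:-5}"
        fi
    done
    return 0
}
```

For example, "connectivity_lost 3 192.0.2.1 192.0.2.2 192.0.2.3" (documentation
addresses, to be replaced with real targets on independent paths) would only
succeed after three rounds in which none of the three hosts answered.  Note
that this only addresses the flapping; it would not by itself detect a node
that is hung but still answering on the crossover link.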

Regards,
James Smith

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
