Hi, I've got an extremely frustrating problem, with apparently no real solution.
Background info: two nodes in Active/Passive, running DRBD and IET, with pacemaker handling resources and corosync as the messaging layer. Resource monitoring is enabled on all resources.

When it works, this is all fine. However, I've experienced lock-ups where a failure is not detected. The broken node does not respond to ping on its public management IP, I can't SSH to it, and I can't get any monitor output. But somehow, over the nodes' crossover interface, which is used for DRBD replication and corosync messaging, it is still responding: the crossover interface IP can be pinged from the passive partner.

I've got the following sysctl settings:

    sysctl -w kernel.panic_on_unrecovered_nmi=1
    sysctl -w kernel.panic_on_oops=1
    sysctl -w kernel.panic=1

When I simulate a kernel panic this works as expected, i.e. the node is rebooted. But the lock-ups we appear to be experiencing are not kernel panics.

I did have pingd running to detect external connectivity, but had to disable it because it was constantly failing over back and forth whenever one of our routers experienced high CPU load due to a DoS.

So I'm looking for a better way to detect node failure when pacemaker itself seems incapable :/

Regards,
James Smith
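
For anyone wanting to confirm that the three sysctl values quoted above are actually in effect on a node, a minimal sketch along the following lines might help. The function name check_panic_sysctls and its optional directory argument are made up for illustration (the argument only exists so the check can be pointed at a test directory); the default /proc/sys paths are the standard locations for these keys. It only reads the values and does not change anything.

```shell
#!/bin/sh
# Sketch only: report whether the three panic-related sysctls are set to 1.
# check_panic_sysctls and its optional root argument are hypothetical names.
check_panic_sysctls() {
    root="${1:-/proc/sys}"      # default: the live kernel settings
    failed=0
    for key in kernel/panic_on_unrecovered_nmi kernel/panic_on_oops kernel/panic; do
        file="$root/$key"
        if [ ! -r "$file" ]; then
            echo "$key: unreadable"
            failed=1
            continue
        fi
        value=$(cat "$file")
        if [ "$value" = "1" ]; then
            echo "$key: ok"
        else
            echo "$key: expected 1, got $value"
            failed=1
        fi
    done
    # Non-zero return means at least one key is missing or not set to 1.
    return "$failed"
}
```

Calling check_panic_sysctls with no argument checks the running kernel. Note that this only verifies the panic settings; it says nothing about the lock-ups described above, which are not kernel panics.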
