Hi,

On Fri, Jan 09, 2009 at 10:34:22AM +0100, Tobias Appel wrote:
> On Mon, 2008-12-22 at 19:42 +0100, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Mon, Dec 22, 2008 at 12:18:06PM +0100, Tobias Appel wrote:
> > > Hi,
> > >
> > > sorry to bug you guys again before Christmas but I have a very weird
> > > error.
> > > I have a 2 node setup with drbd and Heartbeat 2.14. One resource group
> > > which contains Nagios (something like BigBrother).
> > >
> > > Now I configured everything and did some tests with starting and stopping
> > > heartbeat service on the servers - the failover worked.
> > >
> > > But if I run 'shutdown -r now' on the active node the server will not
> > > reboot and the resource group will not be moved to the passive node.
> > > When I run crm_mon I can see:
> > > nagios-core (lsb:nagios): Started node01 (unmanaged) FAILED
> > >
> > > The server will do nothing then. It will not reboot, the rest of the
> > > resource group is still running! The log file from nagios tells me it
> > > correctly shut down. I did browse through the big big ha-log but I
> > > couldn't find anything that would help me.
> > >
> > > pengine[27246]: 2008/12/22_11:47:11 WARN: unpack_rsc_op: Processing
> > > failed op nagios-core_stop_0 on node01: Error
> > >
> > > I really have no idea what to look for or what to do.
> >
> > A resource failed to stop. That's typically a reason to kill the
> > node, but you probably don't have stonith setup. If a resource
> > can't be stopped and there's no stonith enabled, then that
> > resource can't be started anywhere.
> >
> > Thanks,
> >
> > Dejan
>
> Hi,
>
> and happy new year everybody - just came back from holiday.
>
> You are right I don't have stonith enabled because I don't really
> understand it fully yet. I know what it means and what it should do but
> I thought it works as fencing in conjunction with a UPS or fibre-channel
> switch device.

Stonith is a method to fence nodes. Even though fencing and
stonith mean different things, the two terms are often used
interchangeably.

> It is correct that the problem is that the resource can not be stopped -
> or at least the CRM thinks it can not be stopped.

CRM works only with what the resource agent provides. If the RA
provides nonsense then I really doubt that your cluster would be
of much use.

> I had the same problem
> with the RedHat Cluster Software on the same server - it also could not
> stop the nagios resource and the cluster was in a failed state.
>
> Now what you are saying is that stonith would be my solution.

No. Stonith could only be used to resolve the situation. It's a
sort of ultimate tool to try to get to a sane state. Because if a
resource can't be stopped then there's nothing else one can do
but try to reboot the node. The real problem is why your resource
can't be stopped and you should resolve that first. Stonith
should be employed only in case of real failures, not to recover
buggy software.

> When I
> turn off one cluster node and the resource goes into an unmanaged state,
> the other node could declare it as dead and go online?

Yes.

> Can anyone please point me to a stonith how-to which is not based on a
> UPS or something like this?

http://www.linux-ha.org/STONITH
http://www.linux-ha.org/CIB/Idioms#head-588b6605fa1c0eb9ea07aec69ea4890ad078d5d2

Thanks,

Dejan

> I also can't find much about it in the book from
> Dr. Schwartzkopff :(
> This would be really helpful.
>
> Thanks in advance,
>
> Tobias
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
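
P.S. On "CRM works only with what the resource agent provides": for an lsb: resource the agent is just the init script, and the cluster judges a stop by the script's exit code. The LSB convention is that "stop" exits 0 (also when the service is already stopped) and "status" exits 3 for a stopped service. Below is a rough Python sketch for checking that by hand. The helper names and the /etc/init.d/nagios path in the usage note are my assumptions, not something taken from your setup, and the script really does stop the service, so only try it on a node where heartbeat is not managing nagios.

```python
import subprocess
import sys


def run_action(script, action):
    """Run one init script action and return its exit code."""
    result = subprocess.run(
        [script, action],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def check_stop(script, run=run_action):
    """Return a list of deviations from the LSB exit codes around 'stop'.

    'run' is injectable so the check can be tried against a fake
    script without touching a real service.
    """
    problems = []

    rc = run(script, "stop")
    if rc != 0:
        problems.append(f"stop exited with {rc}, expected 0")

    rc = run(script, "status")
    if rc != 3:
        problems.append(f"status after stop exited with {rc}, expected 3")

    # Stopping an already stopped service must still count as success.
    rc = run(script, "stop")
    if rc != 0:
        problems.append(f"stop on a stopped service exited with {rc}, expected 0")

    return problems


if __name__ == "__main__" and len(sys.argv) == 2:
    found = check_stop(sys.argv[1])
    for line in found:
        print(line)
    sys.exit(1 if found else 0)
```

Saved as, say, check_stop.py (a made-up name), it would be run as "python3 check_stop.py /etc/init.d/nagios". Any line it prints is a candidate reason for the failed nagios-core_stop_0 op; an empty result only says the exit codes look right, not that the script is otherwise sane.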
