On Mon, 2008-12-22 at 19:42 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Dec 22, 2008 at 12:18:06PM +0100, Tobias Appel wrote:
> > Hi,
> >
> > sorry to bug you guys again before christmas but I have a very weird
> > error.
> > I have a 2 node setup with drbd and Heartbeat 2.14. One resource group
> > which contains Nagios (something like BigBrother).
> >
> > Now I configured everything and did some tests with starting and stoping
> > heartbeat service on the servers - the failover worked.
> >
> > But if I run 'shutdown -r now' on the active node the server will not
> > reboot and the resource group will not be moved to the passive node.
> > When I run crm_mon I can see:
> > nagios-core (lsb:nagios): Started node01 (unmanaged) FAILED
> >
> > The server will do nothing then. It will not reboot, the rest of the
> > resource group is still running! The log file from nagios tells me it
> > correctly shutdown. I did browse through the big big ha-log but I
> > couldn't find anything that would help me.
> >
> > pengine[27246]: 2008/12/22_11:47:11 WARN: unpack_rsc_op: Processing
> > failed op nagios-core_stop_0 on node01: Error
> >
> > I really have no idea what to look for or what to do.
>
> A resource failed to stop. That's typically a reason to kill the
> node, but you probably don't have stonith setup. If a resource
> can't be stopped and there's no stonith enabled, then that
> resource can't be started anywhere.
>
> Thanks,
>
> Dejan

Hi, and happy new year everybody - I just came back from holiday.

You are right, I don't have stonith enabled, because I don't fully
understand it yet. I know what it means and what it should do, but I
thought it only worked as fencing in conjunction with a UPS or a
fibre-channel switch.

It is correct that the problem is that the resource cannot be stopped -
or at least the CRM thinks it cannot be stopped. I had the same problem
with the RedHat Cluster Software on the same server - it also could not
stop the nagios resource, and the cluster ended up in a failed state.

Now, what you are saying is that stonith would be my solution. So when I
turn off one cluster node and the resource goes into an unmanaged state,
the other node could declare it dead and go online?

Can anyone please point me to a stonith how-to which is not based on a
UPS or something like that? I also can't find much about it in the book
from Dr. Schwartzkopff :(

This would be really helpful.

Thanks in advance,
Tobias
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
