On Mon, 2008-12-22 at 19:42 +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Dec 22, 2008 at 12:18:06PM +0100, Tobias Appel wrote:
> > Hi,
> > 
> > Sorry to bug you guys again before Christmas, but I have a very weird
> > error.
> > I have a 2 node setup with drbd and Heartbeat 2.14. One resource group
> > which contains Nagios (something like BigBrother).
> > 
> > Now I configured everything and did some tests with starting and stopping
> > the heartbeat service on the servers - the failover worked.
> > 
> > But if I run 'shutdown -r now' on the active node the server will not
> > reboot and the resource group will not be moved to the passive node. 
> > When I run crm_mon I can see:
> >  nagios-core (lsb:nagios):   Started node01 (unmanaged) FAILED
> > 
> > The server will do nothing then. It will not reboot, the rest of the
> > resource group is still running! The log file from nagios tells me it
> > shut down correctly. I did browse through the big big ha-log but I
> > couldn't find anything that would help me.
> > 
> > pengine[27246]: 2008/12/22_11:47:11 WARN: unpack_rsc_op: Processing
> > failed op nagios-core_stop_0 on node01: Error
> > 
> > I really have no idea what to look for or what to do. 
> 
> A resource failed to stop. That's typically a reason to kill the
> node, but you probably don't have stonith setup. If a resource
> can't be stopped and there's no stonith enabled, then that
> resource can't be started anywhere.
> 
> Thanks,
> 
> Dejan

Hi,

and happy new year everybody - just came back from holiday.

You are right, I don't have stonith enabled because I don't fully
understand it yet. I know what it means and what it should do, but
I thought it only works as fencing in conjunction with a UPS or
fibre-channel switch device.

It is correct that the problem is that the resource cannot be stopped -
or at least the CRM thinks it cannot be stopped. I had the same problem
with the RedHat Cluster Software on the same server - it also could not
stop the nagios resource and the cluster ended up in a failed state.

Now what you are saying is that stonith would be my solution. When I
turn off one cluster node and the resource goes into an unmanaged state,
the other node could declare it as dead and go online?
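
To check my understanding, the rule Dejan describes could be sketched
roughly like this. It is only an illustrative model in Python, not
Heartbeat code: the names StopResult, Action and decide are made up for
the example, and the real CRM policy engine surely considers more than
these two inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical labels for this sketch; these are not CRM terms.
    START_ON_PEER = "stop succeeded, start the resource on the other node"
    FENCE_THEN_START_ON_PEER = "kill the node via stonith, then start on the other node"
    BLOCK = "leave the resource failed/unmanaged, start it nowhere"


@dataclass
class StopResult:
    resource: str
    node: str
    succeeded: bool


def decide(result: StopResult, stonith_enabled: bool) -> Action:
    """Model of the rule: a failed stop is only recoverable by fencing."""
    if result.succeeded:
        # The resource is known to be down, so moving it is safe.
        return Action.START_ON_PEER
    if stonith_enabled:
        # The resource state is unknown; killing the node makes it known.
        return Action.FENCE_THEN_START_ON_PEER
    # No way to be sure the resource is down, so it must not run twice.
    return Action.BLOCK


if __name__ == "__main__":
    failed_stop = StopResult("nagios-core", "node01", succeeded=False)
    print(decide(failed_stop, stonith_enabled=False).value)
    print(decide(failed_stop, stonith_enabled=True).value)
```

If that model is right, my current setup always ends up in the last
branch, which would match what crm_mon shows.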

Can anyone please point me to a stonith how-to which is not based on a
UPS or something like this? I also can't find much about it in the book
by Dr. Schwartzkopff :(
This would be really helpful.

Thanks in advance,

Tobias

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
