Zemke, Kai wrote:
> Hi,
>
> I'm running a two-node failover cluster. Yesterday the cluster tried to
> manage a state transition. In the log files I found the following entries:
>
> heartbeat[6905]: 2009/02/10_21:45:55 WARN: node nagios-drbd2: is dead
> heartbeat[6905]: 2009/02/10_21:45:55 info: Link nagios-drbd2:eth1 dead.
>
> A few minutes later the node that was still alive tried to take over the
> resources and created the following entries in the log file (the resource
> "ipaddress" is an example; there are a lot more entries for the other
> resources that were running on the cluster):
>
> pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Action resource_nagios_ipaddress_stop_0 on nagios-drbd2 is unrunnable (offline)
> pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Marking node nagios-drbd2 unclean
>
> Furthermore, there are several entries saying:
>
> stonithd[6916]: 2009/02/10_21:46:30 ERROR: Failed to STONITH the node nagios-drbd2: optype=RESET, op_result=TIMEOUT
>
> The stonith is running via ssh on a direct link between the two nodes. Since
> Node2 was down, the shutdown command never reached its destination.
Which is why ssh stonith is not meant for production.

> My questions are:
>
> Why did the alive cluster try to stop resources on a cluster node that is
> considered as dead?
>
> Why did STONITH try to shut down a node that is considered down? (for safety
> reasons, I think)

It is considered dead, but that does not have to be a fact. By shooting it,
the cluster turns the assumption into a fact (it turns the node off or
reboots it).

> Shouldn't the resources just be started on the alive node without any further
> action?

Not until the cluster "knows" the other node is dead. Who knows what's going
on there if it cannot be communicated with.

> Did I miss something in the default behaviour of heartbeat? Maybe a timeout?
>
> Would a hardware STONITH device solve such problems in the future?

Yes.

Regards
Dominik
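
P.S. To illustrate the reasoning above, here is a minimal sketch in Python. It
is not the actual heartbeat/pengine code; the names NodeState, may_take_over,
handle_lost_peer and the fence callable are all hypothetical. It only shows
why a failed STONITH (as with ssh to a node that is already unreachable)
leaves the node "unclean" and blocks the takeover.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    UNCLEAN = "unclean"  # presumed dead, but not confirmed
    FENCED = "fenced"    # confirmed down by a successful STONITH


def may_take_over(state):
    """Resources may be started elsewhere only once the peer is confirmed down."""
    return state is NodeState.FENCED


def handle_lost_peer(fence):
    """React to a lost heartbeat.

    `fence` is a hypothetical callable that returns True only if the
    fencing device confirmed the reset or power-off.
    """
    state = NodeState.UNCLEAN       # "is dead" in the log is only an assumption
    if fence():                     # shooting the node makes the assumption a fact
        state = NodeState.FENCED
    return state, may_take_over(state)
```

With a fence callable that times out (returns False), the result is
(NodeState.UNCLEAN, False): no takeover. With a device that can confirm the
reset independently of the dead node, the result is (NodeState.FENCED, True).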