On Mar 28, 2008, at 4:36 PM, Niels de Carpentier wrote:


The front003 has a hardware failure, so it is to be expected that the
stonith action will fail. (This is a custom stonith script, so there
might be some bugs left in it. The xen ocf script is also a custom one.)

The real problem is that it shows 2 resources running on the front003,
while this server is obviously offline. It should move the resources
to one of the other servers, but doesn't for some reason.

How can it? It's offline, remember.
Or at least it _appears_ offline, which is the whole point of
STONITH: to make _sure_ it's offline before starting the resources
elsewhere.

So until the STONITH command succeeds, the resources won't be moved. They
show up as running on that node because as far as the cluster can
confirm... they still are.
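
As a rough illustration of that rule (not the actual heartbeat/CRM code), here is a
minimal Python sketch. Everything in it is hypothetical: the Node class, the
recover() function, the fence callback, and the names front001, front002, vm1
and vm2 are made up for the example. Only front003 comes from this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    responding: bool
    fenced: bool = False
    resources: list = field(default_factory=list)


def recover(failed, survivors, fence):
    """Move resources off `failed` only once fencing has been confirmed.

    `fence` is a hypothetical callback that returns True only when the
    STONITH action really succeeded.
    """
    if failed.responding:
        return []  # node is healthy, nothing to recover
    if not failed.fenced:
        failed.fenced = bool(fence(failed))
    if not failed.fenced:
        # Fencing failed: the node's state is unknown, so its resources
        # are still counted as running there and must not be started twice.
        return []
    moved = []
    for i, res in enumerate(list(failed.resources)):
        target = survivors[i % len(survivors)]
        target.resources.append(res)
        moved.append((res, target.name))
    failed.resources.clear()
    return moved


front003 = Node("front003", responding=False, resources=["vm1", "vm2"])
others = [Node("front001", True), Node("front002", True)]
# A failing STONITH action moves nothing; a successful one frees both resources.
print(recover(front003, others, fence=lambda node: False))
print(recover(front003, others, fence=lambda node: True))
```

With a fence callback that always fails, which is the situation described here,
both resources stay listed on front003 indefinitely.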

Ok, I can understand the need to make really sure the server is offline.
Unfortunately, the stonith reset command will always fail in this case, as
the server is broken and cannot be turned on anymore.

Should the stonith reset command return a success even if the server
cannot be turned on anymore?

No - because it didn't perform the action.
Lie to the cluster and it will always bite you in the end - in this case,
when your iLO board (maybe even the network) fails and the node really is
still running in some capacity.
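
To make that concrete, here is a small Python sketch of how a reset handler in
a custom agent could decide its return code. It is only an assumption about
the shape of such a script: reset(), send_reset and confirm_reset are
hypothetical names, not part of any real stonith plugin API.

```python
def reset(send_reset, confirm_reset):
    """Return 0 only if the reset was both issued and confirmed, else 1.

    `send_reset` and `confirm_reset` are hypothetical callables that talk
    to the management board; they raise OSError when it is unreachable.
    """
    try:
        send_reset()
        confirmed = confirm_reset()
    except OSError:
        # Board or network unreachable: the node's state is unknown,
        # so claiming success here would be lying to the cluster.
        return 1
    return 0 if confirmed else 1
```

The point of the design is that "could not reach the board" and "the action
was not confirmed" both count as failure, never as success.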

This is the only way I can think of to get an
automatic failover in case of a hardware failure.

Then that's not a good stonith agent/hardware setup.
It's the same reason the SSH agent isn't recommended.

There is also no way to tell heartbeat manually that the server is
offline. This means that there seems to be no nice way to recover from
this situation.

Remove the node from the cluster?
hb_delnode does that, I think.