Hello all,
I have a 2-node, STONITH-enabled cluster with a v1 configuration, which
is working fine, including the fencing. The fencing is done via IPMI
to a remote management controller on the motherboard of the server,
which is connected to a switch. When I kill the heartbeat master process
on one node, it is rebooted by the other node.
However, when I perform the "hard takeover" test by unplugging the
cables on one server and leaving them unplugged, the cluster falls into
a loop: the IPMI reboot fails (because the other server is unplugged)
and is retried again and again. The logs state that the ipmitool
process returned with error code 256.
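A side note on the 256: POSIX exit codes only range from 0 to 255, so
this is probably a raw wait status rather than ipmitool's own exit code.
That is my assumption, not something the logs confirm. A minimal sketch
of decoding such a value (the function name decode_wait_status is made
up for illustration):

```python
import os


def decode_wait_status(status: int) -> str:
    """Describe a raw POSIX wait status, such as the 256 seen in the logs."""
    if os.WIFEXITED(status):
        # Normal termination: the exit code lives in the high byte.
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        # The process was terminated by a signal.
        return f"killed by signal {os.WTERMSIG(status)}"
    return f"unrecognized status {status}"


print(decode_wait_status(256))
```

If that reading is right, ipmitool simply exited with code 1, i.e. a
generic failure, which fits an unreachable management controller.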
I know this issue has been discussed here before, and the conclusion was
that this is basically an unsolvable situation. I am wondering if anyone
else has faced this problem and has come up with a solution. Is it
possible to work around it by using a timeout for ipmitool or something
similar?
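To make the timeout idea concrete, here is a rough sketch of what I have
in mind, written as a standalone Python wrapper. The helper names
(run_with_timeout, ipmi_power_reset), the 20-second default and the use
of the "lan" interface are my own assumptions; how such a wrapper would
be hooked into the STONITH plugin is not shown. Note that a timeout only
bounds how long each attempt blocks. It still has to report failure,
because an unreachable controller tells us nothing about whether the
node is really down.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (succeeded, detail). A timeout counts as failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False, f"timed out after {timeout_s}s"
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, "ok"


def ipmi_power_reset(host, user, password, timeout_s=20):
    """Hypothetical helper: ask the remote controller to reset the node."""
    cmd = [
        "ipmitool", "-I", "lan",
        "-H", host, "-U", user, "-P", password,
        "chassis", "power", "reset",
    ]
    return run_with_timeout(cmd, timeout_s)
```

A caller would treat (False, ...) as "fencing not confirmed" and must
not start resources on the surviving node on that basis alone.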
The whole cluster is on a UPS, so the "complete power outage" scenario
is unlikely, but still, this seems to me like a loophole in the STONITH
design. For now, I have simply disabled STONITH.
Regards,
Peter
--
Peter LUCIAK
IBL Software Engineering, http://www.iblsoft.com/
Mierová 103, 82105 Bratislava, Slovakia
Phone: +421-2-32662111, Fax: +421-2-32662110
Direct: +421-2-32662175
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems