Hello all,

I have a 2-node, STONITH-enabled cluster with a v1 configuration, and it is working well, including the fencing. Fencing is done via IPMI to the remote management controller on the server's motherboard, which is connected to a switch. When I kill the heartbeat master process on one node, that node is rebooted by the other node.

However, when I perform the "hard takeover" test by unplugging the cables on the server and leaving them unplugged, the cluster falls into a loop: the IPMI reboot fails (because the second server is unplugged), and the surviving node retries it again and again. The logs state that the ipmitool process returned with error code 256.
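A note on the 256: if that number is a raw wait status (as returned by system() or wait()) rather than a plain exit code, which is only my assumption here, then it corresponds to exit code 1, i.e. an ordinary ipmitool failure. A minimal sketch of the decoding on a Unix system (the helper name decode_wait_status is hypothetical):

```python
import os


def decode_wait_status(status: int) -> str:
    """Describe a raw wait()/system() status word (Unix only)."""
    if os.WIFEXITED(status):
        # The exit code lives in the high byte of the status word.
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        return f"killed by signal {os.WTERMSIG(status)}"
    return "stopped or unknown status"


if __name__ == "__main__":
    # Assumption: the logged 256 is a raw wait status, not an exit code.
    print(decode_wait_status(256))
```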

I know this issue has been discussed here before, and the conclusion was that this is basically an unsolvable situation. I am wondering whether anyone else has faced this problem and come up with a solution. Is it possible to work around it by using a timeout for ipmitool or something similar?
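To make the timeout idea concrete, here is a rough sketch of what I have in mind: a wrapper that runs a command with a hard time limit and a bounded number of attempts, then reports failure instead of retrying forever. The function name run_with_timeout, its parameters, and the host and credentials in the commented example are all hypothetical, and I have not checked how such a wrapper would be hooked into heartbeat. Note that it only bounds the waiting: a timed-out fence is still a failed fence, so it returns False and does not pretend the node was reset.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s=10.0, max_attempts=3, delay_s=0.0):
    """Return True if cmd exits with 0 within timeout_s on any attempt.

    Gives up after max_attempts tries and returns False, so the caller
    sees a definite failure instead of an endless retry loop.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,  # child is killed when this expires
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # command hung, e.g. management controller unreachable
        except OSError:
            pass  # command could not be started at all
        if attempt < max_attempts:
            time.sleep(delay_s)
    return False


# Hypothetical usage; host, user and password are placeholders:
# fenced = run_with_timeout(
#     ["ipmitool", "-I", "lan", "-H", "10.0.0.2", "-U", "admin",
#      "-P", "secret", "chassis", "power", "reset"],
#     timeout_s=15, max_attempts=3, delay_s=5)
```

The open question remains what the cluster should do once this returns False, since the surviving node still cannot know whether the other node is really down.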

The whole cluster is on a UPS, so the "complete power outage" scenario is unlikely, but this still seems to me like a loophole in the STONITH design. For now, I have simply disabled STONITH.

Regards,
Peter
--
Peter LUCIAK ([email protected])
IBL Software Engineering, http://www.iblsoft.com/
Mierová 103, 82105 Bratislava, Slovakia
Phone: +421-2-32662111, Fax: +421-2-32662110
Direct: +421-2-32662175
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
