On May 14, 2007, at 2:36 PM, Alan Robertson wrote:

Dejan Muhamedagic wrote:
On Fri, May 11, 2007 at 10:04:11AM +0200, Th.Paschy, hepasoft oHG wrote:
Hi all,

I am a new user of heartbeat.

I configured an active/passive cluster with two Dell PE1900s based on SUSE Linux with heartbeat 2.0.8 (R1-style). After some problems with DRBD resources after a cold reset of the master node (locks not removed), which were fixed
by Philipp Reisner last weekend, everything works fine.

Next I looked for a STONITH module for the Dell Remote Access Controller DRAC 5, but I found only one for the DRAC 3. In the DRAC 5 the layout of the embedded web interface has changed, so the DRAC 3 module
won't work.

So I've written my own module, strongly based on the apcmaster module. The module uses the SM-CLP command line interface of the DRAC 5 via telnet. I'm
really not a good C programmer, but it works perfectly.

But there is one problem (which the DRAC 3 module has as well): if the server loses its power connection, the remote access card becomes
inaccessible, the fencing process never completes, and no
resource takeover takes place unless you manually take corrective action.
A redundant power supply is therefore strongly recommended.

I've seen that other users are looking for a drac5 module too, so I've
attached the source of the drac5 module.

Thanks for the contribution! Alan will probably want to do the usual
legal chanting.

I'll send you an email on this.

I would be glad if someone could tell me how to handle the described problem of never-ending fencing when access to the DRAC is
lost (because of power loss or network failure).

Unfortunately, there's no workaround. If heartbeat cannot stonith
the node, it will go on trying forever. If stonith is configured,
we must make sure that the node is rebooted or shut down. If the
stonith device is not accessible, well, too bad. The UPS-based
stonith devices are definitely preferable to the lights-out
embedded kind.

Sometime in the past, I asked Andrew for a feature which would allow the takeover to proceed after a certain number of failed STONITHs, if things were configured to allow that. I don't remember whether he did that or not.

It never got implemented.


For these kinds of cases, it seems like a good thing.

I disagree - remember the "you can't make it up" part of "You don't know what you don't know". In general, you don't know that the node is dead, only that the stonith device is...

In the case of this plugin, apparently, the stonith device being dead implies the host is also. This makes me inclined to think that the plugin is therefore a "better" place to implement such behavior.

It also means that such a feature could be turned on for individual stonith devices rather than unilaterally, which may not be a good idea, especially in mixed-stonith-device environments.


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
