On Tuesday 15 June 2010, Dejan Muhamedagic wrote:
> Hi,
>
> On Tue, Jun 15, 2010 at 02:25:51PM -0600, Dan Urist wrote:
> > On Tue, 15 Jun 2010 22:08:37 +0200
> >
> > Dejan Muhamedagic <deja...@fastmail.fm> wrote:
> > > Hi,
> > >
> > > On Tue, Jun 15, 2010 at 01:15:08PM -0600, Dan Urist wrote:
> > > > I've recently had exactly the same thing happen. One (highly
> > > > kludgey!) solution I've considered is hacking a custom version of
> > > > the stonith IPMI agent that would check whether the node was at all
> > > > reachable following a stonith failure via any of the cluster
> > > > interfaces reported by cl_status (I have redundant network links),
> > > > and then return true (i.e. pretend the stonith succeeded) if it
> > > > isn't. Since this is basically the logic I would use if I were
> > > > trying to debug the issue remotely, I don't see that this would be
> > > > any worse.
> > > >
> > > > Besides the obvious (potential, however unlikely, for split-brain),
> > > > is there any reason this approach wouldn't work?
> > >
> > > That sounds like a reason good enough to me :) If you can't reach
> > > the host, you cannot know its state.
> >
> > But in my case, if the live node can't reach the suspect node via its
> > public network interface, its private bonded interface, or its IPMI
> > card (I've added a ping test for that, to determine that it's actually
> > unreachable rather than just failing), it seems pretty reasonable for
> > me to assume it's really dead at that point.
>
> Perhaps somebody just pulled the network cables. I understand
> that it's not unheard of.
The network driver also may have crashed. And if it's shared-NIC IPMI (*), the network driver may also have brought down IPMI.

Of course, I also see the problem of a complete server failure and the need to recover from it automatically. Besides a better stonith device, the only solution I see for it would be a new parameter to make Pacemaker assume a node is dead if not a single network access succeeds, even though stonith fails. Of course, that should default to off, and probably should only be possible to enable by adding something like "really_enable_parameter = I-know-exactly-what-I-do-and-accept-possible-split-brain-and-data-corruption".

For example, with Lustre's multiple-mount protection, split brain *shouldn't* be a problem, although I never trust a single feature only ;)

Cheers,
Bernd

PS: (*) Sales managers who buy those IPMI-shared-NIC solutions, and people from companies who sell them, should be punished and should work rotating 24-hour shifts in server rooms and take over the IPMI reset part ;)
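
[To make the idea in this thread concrete, here is a minimal sketch of the check that both Dan's wrapper agent and Bernd's proposed parameter boil down to: only report a failed fence as a success after the node has failed a ping test on every known path. The addresses and the real fencing call below are placeholders, not the actual external/ipmi agent; in Dan's setup the interface list would come from cl_status and the cluster configuration.]

#!/usr/bin/env python3
# Illustrative sketch only -- NOT a drop-in stonith agent. Addresses and
# the fencing call are hypothetical placeholders; adapt before any use.
import subprocess
import sys

# Hypothetical addresses for the suspect node: public NIC, private
# bonded link, and the IPMI card (Dan's three reachability paths).
ADDRESSES = {
    "public NIC": "192.0.2.10",
    "bonded NIC": "198.51.100.10",
    "IPMI card":  "203.0.113.10",
}

def pingable(host):
    """Return True if the host answers at least one of three ICMP pings."""
    return subprocess.call(
        ["ping", "-c", "3", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def real_stonith():
    """Placeholder: the real agent would invoke the IPMI fencing command
    here (e.g. the stock external/ipmi plugin). For this sketch we
    simulate the failure case under discussion."""
    return False

def fence_or_assume_dead():
    if real_stonith():
        return True                      # normal case: fencing worked
    # Fencing failed: only pretend success if the node is unreachable
    # on *every* known path -- public, private bonded, and IPMI.
    for name, addr in ADDRESSES.items():
        if pingable(addr):
            print("node still answers on %s (%s); refusing to lie"
                  % (name, addr), file=sys.stderr)
            return False                 # node may be alive: report failure
    # All paths dead: accept the (small) split-brain risk and report
    # the fence as successful, exactly as Dan describes.
    print("all paths unreachable; assuming node is dead", file=sys.stderr)
    return True

if __name__ == "__main__":
    sys.exit(0 if fence_or_assume_dead() else 1)

[Note the asymmetry: a single successful ping on any path vetoes the "assume dead" shortcut, which is what keeps this from being an automatic split-brain generator when, say, only the public cable was pulled.]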